Title: Identifying All ε-Best Arms in (Misspecified) Linear Bandits
URL Source: https://arxiv.org/html/2510.00073
License: arXiv.org perpetual non-exclusive license. arXiv:2510.00073v1 [stat.ML] 29 Sep 2025

Identifying All ε-Best Arms in (Misspecified) Linear Bandits
Zhekai Li, Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, zhekaili@mit.edu
Tianyi Ma, School of Operations Research and Information Engineering, Cornell University, tm693@cornell.edu
Cheng Hua, Antai College of Economics and Management, Shanghai Jiao Tong University, cheng.hua@sjtu.edu.cn
Ruihao Zhu, SC Johnson College of Business, Cornell University, ruihao.zhu@cornell.edu
Abstract

Motivated by the need to efficiently identify multiple candidates in high trial-and-error cost tasks such as drug discovery, we propose a near-optimal algorithm to identify all ε-best arms (i.e., those at most ε worse than the optimum). Specifically, we introduce LinFACT, an algorithm designed to optimize the identification of all ε-best arms in linear bandits. We establish a novel information-theoretic lower bound on the sample complexity of this problem and demonstrate that LinFACT achieves instance optimality by matching this lower bound up to a logarithmic factor. A key ingredient of our proof is to integrate the lower bound directly into the scaling process for upper bound derivation, determining the termination round and thus the sample complexity. We also extend our analysis to settings with model misspecification and generalized linear models. Numerical experiments, including synthetic and real drug discovery data, demonstrate that LinFACT identifies more promising candidates with reduced sample complexity, offering significant computational efficiency and accelerating early-stage exploratory experiments.
Keywords: ranking and selection; sequential decision making; simulation; adaptive experiment; model misspecification
1 Introduction
This paper addresses the problem of identifying the best set of options from a finite pool of candidates. A decision-maker sequentially selects candidates for evaluation, observing independent noisy rewards that reflect their quality. The goal is to strategically allocate measurement efforts to identify the desired candidates. This problem belongs to the class of pure exploration problems, which fall under the bandit framework but differ from traditional multi-armed bandits (MABs) that balance exploration and exploitation to minimize cumulative regret. Instead, pure exploration focuses on efficient information gathering to confidently apply the chosen best or best set of options. This approach is particularly relevant in applications such as drug discovery and product testing, where identifying the most promising candidates is followed by utilizing them under high-cost conditions, such as clinical trials or large-scale manufacturing tests.
Conventional pure exploration focuses on identifying the optimal candidate, often referred to as the best arm in the bandit setting. However, in many real-world scenarios, candidates with rewards falling slightly below the optimum may later demonstrate advantageous traits, such as fewer side effects, simpler manufacturing processes, or lower resistance during implementation. Motivated by this insight, this paper aims to identify all ε-best candidates (i.e., those whose performance is at most ε worse than the optimum). This approach is especially valuable when exploring a range of nearly optimal options is necessary. Promoting multiple promising candidates not only mitigates risk but also increases the chances that at least one will prove successful.
This setting captures many applications in real-world scenarios. For example:
- Drug Discovery: In drug discovery, pharmaceutical companies aim to identify as many promising drug candidates as possible during preclinical stages. These candidates, known as preclinical candidate compounds (PCCs), are optimized compounds prepared for preclinical testing to assess efficacy, safety, and pharmacokinetics before advancing to clinical trials. Given the inherent risks and high failure rates in subsequent drug development (Das et al. 2021), starting with a larger pool of potential candidates increases the chances of identifying at least one successful, marketable drug.
- Assortment Design: In e-commerce (Boyd and Bilegan 2003, Elmaghraby and Keskinocak 2003, Feng et al. 2022), recommender systems (Huang et al. 2007, Peukert et al. 2023), and streaming services (Alaei et al. 2022, Godinho de Matos et al. 2018), expanding the consideration set (e.g., products, movies, or songs) can improve user satisfaction and increase revenues. Offering a diverse range of recommendations helps cater to varying tastes and preferences.
- Automatic Machine Learning: In automatic machine learning (AutoML) (Thornton et al. 2013), the goal is to automate the process of selecting algorithms and tuning hyperparameters by providing multiple promising choices for predictive tasks. Due to the randomness of limited data, the best out-of-sample model may not always be optimal. Therefore, providing users with a diverse set of models that yield good results is critical to assisting them in selecting the best algorithm and hyperparameters.
1.1 Main Contributions
This paper focuses on identifying all ε-best arms in the linear bandit setting and presents contributions in both the algorithmic and theoretical dimensions.
- δ-Probably Approximately Correct (PAC) Algorithm: On the algorithmic front, we introduce LinFACT (Linear Fast Arm Classification with Threshold estimation), a δ-PAC (see Section 2.1 for a formal definition) phase-based algorithm for identifying all ε-best arms in linear bandit problems. LinFACT demonstrates superior effectiveness compared to existing pure exploration algorithms.
- Matching Bound of Sample Complexity: We make two key technical contributions. First, we derive an information-theoretic lower bound on the problem complexity for identifying all ε-best arms in linear bandits. To the best of our knowledge, this is the first such result in the literature. Second, we establish two distinct upper bounds on the expected sample complexity of LinFACT, illustrating the differences between various optimal design criteria used in its implementation. Notably, we demonstrate that LinFACT achieves instance optimality when using the 𝒳𝒴-optimal design criterion, matching the lower bound up to a logarithmic factor. The 𝒳𝒴-optimal design focuses on contrasting pairs of arms rather than evaluating each arm individually. Our analysis leverages the lower bound directly in defining the classification termination round and in scaling the upper bound.
- Accounting for Misspecified Models and GLMs: We extend our framework beyond linear models to handle misspecified linear bandits and generalized linear models (GLMs). For both cases, we provide theoretical upper bounds on the expected sample complexity. Furthermore, we analyze how prior knowledge of model misspecification impacts the algorithmic upper bounds and performance, and how the incorporation of GLMs influences the sample complexity.
- Numerical Studies with Real-World Datasets: We conduct extensive numerical experiments to demonstrate that our LinFACT algorithm outperforms existing methods in terms of sample complexity, computational efficiency, and reliable identification of all ε-best arms. In experiments with synthetic data, LinFACT outperforms other baselines in both adaptive and static settings. Using a real-world drug discovery dataset (Free and Wilson 1964), we further show that LinFACT achieves superior performance compared to previous algorithms. Notably, LinFACT is computationally efficient with time complexity O(K d²), which is lower than the O(M K d²) of Lazy TTS (M is a non-negligible number) (Rivera and Tewari 2024), the O(K d³) of top-m algorithms (Réda et al. 2021a), and the O(K² log K) of KGCB (Negoescu et al. 2011).
1.2 Related Literature
Pure Exploration. The multi-armed bandits (MABs) model has been a critical paradigm for addressing the exploration-exploitation trade-off since its introduction by Thompson (1933) in the context of medical trials. While much of the research focuses on minimizing cumulative regret (Bubeck et al. 2012, Lattimore and Szepesvári 2020), our work focuses on the pure exploration setting (Koenig and Law 1985), where the goal is to select a subset of arms and evaluation is based on the final outcome. This distinction highlights the context-specific benefits of each approach: MABs are suited for tasks where the goal is to optimize rewards in real-time, balancing exploration and exploitation, whereas pure exploration is focused on identifying a set of satisfactory arms, without the immediate concern for reward maximization. In pure exploration, the algorithm prioritizes information gathering over reward collection, transforming the objective from reward-centric to information-centric. The focus is on efficiently acquiring sufficient information about all arms for confident identification.
The origins of pure exploration problems date back to the 1950s in the context of stochastic simulation, specifically within ordinal optimization (Shin et al. 2018, Shen et al. 2021) or the Ranking and Selection (R&S) problem, first addressed by Bechhofer (1954). Various methodologies have since been developed to solve the canonical R&S problem, including elimination-type algorithms (Kim and Nelson 2001, Bubeck et al. 2013, Fan et al. 2016), Optimal Computing Budget Allocation (OCBA) (Chen et al. 2000), knowledge-gradient algorithms (Frazier et al. 2008, 2009, Ryzhov et al. 2012), UCB-type algorithms (Kaufmann and Kalyanakrishnan 2013), and the unified gap-based exploration (UGapE) algorithm (Gabillon et al. 2012). Comprehensive reviews of the R&S problem can be found in Kim and Nelson (2006), Hong et al. (2021), with the most recent overview provided by Li et al. (2024).
The general framework of pure exploration encompasses various exploration tasks (Qin and You 2025), including best arm identification (BAI) (Mannor and Tsitsiklis 2004, Even-Dar et al. 2006, Russo 2020, Komiyama et al. 2023, Simchi-Levi et al. 2024), top-m identification (Kalyanakrishnan and Stone 2010, Kalyanakrishnan et al. 2012), threshold bandits (Locatelli et al. 2016, Abernethy et al. 2016), and satisficing bandits (Feng et al. 2025). In applications such as drug discovery, pharmacologists aim to identify a set of highly potent drug candidates from potentially millions of compounds, with only the selected candidates advancing to more extensive testing. Given the uncertainty of final outcomes and the high cost of trial-and-error, identifying multiple promising candidates simultaneously is crucial. To minimize the cost of early-stage exploration, adaptive, sequential experimental designs are necessary, as they require fewer experiments compared to fixed designs.
All ε-Best Arms Identification. Conventional objectives, such as identifying the top-m best arms or all arms above a certain threshold, often face significant challenges. In the top-m task, selecting a small m may exclude promising candidates, while choosing a large m may include ineffective options, requiring an impractically large number of experiments. Similarly, setting a threshold too high can exclude viable candidates. Both approaches depend on prior knowledge of the problem to achieve good performance, which may not be available in real-world applications.
In contrast, identifying all ε-best arms (those within ε of the best arm) overcomes these limitations. This approach promotes broader exploration while providing a robust guarantee: no significantly suboptimal arms will be selected, thereby improving the reliability of downstream tasks (Mason et al. 2020). The all ε-best arms identification problem generalizes both the top-m and threshold bandit problems. It reduces to the top-m problem if the number of ε-best arms is known in advance, and to a threshold bandit problem if the value of the best arm is known.
Mason et al. (2020) introduced the problem complexity for identifying all ε-best arms and derived a lower bound in the low-confidence regime. However, their lower bound involves a summation that may be unnecessary, indicating room for improvement. Building on Mason's work, Al Marjani et al. (2022) derived tighter lower bounds by fully characterizing the alternative bandit instances that an optimal sampling strategy must distinguish and eliminate. They also proposed the asymptotically optimal Track-and-Stop algorithm. However, both Mason et al. (2020) and Al Marjani et al. (2022) consider stochastic bandits without structure. In contrast, we study this problem in the linear bandit setting (Abe and Long 1999), which leverages structural relationships among arms. This presents new challenges but also allows us to handle more complex scenarios. As a result, our work establishes the first information-theoretic lower bound for identifying all ε-best arms in linear bandits, applicable to any δ-PAC algorithm.
An extended literature review of misspecified linear bandits and generalized linear bandit models can be found in Section EC.1 of the online appendix.
2 Problem Formulation
Notations. In this paper, we denote the set of positive integers up to n by [n] := {1, …, n}. Vectors and matrices are represented using boldface notation. The inner product of two vectors is denoted by ⟨·, ·⟩. We define the weighted matrix norm ‖x‖_A := √(x^⊤ A x), where A is a positive semi-definite matrix that weights and scales the norm. For two probability measures P and Q over a common measurable space, the Kullback-Leibler divergence between P and Q is defined as

KL(P, Q) := ∫ log(dP/dQ) dP if P ≪ Q, and KL(P, Q) := ∞ otherwise, (1)

where dP/dQ is the Radon-Nikodym derivative of P with respect to Q, and P ≪ Q indicates that P is absolutely continuous with respect to Q.
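Since the lower-bound analysis later specializes to unit-variance Gaussian rewards, the divergence in (1) then admits a closed form; a minimal Python sketch (the function name is ours):

```python
def kl_gaussian_unit_var(mu_p: float, mu_q: float) -> float:
    """KL divergence between N(mu_p, 1) and N(mu_q, 1).

    For unit-variance Gaussians, the integral in equation (1)
    evaluates to (mu_p - mu_q)^2 / 2.
    """
    return 0.5 * (mu_p - mu_q) ** 2

# Arms whose means differ by 0.4 are hard to distinguish: small divergence.
print(kl_gaussian_unit_var(1.0, 0.6))  # ~0.08
```

The quadratic dependence on the mean gap is what makes close-to-threshold arms expensive to classify.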
Setting. We address the problem of identifying all ε-best arms from a finite set of K arms, where K is a (possibly large) positive integer. Each arm i ∈ [K] has an associated reward distribution with an unknown fixed mean μ_i. Let the mean vector of all arms be denoted as μ := (μ_1, μ_2, …, μ_K), which can only be estimated through bandit feedback from the selected arms. Without loss of generality, we assume μ_1 > μ_2 ≥ … ≥ μ_K. The gap Δ_i := μ_1 − μ_i (for i ≠ 1) represents the difference in expected rewards between the optimal arm and arm i. To this end, we give a formal definition of the notion of ε-best arm.
Definition 2.1 (ε-Best Arm). Given ε ≥ 0, an arm i is called ε-best if μ_i ≥ μ_1 − ε.
Here, we adopt an additive framework to define ε-best arms. There also exists a multiplicative counterpart, where an arm i is considered ε-best if μ_i ≥ (1 − ε) μ_1. While our study focuses on the additive model, the analysis for the multiplicative model follows similar reasoning. We denote the set of all ε-best arms for a mean vector μ as

G_ε(μ) := {i : μ_i ≥ μ_1 − ε}. (2)
Define α_ε := min_{i ∈ G_ε(μ)} (μ_i − (μ_1 − ε)) as the distance from the smallest ε-best arm to the threshold μ_1 − ε. Furthermore, if the complement of G_ε(μ), denoted G_ε^c(μ), is non-empty, we define β_ε := min_{i ∈ G_ε^c(μ)} ((μ_1 − ε) − μ_i) as the closest distance from the threshold to the highest mean value of any arm that is not ε-best.
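To make these definitions concrete, the set G_ε(μ) and the margins α_ε, β_ε can be computed directly from a mean vector (a minimal sketch; the function names are ours):

```python
def eps_best_set(mu, eps):
    """Indices of all eps-best arms: {i : mu_i >= mu_1 - eps} (equation (2))."""
    threshold = max(mu) - eps
    return {i for i, m in enumerate(mu) if m >= threshold}

def margins(mu, eps):
    """Margins alpha_eps and beta_eps around the threshold mu_1 - eps.
    beta_eps is None when every arm is eps-best."""
    threshold = max(mu) - eps
    good = eps_best_set(mu, eps)
    alpha = min(mu[i] - threshold for i in good)
    rest = [mu[i] for i in range(len(mu)) if i not in good]
    beta = (threshold - max(rest)) if rest else None
    return alpha, beta

mu = [1.0, 0.9, 0.75, 0.2]
print(eps_best_set(mu, 0.3))        # {0, 1, 2}
alpha, beta = margins(mu, 0.3)      # alpha ~ 0.05, beta ~ 0.5
```

Small α_ε or β_ε means some arm sits close to the threshold, which is precisely what drives the sample complexity up.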
We study this problem under a linear structure, where the mean values depend on an unknown parameter vector θ ∈ ℝ^d. Each arm i is associated with a feature vector a_i ∈ ℝ^d. Let 𝒜 ⊂ ℝ^d be the set of feature vectors, and let 𝐗 := [a_1, a_2, …, a_K]^⊤ ∈ ℝ^{K×d} be the feature matrix. With parameter θ, the mean vector can be represented as μ(θ). For simplicity, we will consistently refer to the bandit instance as μ. When an arm A_t with corresponding feature vector a_{A_t} ∈ 𝒜 is selected at time t, we observe the bandit feedback X_t, given by

X_t := a_{A_t}^⊤ θ + η_t, (3)

where μ_{A_t} := a_{A_t}^⊤ θ is the true mean reward of the selected arm, and η_t is a noise variable. We also make two additional standard assumptions on the norm of the feature vectors and the noise distribution (Abbasi-Yadkori et al. 2011).

Assumption 1. Assume max_{i ∈ [K]} ‖a_i‖_2 ≤ L_1, where ‖·‖_2 denotes the ℓ₂-norm and L_1 is a constant.
Assumption 2. The noise η_t is conditionally 1-sub-Gaussian, i.e., for any λ ∈ ℝ,

𝔼[exp(λ η_t) | a_{A_1}, …, a_{A_{t−1}}, η_1, …, η_{t−1}] ≤ exp(λ²/2). (4)

2.1 Probably Approximately Correct Algorithm Framework
Our goal is to identify all ε-best arms with high confidence while minimizing the sampling budget. To achieve this, we employ three main components: a stopping rule, a sampling rule, and a decision rule.
At each time step t, the stopping rule τ_δ determines whether to continue or stop the process. If the process continues, an arm is selected according to the sampling rule, and the corresponding random reward is observed. When the process stops at t = τ_δ, a decision rule provides an estimate Î_{τ_δ} of the true solution set ℐ*(μ), which in our problem is the set of all ε-best arms, G_ε(μ).
We define the set of all viable mean vectors μ as

ℳ := {μ ∈ ℝ^K | ∃ θ ∈ ℝ^d : μ = 𝐗θ and ‖a_i‖_2 ≤ L_1 for each i ∈ [K]}. (5)

Here, the set ℳ consists of all possible mean vectors μ that can be expressed as a linear transformation of a parameter vector θ through the feature matrix 𝐗.
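The linear-structure requirement μ = 𝐗θ in (5) can be checked numerically with a least-squares residual (an illustrative sketch; the helper name and the toy feature matrix are ours):

```python
import numpy as np

def in_model_class(mu, X, tol=1e-9):
    """Check the linear-structure part of membership in M (equation (5)):
    does there exist theta with mu = X @ theta?"""
    theta, *_ = np.linalg.lstsq(X, mu, rcond=None)
    realizable = bool(np.linalg.norm(X @ theta - mu) <= tol)
    return realizable, theta

# K = 3 arms with d = 2 features: the third arm's mean is forced
# to equal the sum of the first two.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
ok, theta = in_model_class(np.array([0.3, 0.4, 0.7]), X)   # consistent
bad, _ = in_model_class(np.array([0.3, 0.4, 1.0]), X)      # inconsistent
```

The point of the restriction is that, unlike in unstructured bandits, not every mean vector is admissible: the linear structure couples the arms' means.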
We focus on algorithms that are probably approximately correct with high confidence, referred to as δ-PAC algorithms.
Definition 2.2 (δ-PAC Algorithm). An algorithm is δ-PAC for all ε-best arms identification if it identifies the correct solution set with probability at least 1 − δ for any problem instance with mean μ ∈ ℳ, i.e.,

ℙ_μ(τ_δ < ∞, Î_{τ_δ} = G_ε(μ)) ≥ 1 − δ, ∀ μ ∈ ℳ. (6)
Upon stopping, δ-PAC algorithms ensure the identification of all ε-best arms with high confidence. Therefore, our goal is to design a δ-PAC algorithm that minimizes the stopping time, formulated as the following optimization problem:

min 𝔼_μ[τ_δ] (7)

s.t. ℙ_μ(τ_δ < ∞, Î_{τ_δ} = G_ε(μ)) ≥ 1 − δ, ∀ μ ∈ ℳ. (8)

2.2 Optimal Design of Experiment
Linear bandit algorithms can be viewed as an online, adaptive counterpart to the classical optimal design problem. This section develops the key theoretical foundations, emphasizing the confidence bounds for parameter estimation that guide both the design of our algorithm and the subsequent analysis.
Ordinary Least Squares. Consider a sequence of pulled arms A_1, A_2, …, A_t and the corresponding observed rewards X_1, X_2, …, X_t. If the feature vectors of these arms, a_{A_1}, a_{A_2}, …, a_{A_t}, span ℝ^d, the ordinary least squares (OLS) estimator of the parameter θ is given by

θ̂_t := V_t^{−1} Σ_{s=1}^t a_{A_s} X_s, (9)

where V_t := Σ_{s=1}^t a_{A_s} a_{A_s}^⊤ ∈ ℝ^{d×d} represents the information matrix. Using the properties of sub-Gaussian random variables, we can derive a confidence bound for the OLS estimator. This bound, denoted B_{t,δ}, is detailed in Proposition 2.3. The confidence region for the parameter θ at time step t is given by

𝒞_{t,δ} := {θ′ : ‖θ̂_t − θ′‖_{V_t} ≤ B_{t,δ}}. (10)

Proposition 2.3 (Lattimore and Szepesvári (2020)). For any fixed sampling policy and any given vector a ∈ ℝ^d, with probability at least 1 − δ, the following holds:

|a^⊤(θ̂_t − θ)| ≤ ‖a‖_{V_t^{−1}} B_{t,δ}, (11)

where the anytime confidence bound B_{t,δ} is given by B_{t,δ} := 2√(2 (d log 6 + log(1/δ))).
In many practical scenarios, the observed data are not predetermined. To handle this, a martingale-based method can be employed, as described by Abbasi-Yadkori et al. (2011), to define an adaptive confidence bound for the OLS estimator. This accounts for the variability introduced by random rewards and adaptive sampling policies. The confidence interval in Proposition 2.3 highlights the connection between arm allocation policies in linear bandits and experimental design theory (Pukelsheim 2006). This connection serves as a fundamental component in constructing our algorithm.
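The OLS estimate in (9) and the fixed-design width in (11) can be sketched numerically as follows (illustrative; the simulated instance, seed, and function names are ours):

```python
import numpy as np

def ols_estimate(features, rewards):
    """OLS estimator of theta from pulled-arm features and rewards (equation (9))."""
    V_t = features.T @ features                    # information matrix V_t
    theta_hat = np.linalg.solve(V_t, features.T @ rewards)
    return theta_hat, V_t

def confidence_width(a, V_t, d, delta):
    """Fixed-design width in (11): ||a||_{V_t^{-1}} * B_{t,delta},
    with B_{t,delta} = 2 * sqrt(2 * (d * log 6 + log(1/delta)))."""
    B = 2.0 * np.sqrt(2.0 * (d * np.log(6.0) + np.log(1.0 / delta)))
    return float(np.sqrt(a @ np.linalg.solve(V_t, a))) * B

rng = np.random.default_rng(0)
theta = np.array([0.8, -0.3])
A = rng.normal(size=(200, 2))            # features of 200 pulled arms
X = A @ theta + rng.normal(size=200)     # rewards with N(0, 1) noise, as in (3)
theta_hat, V = ols_estimate(A, X)
width = confidence_width(np.array([1.0, 0.0]), V, d=2, delta=0.05)
```

The width shrinks as the information matrix V_t grows, which is the mechanism the phased algorithm later exploits.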
Feature Vector Projection. At any time step where an estimate must be formed after sampling, if the feature vectors of the sampled arms do not span ℝ^d, we substitute them with dimensionality-reduced feature vectors (Yang and Tan 2022). Specifically, we project all feature vectors onto the subspace spanned by the sampled feature vectors. Let 𝐏 ∈ ℝ^{d×d′} be an orthonormal basis of this subspace, where d′ < d is the dimension of the subspace. The new feature vector is then given by a′ := 𝐏^⊤ a. In this transformation, 𝐏𝐏^⊤ is a projection matrix, ensuring

⟨θ, a⟩ = ⟨θ, 𝐏𝐏^⊤ a⟩ = ⟨𝐏^⊤ θ, 𝐏^⊤ a⟩ = ⟨θ′, a′⟩. (12)

Equation (12) ensures that the mean values of all arms remain unchanged under the projection. The first equality holds because 𝐏𝐏^⊤ is a projection matrix and a lies in the subspace spanned by 𝐏 (i.e., 𝐏𝐏^⊤ a = a). The second equality follows by writing both inner products in the common matrix form θ^⊤ 𝐏𝐏^⊤ a. The third equality holds by definition of θ′ := 𝐏^⊤ θ and a′ := 𝐏^⊤ a.
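The identity (12) can be verified numerically (a sketch, assuming the sampled feature vectors are linearly independent so the SVD basis spans exactly their row space; the toy vectors are ours):

```python
import numpy as np

def project_features(features):
    """Orthonormal basis P for the span of the sampled feature vectors,
    plus the reduced vectors a' = P^T a (cf. equation (12))."""
    P = np.linalg.svd(features.T, full_matrices=False)[0]  # d x d' basis
    return P, features @ P

theta = np.array([1.0, -2.0, 0.5])
A = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0]])   # spans only a 2-dim subspace of R^3
P, A_reduced = project_features(A)
theta_reduced = P.T @ theta

# The mean of every sampled arm is preserved: <theta, a> = <theta', a'>.
print(A @ theta, A_reduced @ theta_reduced)
```

Note that θ itself need not lie in the subspace; the identity only requires 𝐏𝐏^⊤ a = a for the sampled features.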
Optimal Design Criteria. In contrast to stochastic bandits, where the mean values of the arms are estimated through repeated sampling of each arm, the linear bandit setting allows these values to be inferred from accurate estimation of the underlying parameter vector θ. As a result, pulling a single arm provides information about all arms.
A key sampling strategy in this context is the G-optimal design, which minimizes the maximum variance of the predicted responses across all arms by optimizing the fraction of times each arm is selected. Formally, the G-optimal design problem seeks a probability distribution π on 𝒜, where π : 𝒜 → [0, 1] and Σ_{a ∈ 𝒜} π(a) = 1, that minimizes

g(π) := max_{a ∈ 𝒜} ‖a‖²_{V(π)^{−1}}, (13)

where V(π) := Σ_{a ∈ 𝒜} π(a) a a^⊤ is the weighted information matrix, analogous to V_t in equation (9). The G-optimal design (13) ensures a tight confidence interval for mean value estimation. However, for identifying the best arms, comparing the relative differences of the mean values across arms is more critical than estimating each mean as accurately as possible.
Therefore, we consider an alternative design criterion, the 𝒳𝒴-optimal design, which directly targets the estimation of these gaps. Consider 𝒮 ⊆ 𝒜 as a subset of the arm space. We define

𝒴(𝒮) := {a − a′ : a, a′ ∈ 𝒮, a ≠ a′} (14)

as the set of vectors representing the differences between each pair of arms in 𝒮. The 𝒳𝒴-optimal design minimizes

g_{𝒳𝒴}(π) := max_{y ∈ 𝒴(𝒜)} ‖y‖²_{V(π)^{−1}}. (15)

As mentioned previously, the 𝒳𝒴-optimal design focuses on minimizing the maximum variance when estimating the differences (gaps) between pairs of arms. By doing so, it ensures differentiation between arms, rather than estimating each arm individually. This criterion is particularly useful when the goal is to identify relative performance rather than absolute quality.
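Both criteria can be evaluated for a candidate design π (a sketch; finding the optimal π is a further convex optimization, e.g., via Frank-Wolfe, which we do not attempt here):

```python
import itertools
import numpy as np

def design_matrix(arms, pi):
    """Weighted information matrix V(pi) = sum_a pi(a) a a^T."""
    return sum(p * np.outer(a, a) for a, p in zip(arms, pi))

def g_criterion(arms, pi):
    """G-optimal objective (13): max_a ||a||^2 in the V(pi)^{-1} norm."""
    V_inv = np.linalg.inv(design_matrix(arms, pi))
    return max(float(a @ V_inv @ a) for a in arms)

def g_xy_criterion(arms, pi):
    """XY-optimal objective (15): the same quantity over arm differences."""
    V_inv = np.linalg.inv(design_matrix(arms, pi))
    diffs = [a - b for a, b in itertools.combinations(arms, 2)]
    return max(float(y @ V_inv @ y) for y in diffs)

arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.2])]
uniform = [1.0 / 3] * 3
print(g_criterion(arms, uniform), g_xy_criterion(arms, uniform))
```

By the Kiefer-Wolfowitz theorem, the optimal G-value equals d, so any design's g(π) is at least d; the 𝒳𝒴 objective instead concentrates measurement effort on directions a − a′ that separate arms.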
3 Lower Bound and Problem Complexity
In this section, we present a novel information-theoretic lower bound for the problem of identifying all ε-best arms in linear bandits. Building on the approach of Soare et al. (2014), we extend the lower bound for best arm identification (BAI) to this more general setting. Figure 1 visualizes the structure of the stopping condition, with additional graphical insights provided in Section EC.4.1. These visualizations offer geometric intuition for the challenges involved in identifying all ε-best arms in linear bandits.
Figure 1: Illustration of the Stopping Condition: Best Arm Identification vs. All ε-Best Arms Identification
Note. (a) Stopping occurs when the confidence region 𝒞_{t,δ} for the estimated parameter θ̂_t contracts entirely within one of the three decision regions at a certain time step t. The boundaries between regions are defined by the hyperplanes x^⊤(a_i − a_j) = 0. Each dot represents an arm. (b) In the case of identifying all ε-best arms, the regions overlap. (c) These overlaps partition the space into seven distinct decision regions, increasing the difficulty of identification.
The sample complexity of an algorithm is quantified by the number of samples, denoted τ_δ, required to terminate the process. The goal of the algorithm design is to minimize the expected sample complexity 𝔼_μ[τ_δ] over the class of admissible algorithms. As introduced in Kaufmann et al. (2016), for δ ∈ (0, 1), the non-asymptotic problem complexity of an instance μ can be defined as

c*(μ) := inf_{δ-PAC algorithms} 𝔼_μ[τ_δ] / log(1/(2.4δ)), (16)

which is the smallest possible constant such that the expected sample complexity 𝔼_μ[τ_δ] grows asymptotically in line with log(1/(2.4δ)). The lower bound on the sample complexity 𝔼_μ[τ_δ] can be represented in a general form by the following proposition. Building on the analytical framework of Proposition 3.1, we formulate the all ε-best arms identification problem in the linear bandit setting. This enables us to derive both lower and upper bounds on the sample complexity, thereby establishing the near-optimality of our algorithm.
Proposition 3.1 (Qin and You (2025)). For any μ ∈ ℳ, there exists a set 𝒳 := 𝒳(μ) and functions {C_x}_{x ∈ 𝒳} with C_x : 𝒮_K × ℳ → ℝ_+ such that

c*(μ) ≥ (Γ*_μ)^{−1}, (17)

where

Γ*_μ := max_{w ∈ 𝒮_K} min_{x ∈ 𝒳} C_x(w; μ). (18)
In Proposition 3.1, 𝒮_K denotes the K-dimensional probability simplex, and 𝒳 := 𝒳(μ) is referred to as the culprit set. This set comprises critical subqueries (or comparisons) that must be correctly resolved: an error in any of these comparisons may prevent identification of the correct set.
For example, in the case of identifying the best arm, the culprit set is given by 𝒳 = {i : i ∈ [K] ∖ {i*}}, where i* denotes the unique best arm, and each subquery involves distinguishing an arm from the best arm. In the threshold bandit problem, the culprit set consists of all arms, 𝒳 = {i : i ∈ [K]}, where each subquery requires accurately determining whether an arm exceeds the threshold. For the task of identifying the best m arms, the culprit set is 𝒳 = {(i, j) : i ∈ ℐ, j ∈ ℐ^c}, where ℐ represents the set of the best m arms, and each subquery entails comparing the mean of an arm in ℐ with one in the complement set ℐ^c.
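The three culprit sets just described can be enumerated directly (a small sketch; function names are ours):

```python
def culprits_bai(K, i_star):
    """Best arm identification: distinguish every other arm from i*."""
    return {i for i in range(K) if i != i_star}

def culprits_threshold(K):
    """Threshold bandits: every arm must be placed relative to the threshold."""
    return set(range(K))

def culprits_top_m(K, top_set):
    """Top-m identification: compare each top arm with each non-top arm."""
    return {(i, j) for i in top_set for j in range(K) if j not in top_set}

print(culprits_bai(4, i_star=0))  # {1, 2, 3}
print(culprits_top_m(4, top_set={0, 1}))
```

The size and shape of the culprit set is what changes across tasks; the min over culprits in (18) is always taken over this enumeration.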
The function C_x(w; μ) represents the population version of the sequential generalized likelihood ratio statistic, which provides an information-theoretic measure of how easily the subquery corresponding to the culprit x ∈ 𝒳 can be answered.
In equation (18), the inner minimum reflects the intuition that the hardest instance corresponds to the hardest subquery, while the outer maximization seeks the allocation w over arms that best addresses this subquery. For a more detailed introduction to this general pure exploration model, please refer to Section EC.2. For our setting, we establish the lower bound below.
Theorem 3.2 (Lower Bound). Consider a set of arms where the reward of arm i follows a normal distribution 𝒩(μ_i, 1) with μ_i := a_i^⊤ θ. Any δ-PAC algorithm for identifying all ε-best arms in the linear bandit setting must satisfy

inf_{δ-PAC algorithms} 𝔼_μ[τ_δ] / log(1/(2.4δ)) ≥ (Γ*_μ)^{−1} = min_{w ∈ 𝒮_K} max_{(i,j,k) ∈ 𝒳} max{ 2‖a_i − a_j‖²_{V_w^{−1}} / (a_i^⊤θ − a_j^⊤θ + ε)², 2‖a_1 − a_k‖²_{V_w^{−1}} / (a_1^⊤θ − a_k^⊤θ − ε)² }, (19)

where 𝒳 := {(i, j, k) : i ∈ G_ε(μ), j ∈ [K], k ∉ G_ε(μ)}, 𝒮_K is the K-dimensional probability simplex, and V_w := Σ_{a ∈ 𝒜} w(a) a a^⊤ is the weighted information matrix under allocation w.
The detailed proof of the above theorem is presented in Section EC.4.3. For the lower bound derivation, we assume normally distributed rewards to obtain a closed-form expression. A similar bound can be derived under sub-Gaussian rewards, though the form is less explicit.
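To build intuition for (19), the inner maximization over culprits can be evaluated for a fixed allocation w, and the outer minimization approximated by a coarse grid search on the simplex. The following is an illustrative sketch for a toy instance (the instance, grid resolution, and names are ours, and the grid search is only a crude approximation of the true minimizer):

```python
import numpy as np

def lb_objective(w, arms, theta, eps):
    """Inner maximization of (19): the cost of the hardest culprit (i, j, k)
    under allocation w (illustrative implementation)."""
    mu = [float(a @ theta) for a in arms]
    best = max(mu)
    good = [i for i, m in enumerate(mu) if m >= best - eps]
    bad = [i for i, m in enumerate(mu) if m < best - eps]
    i_star = mu.index(best)
    V_w = sum(p * np.outer(a, a) for a, p in zip(arms, w))
    V_inv = np.linalg.inv(V_w)

    def cost(y, gap):
        return 2.0 * float(y @ V_inv @ y) / gap ** 2

    terms = [cost(arms[i] - arms[j], mu[i] - mu[j] + eps)
             for i in good for j in range(len(arms)) if j != i]
    terms += [cost(arms[i_star] - arms[k], best - mu[k] - eps) for k in bad]
    return max(terms)

# Toy instance: mu = (1.0, 0.5, 0.95); with eps = 0.2, arms 0 and 2 are eps-best.
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.9, 0.1])]
theta = np.array([1.0, 0.5])

# Coarse grid over the simplex approximates the outer min over w.
grid = [np.array([p, q, 1.0 - p - q])
        for p in np.linspace(0.05, 0.9, 18)
        for q in np.linspace(0.05, 0.9, 18) if p + q <= 0.95]
approx_inverse_gamma = min(lb_objective(w, arms, theta, 0.2) for w in grid)
```

Arms whose gaps to the threshold are small dominate the max, so the minimizing allocation concentrates samples in the directions that separate those arms.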
Remark 3.3 (Generality of the Lower Bound). We note that the stochastic multi-armed bandit problem is a special case of the linear bandit problem. By setting 𝒜 = {e_1, e_2, …, e_d}, where e_i denotes the i-th standard unit vector, the linear bandit model reduces to the stochastic setting. This relationship allows us to recover the lower bound result for identifying all ε-best arms in stochastic bandits (Al Marjani et al. 2022). Furthermore, the lower bound in Theorem 3.2 extends the lower bound for best arm identification in linear bandits (Fiez et al. 2019). That result is recovered by setting ε = 0 and redefining the culprit set as 𝒳(μ) = {i : i ∈ [K] ∖ {i*}}, where i* represents the best arm in the context of best arm identification.
4 Algorithm and Upper Bound
In this section, we propose the LinFACT algorithm (Linear Fast Arm Classification with Threshold estimation) to identify all ε-best arms in linear bandits efficiently. We then establish upper bounds on the expected sample complexity to demonstrate the optimality of the LinFACT algorithm. Specifically, the upper bound derived from the 𝒳𝒴-optimal sampling policy is shown to be instance optimal up to logarithmic factors.
4.1 Algorithm
LinFACT is a phase-based, semi-adaptive algorithm in which the sampling rule remains fixed within each round and is updated only at the end of the round based on the accumulated observations. As the algorithm proceeds through rounds r, LinFACT progressively refines two sets of arms:

- G_r: arms empirically classified as ε-best (good).
- B_r: arms empirically classified as not ε-best (bad).

This classification process continues until all arms have been assigned to either G_r or B_r. Once complete, the decision rule returns G_r as the final set of ε-best arms.
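For concreteness, one plausible form of this classification step can be sketched as follows (illustrative only: the thresholds below use generic per-arm confidence bounds, not the exact bounds LinFACT maintains):

```python
def classify(lcb, ucb, eps, good, bad):
    """One round of LinFACT-style classification (illustrative sketch).

    lcb, ucb: per-arm lower/upper confidence bounds on the means.
    An arm is declared eps-best when its lower bound clears the optimistic
    best value minus eps; it is declared not eps-best when its upper bound
    falls below the pessimistic best value minus eps.
    """
    hi = max(ucb)   # optimistic value of the best arm
    lo = max(lcb)   # pessimistic value of the best arm
    for i in range(len(lcb)):
        if i in good or i in bad:
            continue
        if lcb[i] >= hi - eps:      # surely within eps of the best
            good.add(i)
        elif ucb[i] < lo - eps:     # surely more than eps below the best
            bad.add(i)
    return good, bad

# Three arms; with eps = 0.5, arm 0 is provably good, arm 2 provably bad,
# and arm 1 remains undecided for another round.
good, bad = classify(
    lcb=[0.9, 0.55, 0.1], ucb=[1.1, 0.85, 0.3], eps=0.5, good=set(), bad=set())
```

Undecided arms stay active, and the shrinking confidence widths in later rounds eventually resolve them.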
Sampling Rule. To minimize the sampling budget, we select arms that provide the maximum information about the mean values or the gaps between them. Unlike stochastic multi-armed bandits, where mean values are obtained exclusively by sampling specific arms, linear bandits allow these mean values to be inferred from the estimated parameters. In each round, arms are selected based on the G-optimal design (13) or the 𝒳𝒴-optimal design (15).
For the G-optimal design, LinFACT-G refines an estimate of the true parameter θ and uses this estimate to maintain an anytime confidence interval, such that for each arm's empirical mean value μ̂_i, we have

ℙ(∀ i ∈ 𝒜_I(r−1), ∀ r ∈ ℕ: |μ̂_i(r) − μ_i| ≤ C_{δ/K}(r)) ≥ 1 − δ. (20)

The active set 𝒜(r) is defined as the set of uneliminated arms, as we continuously eliminate arms as the round r progresses; 𝒜_I(r) denotes the set of indices corresponding to 𝒜(r).

This confidence bound indicates that the algorithm maintains a probabilistic guarantee that the true mean value μ_i is within a certain range of the estimated mean value μ̂_i for each arm i, uniformly over all rounds. The bound shrinks as more data is collected (since the confidence radius C_{δ/K}(r) decreases with more samples), thereby reducing uncertainty. The anytime confidence width C_{δ/K}(r) is maintained by the design of the sample budget in each round. We set C_{δ/K}(r) := 2^{−r} =: ε_r, which is halved with each iteration of the rounds.
In LinFACT-G, the initial budget allocation policy is based on the G-optimal design and is defined as follows:

$$N_i(r) = \left\lceil \frac{2 d\, \lambda_i(r)}{\epsilon_r^2} \log\!\left(\frac{2 K r (r+1)}{\delta}\right) \right\rceil, \qquad N_r = \sum_{i \in \mathcal{X}(r-1)} N_i(r), \tag{21}$$

where $N_r$ denotes the total sampling budget allocated in round $r$, and $\lambda_r$ is the selection probability distribution over the remaining active arms $\mathcal{X}(r-1)$ from the previous round, obtained via the G-optimal design as defined in equation (13). The sampling procedure for each round $r$ is described in Algorithm 1.
Algorithm 1 Subroutine: G-Optimal Sampling
1: Input: projected active set $\mathcal{X}(r-1)$, round $r$, $\delta$.
2: Obtain $\lambda_r \in \mathcal{P}(\mathcal{X}(r-1))$ with support size $\mathrm{Supp}(\lambda_r) \le d(d+1)/2$ according to equation (13).
3: for all $i \in \mathcal{X}(r-1)$ do ▷ Sampling
4:   Sample arm $i$ for $N_i(r)$ times in round $r$, as specified in equation (21).
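The G-optimal design in step 2 can be approximated numerically. The following minimal sketch (our own illustration, not the paper's implementation; the Frank-Wolfe solver and all function names are assumptions) computes an approximate G-optimal design over the active arms and the per-arm budgets of equation (21):

```python
import numpy as np

def g_optimal_design(X, iters=2000):
    """Frank-Wolfe approximation of the G-optimal design over the rows of X.

    By the Kiefer-Wolfowitz theorem, the optimal design lam satisfies
    max_i x_i^T A(lam)^{-1} x_i = d, so the max leverage approaches d.
    """
    K, d = X.shape
    lam = np.full(K, 1.0 / K)
    for t in range(iters):
        A = X.T @ (lam[:, None] * X)               # information matrix A(lam)
        A_inv = np.linalg.pinv(A)
        g = np.einsum("ij,jk,ik->i", X, A_inv, X)  # leverage x_i^T A^{-1} x_i
        i = int(np.argmax(g))                      # most uncertain arm direction
        gamma = 1.0 / (t + 2)                      # standard FW step size
        lam = (1 - gamma) * lam
        lam[i] += gamma
    return lam

def round_budget(lam, d, eps_r, K, r, delta):
    """Per-arm sample counts N_i(r) of equation (21)."""
    log_term = np.log(2 * K * r * (r + 1) / delta)
    return np.ceil(2 * d * lam / eps_r**2 * log_term).astype(int)
```

With enough iterations the maximum leverage approaches the dimension $d$, which is how one can sanity-check the returned design.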
We also adopt the $\mathcal{XY}$-optimal design because the G-optimal design is less effective at distinguishing between arms. While the G-optimal design minimizes the maximum variance of the individual arm-mean estimates, it does not explicitly target the pairwise gaps that are critical for identifying good arms. In contrast, the $\mathcal{XY}$-optimal design is tailored to directly reduce the uncertainty in estimating these gaps.
We now introduce a sampling rule based on the $\mathcal{XY}$-optimal design, as defined in equation (15). Let $q(\zeta)$ account for the error introduced by the rounding procedure, where $\zeta$ is its approximation parameter. We have:

$$N_r = \max\left\{ \left\lceil \frac{2\, \rho^{\mathcal{XY}}\!\big(\lambda(\mathcal{X}(r-1))\big)\,(1+\zeta)}{\epsilon_r^2} \log\!\left(\frac{2 K (K-1) r (r+1)}{\delta}\right) \right\rceil,\ q(\zeta) \right\}, \qquad N_i(r) = \mathrm{Round}(\lambda_r, N_r). \tag{22}$$
In contrast to the G-optimal design, the $\mathcal{XY}$-optimal design focuses on bounding the confidence region of the pairwise differences between arms. The following inequality characterizes the corresponding high-probability event:

$$\mathbb{P}\left(\forall i \in \mathcal{X}_I(r-1),\ \forall j \in \mathcal{X}_I(r-1),\ i \ne j,\ \forall r,\ \left|\big(\hat{\mu}_i(r) - \hat{\mu}_j(r)\big) - \big(\mu_i - \mu_j\big)\right| \le 2\, C_{\delta/K}(r)\right) \ge 1 - \delta. \tag{23}$$
The rounding operation, denoted Round, uses the $(1+\zeta)$-approximation algorithm proposed by Allen-Zhu et al. (2017). The complete sampling procedure is outlined in Algorithm 2.
Algorithm 2 Subroutine: $\mathcal{XY}$-Optimal Sampling
1: Input: projected active set $\mathcal{X}(r-1)$, round $r$, $\delta$.
2: Obtain $\lambda_r \in \mathcal{P}(\mathcal{X}(r-1))$ according to equation (15).
3: for all $i \in \mathcal{X}(r-1)$ do ▷ Sampling
4:   Sample arm $i$ for $N_i(r)$ times in round $r$, as specified in equation (22).
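To make the pairwise-gap objective concrete, the sketch below (illustrative only; the solver is a simple Frank-Wolfe-style heuristic of our own choosing, not the paper's rounding-based procedure) evaluates $\rho^{\mathcal{XY}}(\lambda) = \max_{i \ne j} \|x_i - x_j\|^2_{A(\lambda)^{-1}}$ and iteratively adds mass to the arm that most reduces the variance of the current worst gap direction:

```python
import numpy as np
from itertools import combinations

def xy_value(X, lam):
    """rho^XY(lam): worst-case variance over all pairwise gap directions."""
    A_inv = np.linalg.pinv(X.T @ (lam[:, None] * X))
    worst, pair = -np.inf, None
    for i, j in combinations(range(len(X)), 2):
        y = X[i] - X[j]
        v = float(y @ A_inv @ y)
        if v > worst:
            worst, pair = v, (i, j)
    return worst, pair

def xy_optimal_design(X, iters=1000):
    """Frank-Wolfe-style minimization of rho^XY over the design simplex."""
    K = len(X)
    lam = np.full(K, 1.0 / K)
    for t in range(iters):
        A_inv = np.linalg.pinv(X.T @ (lam[:, None] * X))
        _, (i, j) = xy_value(X, lam)
        g = A_inv @ (X[i] - X[j])
        # d/d lam_k ||x_i - x_j||^2_{A^{-1}} = -(x_k^T g)^2: pick the steepest arm.
        k = int(np.argmax((X @ g) ** 2))
        gamma = 1.0 / (t + 2)
        lam = (1 - gamma) * lam
        lam[k] += gamma
    return lam
```

A useful sanity check is that the optimized design attains a $\rho^{\mathcal{XY}}$ value no worse than the uniform design.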
Estimation. At the end of each round, after drawing $N_i(r)$ samples from the active set, we compute the empirical estimate of the parameter using standard ordinary least squares (OLS):

$$\hat{\beta}_r = V_r^{-1} \sum_{t=1}^{N_r} x_{A_t}\, y_t, \tag{24}$$

where $V_r = \sum_{i \in \mathcal{X}(r-1)} N_i(r)\, x_i x_i^\top$ is the information matrix. The estimator for the mean value of each arm $i \in \mathcal{X}_I(r-1)$ is then

$$\hat{\mu}_i = \hat{\mu}_i(r) = x_i^\top \hat{\beta}_r. \tag{25}$$
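The round-level estimation in (24)-(25) can be sketched as follows (a minimal illustration with assumed data layout: one reward array per active arm; function names are ours):

```python
import numpy as np

def ols_round_estimate(X_active, counts, rewards_per_arm):
    """OLS estimate of beta and the arm means from one round of sampling.

    X_active        : (n, d) feature matrix of the active arms
    counts          : N_i(r), number of pulls of each active arm
    rewards_per_arm : list of 1-D arrays of observed rewards, one per arm
    """
    d = X_active.shape[1]
    V = np.zeros((d, d))                  # information matrix V_r
    s = np.zeros(d)                       # sum of x_{A_t} y_t over all pulls
    for x, n, ys in zip(X_active, counts, rewards_per_arm):
        V += n * np.outer(x, x)
        s += x * np.sum(ys)
    beta_hat = np.linalg.solve(V, s)      # beta_hat_r = V_r^{-1} sum x y  (24)
    mu_hat = X_active @ beta_hat          # mu_hat_i = x_i^T beta_hat_r    (25)
    return beta_hat, mu_hat
```

With noiseless rewards the estimate recovers the true parameter exactly, provided the pulled arms span the feature space.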
Stopping Rule and Decision Rule. As round $r$ progresses, LinFACT dynamically updates two sets of arms, $G_r$ and $B_r$, representing arms that are empirically considered $\epsilon$-best (good) and those that are not (bad), respectively. The algorithm filters arms by maintaining an upper confidence bound $U_r$ and a lower confidence bound $L_r$ on the unknown threshold $\mu_1 - \epsilon$, along with individual upper and lower confidence bounds for each arm.

The stopping rule and final decision procedure are described in Algorithm 3. For each arm $i$ in the active set, LinFACT eliminates the arm if its upper confidence bound falls below the threshold lower bound $L_r$ (line 6). Conversely, if the lower confidence bound of an arm exceeds $U_r$ (line 8), the arm is added to $G_r$. Additionally, any arm already in $G_r$ is removed from the active set if its upper confidence bound falls below the largest empirical lower bound among all active arms (line 10); this ensures that the best arm is always retained in the active set, which is necessary for estimating the threshold $\mu_1 - \epsilon$. The classification process continues until all arms are categorized, that is, when $G_r \cup B_r = [K]$. At termination, the set $G_r$ is returned as the output of LinFACT, representing the arms identified as $\epsilon$-best.
Algorithm 3 Subroutine: Stopping Rule and Decision Rule
1: Input: projected active set $\mathcal{X}_I(r-1)$, estimator $(\hat{\mu}_i(r))_{i \in \mathcal{X}_I(r-1)}$, round $r$, $\epsilon$, confidence radius $C_{\delta/K}(r)$.
2: Let $U_r = \max_{i \in \mathcal{X}_I(r-1)} \hat{\mu}_i + C_{\delta/K}(r) - \epsilon$.
3: Let $L_r = \max_{i \in \mathcal{X}_I(r-1)} \hat{\mu}_i - C_{\delta/K}(r) - \epsilon$.
4: for all $i \in \mathcal{X}_I(r-1)$ do ▷ Arm Classification and Elimination
5:   if $\hat{\mu}_i + C_{\delta/K}(r) < L_r$ then
6:     Add $i$ to $B_r$ and eliminate $i$ from $\mathcal{X}_I(r-1)$.
7:   if $\hat{\mu}_i - C_{\delta/K}(r) > U_r$ then
8:     Add $i$ to $G_r$.
9:   if $i \in G_r$ and $\hat{\mu}_i + C_{\delta/K}(r) \le \max_{j \in \mathcal{X}_I(r-1)} \hat{\mu}_j - C_{\delta/K}(r)$ then
10:    Eliminate $i$ from $\mathcal{X}_I(r-1)$.
11: if $G_r \cup B_r = [K]$ then ▷ Stopping Condition and Recommendation
12:   Output: the set $G_r$.
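One pass of this classification logic can be sketched as follows (an illustrative implementation under assumed bookkeeping: a dict of empirical means and Python sets for $G_r$/$B_r$; the function name is ours):

```python
def classify_round(mu_hat, C, eps, active, good, bad, K):
    """One pass of the stopping/decision subroutine (Algorithm 3).

    mu_hat : dict arm index -> empirical mean; C : radius C_{delta/K}(r);
    active : iterable of still-active arm indices. Mutates good/bad in place
    and returns (new_active, finished).
    """
    U = max(mu_hat[i] + C for i in active) - eps   # U_r, upper bound on mu_1 - eps
    L = max(mu_hat[i] - C for i in active) - eps   # L_r, lower bound on mu_1 - eps
    best_lcb = L + eps                             # largest lower bound among active arms
    new_active = []
    for i in active:
        if mu_hat[i] + C < L:                      # line 6: provably not eps-best
            bad.add(i)
            continue
        if mu_hat[i] - C > U:                      # line 8: provably eps-best
            good.add(i)
        if i in good and mu_hat[i] + C <= best_lcb:  # line 10: dominated good arm
            continue
        new_active.append(i)
    finished = (good | bad) == set(range(K))       # stopping condition G_r u B_r = [K]
    return new_active, finished
```

For example, with means $\{1.0, 0.95, 0.2\}$, radius $0.05$, and $\epsilon = 0.3$, the first two arms are certified good and the third bad, so the stopping condition fires.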
The Complete LinFACT Algorithm. The complete LinFACT algorithm is presented in Algorithm 4. The procedure proceeds as follows: based on the collected data, the decision-maker updates the parameter estimates and checks whether the stopping condition is satisfied. If so, the set $G_r$ is returned as the estimated set of all $\epsilon$-best arms; otherwise, the process continues with further sampling and updates.
Algorithm 4 LinFACT Algorithm
1: Input: $\epsilon$, $\delta$, bandit instance.
2: Initialize $G_0 = \emptyset$, the set of good arms, and $B_0 = \emptyset$, the set of bad arms.
3: Initialize the active set $\mathcal{X}(0) = \mathcal{X}$, and $\mathcal{X}_I(0) = [K]$.
4: for $r = 1, 2, \ldots$ do
5:   Set $C_{\delta/K}(r) = \epsilon_r = 2^{-r}$.
6:   Set $G_r = G_{r-1}$ and $B_r = B_{r-1}$.
7:   Project $\mathcal{X}(r-1)$ to the $d_r$-dimensional subspace that $\mathcal{X}(r-1)$ spans. ▷ Projection
8:   if using G-optimal sampling then ▷ Sampling
9:     Call Algorithm 1.
10:  else if using $\mathcal{XY}$-optimal sampling then
11:    Call Algorithm 2.
12:  Estimate $(\hat{\mu}_i(r))_{i \in \mathcal{X}_I(r-1)}$ using equations (24) and (25). ▷ Estimation
13:  Call Algorithm 3. ▷ Stopping Condition and Decision Rule

4.2 Upper Bounds of the LinFACT Algorithm
Theorems 4.1 and 4.2 establish upper bounds on the sample complexity of the proposed LinFACT algorithm. Let $T^G$ and $T^{\mathcal{XY}}$ denote the number of samples required under the G-optimal and $\mathcal{XY}$-optimal designs, respectively. The formal statements of these theorems are given below.
Theorem 4.1 (Upper Bound, G-Optimal Design)
For $\Delta = \min(\alpha_\epsilon, \beta_\epsilon)/16$, there exists an event $\mathcal{E}$ such that $\mathbb{P}(\mathcal{E}) \ge 1 - \delta$. On this event, the LinFACT algorithm with the G-optimal sampling policy achieves an expected sample complexity upper bound given by

$$\mathbb{E}\left[T^G \mid \mathcal{E}\right] = \mathcal{O}\!\left( d\, \Delta^{-2} \log\!\left(\frac{K}{\delta} \log_2(\Delta^{-2})\right) + d^2 \log(\Delta^{-1}) \right). \tag{26}$$
The detailed proof of Theorem 4.1 is presented in Section EC.5. However, the LinFACT algorithm based on the G-optimal design does not yield an upper bound that matches the lower bound; this limitation is discussed further in Section EC.6. In contrast, we will show that the algorithm using the $\mathcal{XY}$-optimal design achieves an upper bound that matches the lower bound up to a logarithmic factor.
Theorem 4.2 (Upper Bound, $\mathcal{XY}$-Optimal Design)
Assume that the instance of arms satisfies $\min_{i \in G_\epsilon \setminus \{1\}} \|x_1 - x_i\|_2^2 \ge \delta^2$ and $\max_{i \in [K]} |\mu_1 - \epsilon - \mu_i| \le 2$. There exists an event $\mathcal{E}$ such that $\mathbb{P}(\mathcal{E}) \ge 1 - \delta$. On this event, the LinFACT algorithm with the $\mathcal{XY}$-optimal sampling policy achieves an expected sample complexity upper bound given by

$$\mathbb{E}\left[T^{\mathcal{XY}} \mid \mathcal{E}\right] = \mathcal{O}\!\left( (\Gamma^*)^{-1}\, \epsilon\, \Delta^{-1} \log(\Delta^{-1}) \log\!\left(\frac{K}{\delta} \log(\Delta^{-2})\right) + d^2 \log(\Delta^{-1}) \right), \tag{27}$$

where $\Delta = \min(\alpha_\epsilon, \beta_\epsilon)/16$ is the minimum gap of the problem instance, $r_{\mathrm{upper}} = \max\{\lceil \log_2 \frac{4}{\alpha_\epsilon} \rceil, \lceil \log_2 \frac{4}{\beta_\epsilon} \rceil\}$, and $(\Gamma^*)^{-1}$ is the lower bound term defined in Theorem 3.2.
The proof of the near-optimal upper bound in Theorem 4.2 is presented in Section EC.7 of the online appendix, where we make explicit how the lower bound helps establish our upper bound.
5 Model Misspecification
In this section, we address the challenge of model misspecification, recognizing that real-world problems may deviate from perfect linearity. To account for such deviations, we propose an orthogonal parameterization-based algorithm, LinFACT-MIS, a refined version of LinFACT designed for the misspecified setting. We establish new upper bounds under misspecification and provide insights into how such deviations affect algorithm performance.
Under model misspecification, we refine the linear model in equation (3) as

$$y_t = x_{A_t}^\top \beta + \eta_t + \Delta^m(x_{A_t}), \tag{28}$$

where $\Delta^m : \mathbb{R}^d \to \mathbb{R}$ is a misspecification function quantifying the deviation from the true linear model.
Assumption 5. Assume $\|\mu\|_\infty \le L_\mu$ and $\|\boldsymbol{\Delta}^m\|_\infty \le L_m$, where $\|\cdot\|_\infty$ denotes the infinity norm and the bold $K$-dimensional vector $\boldsymbol{\Delta}^m$ collects the bias terms of the misspecified model.
Therefore, with this assumption, the set of realizable models is defined as

$$\mathbb{M} = \left\{ \mu \in \mathbb{R}^K \;\middle|\; \exists \beta \in \mathbb{R}^d,\ \exists \boldsymbol{\Delta}^m \in \mathbb{R}^K,\ \mu = X\beta + \boldsymbol{\Delta}^m \ \wedge\ \|\mu\|_\infty \le L_\mu \ \wedge\ \|\boldsymbol{\Delta}^m\|_\infty \le L_m \right\}. \tag{29}$$
The key distinction in the analysis under model misspecification lies in how the estimator $\hat{\mu}_t$ is maintained. Specifically, we construct this estimator by projecting the empirical mean vector $\tilde{\mu}_t$ at time $t$ onto the set of realizable models $\mathbb{M}$ via the following optimization:

$$\hat{\mu}_t \in \arg\min_{\mu \in \mathbb{M}} \left\| \mu - \tilde{\mu}_t \right\|^2_{\mathcal{D}_{N_t}}, \tag{30}$$

where $N_t = [N_{t,1}, N_{t,2}, \ldots, N_{t,K}]^\top \in \mathbb{R}^K$ is the vector of sample counts for each arm at time $t$, and $\mathcal{D}_{N_t} \in \mathbb{R}^{K \times K}$ is the diagonal matrix with $N_{t,1}, N_{t,2}, \ldots, N_{t,K}$ as its diagonal entries.
Figure 2:Difference Between Standard OLS and Misspecification-Adjusted Projection Estimates
Note. The left diagram shows the projection onto the span of pulled arms under a perfect linear model. The right diagram depicts the adjustment required under misspecification, where the projection must account for the deviation.
In the absence of misspecification, this projection reduces to the ordinary least squares (OLS) estimator. However, as shown in Figure 2, under model misspecification the estimator can no longer be computed as a simple projection onto a hyperplane and falls outside the scope of standard OLS. Instead, it must be obtained from the optimization problem in equation (30), which minimizes a weighted quadratic objective over $K + d$ variables, subject to the constraints $\|\mu\|_\infty \le L_\mu$ and $\|\boldsymbol{\Delta}^m\|_\infty \le L_m$.
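A simple way to approximate this projection is projected gradient descent, since the objective is convex and the box constraint on the bias vector is easy to project onto. The sketch below is our own illustration under a simplifying assumption: it enforces only the $\|\boldsymbol{\Delta}^m\|_\infty \le L_m$ constraint and omits the additional $\|\mu\|_\infty \le L_\mu$ bound of the realizable set.

```python
import numpy as np

def project_to_realizable(mu_tilde, X, counts, L_m, steps=3000):
    """Projected-gradient sketch of the estimator in (30).

    Minimizes sum_i N_i (x_i^T beta + Delta_i - mu_tilde_i)^2 over beta and
    Delta subject to ||Delta||_inf <= L_m (the ||mu||_inf constraint of M
    is omitted in this simplified sketch).
    """
    K, d = X.shape
    w = np.asarray(counts, dtype=float)                 # diagonal of D_{N_t}
    beta, delta = np.zeros(d), np.zeros(K)
    # Step size 1/L, with L an upper bound on the gradient's Lipschitz constant.
    lr = 1.0 / (w.max() * (np.linalg.norm(X, 2) ** 2 + 1.0))
    for _ in range(steps):
        resid = w * (X @ beta + delta - mu_tilde)       # weighted residual
        beta = beta - lr * (X.T @ resid)                # gradient step in beta
        delta = np.clip(delta - lr * resid, -L_m, L_m)  # step + box projection
    return X @ beta + delta, beta, delta
```

Because the step size is chosen from the Lipschitz constant, the weighted loss decreases monotonically and the returned bias vector always respects the box constraint.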
5.1 Upper Bound with Misspecification
In this subsection, we present the upper bound on the expected sample complexity of LinFACT-MIS, highlighting how misspecification affects the theoretical performance of the algorithm. Let $T^G_{\mathrm{mis}}$ denote the number of samples taken under model misspecification.
Theorem 5.1 (Upper Bound, Misspecification)
Fix $\epsilon > 0$ and suppose that the magnitude of misspecification satisfies $L_m < \min\{\frac{\alpha_\epsilon}{2\sqrt{d}}, \frac{\beta_\epsilon}{2\sqrt{d}}\}$. For $\Delta = \min(\alpha_\epsilon - 2 L_m \sqrt{d},\ \beta_\epsilon - 2 L_m \sqrt{d})/16$, there exists an event $\mathcal{E}$ such that $\mathbb{P}(\mathcal{E}) \ge 1 - \delta$. On this event, LinFACT-MIS terminates and returns the correct solution with an expected sample complexity upper bound given by

$$\mathbb{E}_\mu\left[T^G_{\mathrm{mis}} \mid \mathcal{E}\right] = \mathcal{O}\!\left( d\, \Delta^{-2} \log\!\left(\frac{K}{\delta} \log(\Delta^{-2})\right) + d^2 \log(\Delta^{-1}) \right). \tag{31}$$
The proof of this theorem is provided in Section EC.8. The upper bound in Theorem 4.1, which assumes no model misspecification, can be viewed as a special case of Theorem 5.1 with $L_m = 0$. However, the bound in Theorem 5.1 becomes invalid when the misspecification magnitude $L_m$ is too large, as the logarithmic terms may involve negative arguments, violating the assumptions required for the bound to hold. If the variation in confidence radius across arms due to misspecification is not accounted for, Theorem 5.1 indicates that the sample complexity increases: compared to Theorem 4.1, the bound is looser because the gap terms defining $\Delta$ shrink from $\alpha_\epsilon$ and $\beta_\epsilon$ to $\alpha_\epsilon - 2 L_m \sqrt{d}$ and $\beta_\epsilon - 2 L_m \sqrt{d}$, respectively.
5.2 Orthogonal Parameterization
In this section, we present an alternative version of LinFACT-MIS based on orthogonal parameterization, designed to improve computational efficiency.
Figure 3: Orthogonal Parameterization and Projection.
Note. The estimator $\hat{\mu}_t$ is obtained from the empirical mean $\tilde{\mu}_t$ by solving an optimization problem. While the true mean vector $\mu$ can be expressed as the sum of a linear component $X\beta$ and a non-linear model deviation $\boldsymbol{\Delta}^m$, it can also be decomposed at each time step $t$ via orthogonal projection into a linear part $X\beta_t$ on the hyperplane and a residual term $\boldsymbol{\Delta}^m(t)$ orthogonal to it.
Orthogonal Parameterization. Under model misspecification, traditional confidence bounds for the mean estimator based on $\|\hat{\beta}_t - \beta\|^2_{V_t}$, derived using either martingale-based methods (Abbasi-Yadkori et al. 2011) or covering arguments (Lattimore and Szepesvári 2020), are no longer directly applicable due to the presence of an additional misspecification term. To improve the concentration of the estimator in this setting, a key strategy is to adopt an orthogonal parameterization of the mean vectors within the realizable set $\mathbb{M}$ (Réda et al. 2021).

Rather than centering the confidence region around the true parameter $\beta$, we focus on the quantity $\|\hat{\beta}_t - \beta_t\|^2_{V_t}$, where $\beta_t$ is the orthogonal projection of the true mean vector onto the feature space spanned by the pulled arms at time $t$. This $\beta_t$-centered form corresponds to a self-normalized martingale and thus satisfies the same concentration bounds as in the classical linear bandit setting without misspecification. This approach offers an advantage over prior methods (Lattimore et al. 2020, Zanette et al. 2020), which require inflating the confidence region between $\hat{\beta}_t$ and $\beta$ by a term of order $L_m^2 t$, leading to overly conservative bounds in misspecified settings where $L_m \gg 0$.
Specifically, we show that any mean vector $\mu = X\beta + \boldsymbol{\Delta}^m$ can be equivalently expressed at any time $t$ as $\mu = X\beta_t + \boldsymbol{\Delta}^m(t)$, where

$$\beta_t = \left(X_{N_t}^\top X_{N_t}\right)^{-1} X_{N_t}^\top \mathcal{D}_{N_t}^{1/2}\, \mu = V_t^{-1} \sum_{s=1}^{t} x_{A_s}\, \mu_{A_s} \tag{32}$$

is the orthogonal projection of $\mu$ onto the feature space spanned by the columns of $X_{N_t}$, and $\boldsymbol{\Delta}^m(t) = \mu - X\beta_t$ is the residual. Here, $X_{N_t} = \mathcal{D}_{N_t}^{1/2} X$ is the matrix of feature vectors weighted by the number of times each arm has been sampled up to time $t$, and $\mathcal{D}_{N_t}^{1/2}$ is a diagonal matrix whose entries are the square roots of the sample counts of each arm.
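The decomposition in (32) can be illustrated numerically (a minimal sketch with names of our own choosing): the weighted residual is, by construction, orthogonal to the weighted feature columns, and it vanishes entirely when the mean vector is exactly linear.

```python
import numpy as np

def orthogonal_parameterization(mu, X, counts):
    """Decompose mu = X beta_t + Delta_t per equation (32).

    beta_t is the count-weighted least-squares projection of mu onto the span
    of the pulled features; Delta_t = mu - X beta_t is the residual term.
    """
    w = np.sqrt(np.asarray(counts, dtype=float))   # diagonal of D_{N_t}^{1/2}
    Xw = w[:, None] * X                            # X_{N_t} = D_{N_t}^{1/2} X
    beta_t = np.linalg.pinv(Xw.T @ Xw) @ (Xw.T @ (w * mu))
    delta_t = mu - X @ beta_t                      # residual Delta^m(t)
    return beta_t, delta_t
```

Checking that $X_{N_t}^\top \mathcal{D}_{N_t}^{1/2} \boldsymbol{\Delta}^m(t) = 0$ confirms the orthogonality that makes the $\beta_t$-centered concentration argument go through.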
Upper Bound. For LinFACT-MIS, the orthogonal parameterization involves replacing the sampling rule of Algorithm 1 with that of Algorithm 5. The estimation is no longer based on OLS; instead, the estimator $(\hat{\mu}_i(r))_{i \in \mathcal{X}_I(r-1)}$ is obtained from the observed data by directly solving the optimization problem in equation (30).

When model misspecification is accounted for and orthogonal parameterization is applied, the sampling policy is given by:

$$N_i(r) = \left\lceil \frac{8 d\, \lambda_i(r)}{\epsilon_r^2} \left( d \log 6 + \log\!\left(\frac{K r (r+1)}{\delta}\right) \right) \right\rceil, \qquad N_r = \sum_{i \in \mathcal{X}(r-1)} N_i(r). \tag{33}$$
Let $T^G_{\mathrm{op}}$ denote the total number of samples required when orthogonal parameterization is used. The corresponding upper bound is stated in the theorem below.
Algorithm 5 Subroutine: Sampling With Orthogonal Parameterization
1: Input: projected active set $\mathcal{X}_I(r-1)$, round $r$, $\delta$.
2: Find the G-optimal design $\lambda_r \in \mathcal{P}(\mathcal{X}(r-1))$ with $\mathrm{Supp}(\lambda_r) \le d(d+1)/2$ according to equation (13).
3: for all $i \in \mathcal{X}_I(r-1)$ do ▷ Sampling
4:   Sample arm $i$ for $N_i(r)$ times in round $r$, as specified in equation (33).

Theorem 5.2 (Upper Bound, Orthogonal Parameterization)
Fix $\epsilon > 0$ and suppose that the magnitude of misspecification satisfies $L_m < \min\{\frac{\alpha_\epsilon}{2(\sqrt{d}+2)}, \frac{\beta_\epsilon}{2(\sqrt{d}+2)}\}$. For $\Delta = \min(\alpha_\epsilon - 2 L_m (\sqrt{d}+2),\ \beta_\epsilon - 2 L_m (\sqrt{d}+2))/16$, there exists an event $\mathcal{E}$ such that $\mathbb{P}(\mathcal{E}) \ge 1 - \delta$. On this event, LinFACT-MIS terminates and returns the correct solution with an expected sample complexity upper bound given by

$$\mathbb{E}_\mu\left[T^G_{\mathrm{op}} \mid \mathcal{E}\right] = \mathcal{O}\!\left( d\, \Delta^{-2} \log\!\left(\frac{K \cdot 6^d}{\delta} \log(\Delta^{-1})\right) + d^2 \log(\Delta^{-1}) \right). \tag{34}$$
This theorem establishes that the upper bound remains of the same order as in Theorem 5.1; the detailed proof is provided in Section EC.9. As in Theorem 5.1, the use of orthogonal parameterization does not eliminate the inflation of the upper bound, which remains unavoidable. We provide further intuition in Section 5.3, arguing that without prior knowledge of the misspecification, full performance cannot be recovered through algorithmic refinement alone.
5.3 Insights for Model Misspecification

Lower Bounds in Linear and Stochastic Settings. The lower bound becomes equivalent to the unstructured lower bound as soon as the misspecification upper bound satisfies $L_m \ge L_0$, where $L_0$ is an instance-dependent finite constant. This observation is formalized in Proposition 5.3, whose proof follows the same logic as Lemma 2 in Réda et al. (2021).
Proposition 5.3
There exists $L_0 \in \mathbb{R}$ with $L_0 \le \max_i \mu_i - \min_i \mu_i$ such that if $L_m \ge L_0$, then for any pure exploration task, the lower bound in the linear setting is equal to the unstructured lower bound.
Improvement with Unknown Misspecification Is Not Possible. Knowing that a problem is misspecified without access to an upper bound $L_m$ on $\|\boldsymbol{\Delta}^m\|_\infty$ is effectively equivalent to having no structural knowledge of the problem. As a result, improving algorithmic performance under such unknown model misspecification is infeasible. In particular, as shown in cumulative regret settings, sublinear regret guarantees are no longer achievable (Ghosh et al. 2017, Lattimore et al. 2020); similarly, in pure exploration, the theoretical lower bound cannot be attained.
Prior Knowledge of the Misspecification. When the upper bound $L_m$ on the model misspecification is known in advance, LinFACT-MIS can be modified to account for this deviation. Specifically, we adjust the confidence radius $C_{\delta/K}(r)$ used in computing the lower and upper bounds $L_r$ and $U_r$ in Algorithm 3. With this modification, the number of rounds required to complete classification under misspecification, $r'_{\mathrm{upper}}$ and $r''_{\mathrm{upper}}$, coincides with $r_{\mathrm{upper}}$, the corresponding number of rounds under a perfectly linear model. This is achieved by replacing the confidence radius $\epsilon_r$ with the inflated version $\epsilon_r + L_m \sqrt{d}$ to compensate for the worst-case deviation due to misspecification. This adjustment preserves the validity of the original analysis and ensures that the same theoretical guarantees are retained. The following proposition formalizes this observation and is proved in Section EC.10.
Proposition 5.4
Suppose the misspecification magnitude $L_m \ge 0$ is known in advance. Adjusting the confidence radius $C_{\delta/K}(r)$ in Algorithm 3 at each round $r$ from its original value $\epsilon_r$ to

$$\epsilon_r' = \epsilon_r + L_m \sqrt{d} \tag{35}$$

ensures that the total number of rounds required under misspecification matches that under perfect linearity.
6 Generalized Linear Model

In this section, we extend linear bandits to the generalized linear model (GLM). In this setting, the reward no longer follows the standard linear form in equation (3), but instead satisfies

$$\mathbb{E}\left[y_t \mid A_t\right] = \mu_{\mathrm{link}}\!\left(x_{A_t}^\top \beta\right), \tag{36}$$

where $\mu_{\mathrm{link}} : \mathbb{R} \to \mathbb{R}$ is the inverse link function. GLMs encompass a class of models that includes, but is not limited to, linear models, allowing for various reward distributions beyond the Gaussian. For example, for binary-valued rewards, a suitable choice is the sigmoid function $\mu_{\mathrm{link}}(x) = \exp(x)/(1+\exp(x))$, leading to the logistic regression model; for integer-valued rewards, $\mu_{\mathrm{link}}(x) = \exp(x)$ leads to the Poisson regression model.
To keep this paper self-contained, we briefly review the main properties of GLMs (McCullagh 2019). A univariate probability distribution belongs to a canonical exponential family if its density with respect to a reference measure is given by

$$p_\theta(x) = \exp\!\left( x\theta - b(\theta) + c(x) \right), \tag{37}$$

where $\theta$ is a real parameter, $c(\cdot)$ is a real function, and the normalization function $b(\cdot)$ is assumed to be twice continuously differentiable. This family includes the Gaussian and Gamma distributions when the reference measure is the Lebesgue measure, and the Poisson and Bernoulli distributions when the reference measure is the counting measure on the integers. For a random variable $X$ with density defined in (37), $\mathbb{E}(X) = \dot{b}(\theta)$ and $\mathrm{Var}(X) = \ddot{b}(\theta)$, where $\dot{b}$ and $\ddot{b}$ denote the first and second derivatives of $b$, respectively. Since the variance is always positive and $\mu_{\mathrm{link}} = \dot{b}$ is the inverse link function, $b$ is strictly convex and $\mu_{\mathrm{link}}$ is increasing.
The canonical GLM assumes that $p_\beta(y \mid x_i) = p_{x_i^\top \beta}(y)$ for all arms $i$. The maximum likelihood estimator $\hat{\beta}_t$, based on the $\sigma$-algebra $\mathcal{F}_t = \sigma(A_1, y_1, A_2, y_2, \ldots, A_t, y_t)$, is defined as the maximizer of the function

$$\sum_{s=1}^{t} \log p_\beta(y_s \mid x_{A_s}) = \sum_{s=1}^{t} \left( y_s\, x_{A_s}^\top \beta - b(x_{A_s}^\top \beta) + c(y_s) \right). \tag{38}$$
This function is strictly concave in $\beta$. By differentiating, we obtain that $\hat{\beta}_t$ is the unique solution of the following estimating equation at time $t$:

$$\sum_{s=1}^{t} \left( y_s - \mu_{\mathrm{link}}(x_{A_s}^\top \beta) \right) x_{A_s} = 0. \tag{39}$$
In practice, while equation (39) has no closed-form solution, it can be solved efficiently using methods such as iteratively reweighted least squares (IRLS) (Wolke and Schwetlick 1988), which employs Newton's method. Here, $\tilde{\beta}_t$ is a convex combination of $\beta$ and its maximum likelihood estimate $\hat{\beta}_t$ at time $t$. The existence of $c_{\min}$ can be ensured by performing forced exploration at the beginning of the algorithm, incurring a sampling cost of $\mathcal{O}(d)$ (Kveton et al. 2023).
Assumption 6. The derivative of the inverse link function is bounded below, i.e., $c_{\min} \le \dot{\mu}_{\mathrm{link}}(x^\top \tilde{\beta}_t)$, for some $c_{\min} \in \mathbb{R}_+$ and all arms.

Assumption 6 is standard in the GLM literature (Li et al. 2017, Azizi et al. 2021b), ensuring that the reward function is sufficiently smooth, with $c_{\min} > 0$ typically determined by the choice of link function.
6.1 Algorithm with GLM

In this section, we present a refined algorithm for the generalized linear model, referred to as LinFACT-GLM. This refinement replaces the sampling rule of Algorithm 1 with Algorithm 6 and adjusts the estimation method. The designed sampling policy is described by

$$N_i(r) = \left\lceil \frac{2 d\, \lambda_i(r)}{\epsilon_r^2\, c_{\min}^2} \log\!\left(\frac{2 K r (r+1)}{\delta}\right) \right\rceil, \qquad N_r = \sum_{i \in \mathcal{X}(r-1)} N_i(r), \tag{40}$$

where $c_{\min}$ is the known constant bounding the first derivative of the inverse link function.
Algorithm 6 Subroutine: G-Optimal Sampling with GLM
1: Input: projected active set $\mathcal{X}_I(r-1)$, round $r$, $\delta$.
2: Find the G-optimal design $\lambda_r \in \mathcal{P}(\mathcal{X}(r-1))$ with support size $\mathrm{Supp}(\lambda_r) \le d(d+1)/2$ according to equation (13).
3: for all $i \in \mathcal{X}_I(r-1)$ do ▷ Sampling
4:   Sample arm $i$ for $N_i(r)$ times in round $r$, as specified in equation (40).
In the GLM setting, ordinary least squares (OLS) is no longer applicable. Instead, the estimator for each $i \in \mathcal{X}_I(r-1)$ is obtained by maximizing the log-likelihood in equation (38), i.e., by solving the estimating equation (39) on the observed data.
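For a logistic link, the estimating equation (39) can be solved by Newton iterations, one concrete instance of the IRLS scheme mentioned above. The sketch below is illustrative (our own function names; a small ridge term is added to the Fisher matrix purely for numerical stability):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glm_mle(X_pulls, y, iters=50):
    """Newton's method for equation (39) with a logistic inverse link:
    find beta such that sum_s (y_s - mu_link(x_s^T beta)) x_s = 0.
    """
    d = X_pulls.shape[1]
    beta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X_pulls @ beta)
        grad = X_pulls.T @ (y - p)              # score function, eq. (39)
        W = p * (1 - p)                         # mu_link' at each pull
        H = X_pulls.T @ (W[:, None] * X_pulls)  # Fisher information matrix
        beta = beta + np.linalg.solve(H + 1e-8 * np.eye(d), grad)
    return beta
```

At convergence the score vanishes, so checking the residual of (39) at the returned estimate is a direct correctness test.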
6.2 Upper Bound for the GLM-Based LinFACT

Let $T^{\mathrm{GLM}}$ denote the number of samples collected under the GLM setting. The following theorem provides an upper bound on the expected sample complexity of LinFACT-GLM.

Theorem 6.1 (Upper Bound, Generalized Linear Model)
For $\Delta = \min(\alpha_\epsilon, \beta_\epsilon)/16$, there exists an event $\mathcal{E}$ such that $\mathbb{P}(\mathcal{E}) \ge 1 - \delta$. On this event, the LinFACT-GLM algorithm achieves an expected sample complexity upper bound given by

$$\mathbb{E}\left[T^{\mathrm{GLM}} \mid \mathcal{E}\right] = \mathcal{O}\!\left( \frac{d}{c_{\min}^2}\, \Delta^{-2} \log\!\left(\frac{K}{\delta} \log_2(\Delta^{-2})\right) + d^2 \log(\Delta^{-1}) \right). \tag{41}$$
The upper bound presented in Theorem 6.1, which generalizes the model to the GLM setting, can be viewed as an extension of Theorem 4.1. The detailed proof is provided in Section EC.11 of the online appendix.
7 Numerical Experiments

In the numerical experiments, we compare our algorithm, LinFACT, with several baseline methods. These include the Bayesian optimization algorithm based on the knowledge-gradient acquisition function with correlated beliefs for best arm identification (KGCB) proposed by Negoescu et al. (2011); the gap-based algorithm for best arm identification (BayesGap) introduced by Hoffman et al. (2013); the track-and-stop algorithm for threshold bandits (Lazy TTS) developed by Rivera and Tewari (2024); and two gap-based algorithms for top-$m$ arm identification, LinGIFA and m-LinGapE, presented by Réda et al. (2021a), which represent the state of the art for returning multiple candidates.

Identifying all $\epsilon$-best arms is often more challenging than identifying the top $m$ arms or the arms above a given threshold. To address this, we adopt a randomized setting in which both the number of $\epsilon$-best arms and the $\epsilon$-threshold are randomly sampled. In this setting, top-$m$ algorithms and threshold bandit algorithms have access only to the expected reward values, ensuring a fair comparison. Since BayesGap and KGCB operate under a fixed-budget setting, we use the average sample complexity of LinFACT as their budget and evaluate their performance accordingly. For BAI algorithms, we select the arms whose empirical means are within $\epsilon$ of the empirical best arm once the budget is exhausted.
Figure 4: Illustration of the Synthetic Experiment Settings. (a) Synthetic I, adaptive setting; (b) Synthetic II, static setting.
Note. In the adaptive setting (a), we randomly sample the threshold $\tilde{\mu}$ from a distribution with mean 1, and independently sample the number of $\epsilon$-best arms $\tilde{m}$ from a distribution with mean $m$. The arms above and below the threshold are then uniformly drawn from the intervals $[\tilde{\mu}, \tilde{\mu} + \epsilon]$ and $[\tilde{\mu} - \epsilon, \tilde{\mu})$, respectively. In the static setting (b), we fix the threshold at $\Delta/2$ and randomly sample $\tilde{m}$ $\epsilon$-best arms with reward $\Delta$.
7.1 Synthetic Experiments
Following Soare et al. (2014), Xu et al. (2018), and Azizi et al. (2021b), we categorize synthetic data into two types: adaptive and static settings. Figure 4 illustrates the construction of these synthetic datasets, with detailed configurations provided in Section EC.12. A summary of all settings is presented in Table 1.
In the adaptive setting, arms are divided into three categories: (1) the arms to be selected (i.e., all the $\epsilon$-best arms), (2) disturbing arms that are slightly worse, and (3) base arms with zero reward. The primary challenge for algorithms is to distinguish between the arms in categories (1) and (2), while the base arms in category (3) can be ignored. Adaptive algorithms that effectively leverage shared information to explore similar arms perform well in this setting.

In the static setting, arms are divided into two categories: the $\epsilon$-best arms and the base arms with zero reward. In this case, algorithms must distinguish between all arms, and static algorithms that explore all arms uniformly are well suited to this setting.
Table 1: Synthetic Experiment Settings

| Setting Index | Setting Category | Setting Details |
| --- | --- | --- |
| 1 | Adaptive | $(d, \mathbb{E}[m]) = (8, 4)$, $\epsilon = 0.1$ |
| 2 | Adaptive | $(d, \mathbb{E}[m]) = (8, 4)$, $\epsilon = 0.2$ |
| 3 | Adaptive | $(d, \mathbb{E}[m]) = (8, 4)$, $\epsilon = 0.3$ |
| 4 | Adaptive | $(d, \mathbb{E}[m]) = (12, 4)$, $\epsilon = 0.1$ |
| 5 | Adaptive | $(d, \mathbb{E}[m]) = (12, 4)$, $\epsilon = 0.2$ |
| 6 | Adaptive | $(d, \mathbb{E}[m]) = (12, 4)$, $\epsilon = 0.3$ |
| 7 | Static | $(K, \Delta) = (8, 1)$ |
| 8 | Static | $(K, \Delta) = (12, 1)$ |
| 9 | Static | $(K, \Delta) = (16, 1)$ |

Remark 7.1
A common misconception is that adaptive algorithms universally outperform static ones. While adaptive algorithms are typically efficient at focusing exploration on promising arms, they can be less effective in static environments, where they may allocate samples inefficiently between candidate and baseline arms, leading to redundant exploration. In contrast, static algorithms can achieve the objective more efficiently by allocating samples uniformly across all arms, avoiding bias and over-exploration.
Experiment Setup. We benchmark our algorithms, LinFACT-G and LinFACT-$\mathcal{XY}$, against BayesGap, KGCB, LinGIFA, m-LinGapE, and Lazy TTS, focusing on both sample complexity and the $F_1$ score.

We conduct each experiment using different data types (adaptive or static), arm dimensions ($d$), and numbers of arms ($K$). For each configuration, we generate 10 values of $m$ from a normal distribution centered at the expected value $\mathbb{E}[m]$ with variance 3.0, where $m$ is the input to a top-$m$ algorithm. For each sampled pair $(\tilde{m}, \tilde{\mu})$, where $\tilde{m}$ denotes the number of $\epsilon$-best arms and $\tilde{\mu}$ the value of the best arm minus $\epsilon$, we repeat the experiment 100 times. We then compute the average $F_1$ score across the 10 $(\tilde{m}, \tilde{\mu})$ pairs, resulting in 1,000 total executions per algorithm.
In practice, we observe that KGCB, LinGIFA, m-LinGapE, and Lazy TTS are computationally intensive when the sampling budget is high. The time-consuming nature of KGCB has already been noted in the literature (Negoescu et al. 2011). Lazy TTS requires repeatedly evaluating an objective function within an optimization problem, where each evaluation costs $\mathcal{O}(K d^2 + d^3)$; since the optimization involves a non-negligible number of iterations $n$, the total time complexity becomes $\mathcal{O}(n K d^2)$, making the algorithm inefficient.

For the two top-$m$ algorithms, the computational burden arises from performing matrix inversions for all arms, leading to a total time complexity of $\mathcal{O}(K d^3)$. In contrast, LinFACT achieves a significantly lower total time complexity of $\mathcal{O}(K d^2)$, as the optimization problem within our algorithm can be solved efficiently by fixed-step gradient descent. Table 2 reports the running times on synthetic data, showing that our algorithm is at least five times faster than all other methods. Notably, this performance gap is even more pronounced on real data.
Table 2: Running Time (seconds) for Different Synthetic Experiments Among Algorithms

| Algorithm | Adaptive $(8,4)$, $\epsilon=0.1$ | $(8,4)$, $\epsilon=0.2$ | $(8,4)$, $\epsilon=0.3$ | $(12,4)$, $\epsilon=0.1$ | $(12,4)$, $\epsilon=0.2$ | $(12,4)$, $\epsilon=0.3$ | Static $K=8$ | $K=12$ | $K=16$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LinFACT-G | **0.028** | **0.030** | **0.028** | **0.078** | **0.078** | **0.076** | **0.005** | **0.007** | **0.009** |
| LinFACT-$\mathcal{XY}$ | 0.037 | 0.038 | 0.038 | 0.163 | 0.155 | 0.145 | 0.015 | 0.028 | 0.049 |
| BayesGap | 0.112 | 0.110 | 0.110 | 0.306 | 0.304 | 0.307 | 0.036 | 0.071 | 0.112 |
| LinGIFA | 0.313 | 0.265 | 0.236 | 1.943 | 1.484 | 1.424 | 0.160 | 0.529 | 1.297 |
| m-LinGapE | 0.195 | 0.166 | 0.157 | 1.362 | 1.053 | 0.984 | 0.116 | 0.352 | 0.932 |
| Lazy TTS | 0.991 | 3.904 | 12.512 | 24.583 | 26.941 | 25.622 | 0.198 | 0.871 | 2.164 |
| KGCB | 1.990 | 1.936 | 1.670 | 6.245 | 6.611 | 6.082 | 0.303 | 0.745 | 1.386 |

Note. Adaptive columns give $(d, \mathbb{E}[m])$ and $\epsilon$; static columns use $\Delta = 1$. The best result in each setting is in bold; the second best in each setting is LinFACT-$\mathcal{XY}$.
Experiment Results. Our experimental results are presented in Figures 5 and 6. In Figure 5, the vertical axis denotes the $F_1$ score, with higher values indicating better algorithm performance. The first row of six plots shows the results under adaptive settings. As $\omega$ increases, the non-optimal (disturbing) arms in the adaptive setting move progressively farther from the optimal arms, making them easier to distinguish. Consequently, the $F_1$ score increases from left to right. An exception is BayesGap, which performs best when $\omega = 0.2$. This occurs because best-arm identification algorithms, such as BayesGap, struggle to differentiate optimal arms from disturbing ones when they are close ($\omega = 0.1$) and fail to fully explore the optimal arms when they are not closely clustered with the best arm ($\omega = 0.3$).
Our LinFACT algorithms consistently outperform the top-$m$ algorithms, BayesGap, and KGCB. While our algorithms perform slightly worse than Lazy TTS in some cases, they have much lower sample complexity, as shown in Figure 6, meaning that Lazy TTS requires substantially more samples to achieve these results. When comparing LinFACT-G and LinFACT-$\mathcal{XY}$, we observe that in adaptive settings the $F_1$ scores are similar, but LinFACT-$\mathcal{XY}$ achieves lower sample complexity. In static settings, however, LinFACT-G attains a higher $F_1$ score with reduced sample complexity. This difference stems from the distinct focus of the two designs: the $\mathcal{XY}$-optimal design prioritizes pulling arms to obtain better estimates along the directions representing differences between arms, while the G-optimal design aims to improve estimates along the directions representing all arms.
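To make the design contrast concrete, the sketch below (our own simplification, not the paper's code; `xy_design` is a hypothetical helper) adapts a Frank-Wolfe-style iteration to the difference directions $y = a - a'$, which is the quantity an $\mathcal{XY}$-style design controls, instead of the arm directions themselves.

```python
import numpy as np

def xy_design(X, n_iter=500, reg=1e-9):
    """Frank-Wolfe-style sketch of an XY-optimal design: minimize the
    worst-case variance y^T A(w)^{-1} y over difference directions y."""
    K, d = X.shape
    # all pairwise difference directions between arms
    Y = np.array([X[i] - X[j] for i in range(K) for j in range(K) if i != j])
    w = np.full(K, 1.0 / K)
    for t in range(n_iter):
        A = X.T @ (w[:, None] * X) + reg * np.eye(d)
        A_inv = np.linalg.inv(A)
        # worst-estimated difference direction under the current design
        y = Y[int(np.argmax(np.einsum('ij,jk,ik->i', Y, A_inv, Y)))]
        # pull the arm whose observation most shrinks variance along y
        j = int(np.argmax((X @ (A_inv @ y)) ** 2))
        gamma = 1.0 / (t + 2)
        w *= (1.0 - gamma)
        w[j] += gamma
    return w
```

Compared with the G-optimal sketch, only the set over which the worst-case variance is measured changes, which mirrors the distinction drawn in the paragraph above.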
Figure 5: $F_1$ Scores for Different Synthetic Experiments Among Algorithms
Note. The y-axis reports the $F_1$ score, which reflects how accurately each algorithm identifies $\epsilon$-best arms. LinFACT-G and LinFACT-$\mathcal{XY}$ consistently achieve high $F_1$ scores. While Lazy TTS occasionally attains higher scores, we demonstrate in the next figure that it requires significantly more samples to do so. Detailed configurations of each experimental setting are provided in Table 1.
Figure 6: Sample Complexity for Different Synthetic Experiments Among Algorithms
Note. LinFACT-G and LinFACT-$\mathcal{XY}$ demonstrate sample complexities comparable to LinGIFA and m-LinGapE while achieving higher $F_1$ scores in the adaptive settings (Settings 1 to 6), and consistently outperform all other algorithms in the static settings (Settings 7 to 9). BayesGap and KGCB are excluded from this comparison as they are designed for the fixed-budget setting, and thus their sample complexity is not well-defined.
7.2 Experiments with Real Data: Drug Discovery
Experiment Setup. We adopt the Free-Wilson model (Free and Wilson 1964) and use real data from a drug discovery task (Katz et al. 1977). The Free-Wilson model is a linear framework in which the overall efficacy of a compound is expressed as the sum of the contributions from each substituent on the base molecule, along with the effect of the base molecule itself (Negoescu et al. 2011).
Figure 7: An Example of a Molecule and Substituent Locations with Sites $1, \ldots, 5$
Note. We begin with a base molecule containing multiple attachment sites for chemical substituents. By varying these substituents, we generate a diverse set of compounds and aim to identify those with desirable properties.
Each compound is modeled as an arm represented by a binary indicator vector. Suppose there are $M$ modification sites, with site $i \in [M]$ offering $k_i$ alternative substituents. Then each arm $\boldsymbol{a}$ lies in $\mathbb{R}^{1 + \sum_{i \in [M]} k_i}$, where the initial entry corresponds to the base molecule (i.e., the intercept term). For each site, the corresponding segment of the vector has exactly one entry set to 1 (indicating the chosen substituent), with the remaining $k_i - 1$ entries set to 0. This results in a total of $\prod_{i \in [M]} k_i$ unique compound configurations.
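As a concrete illustration of this encoding (a minimal sketch; `build_arms` is our own helper name, and the site sizes follow the Katz et al. (1977) setup described below), one can enumerate every compound's indicator vector:

```python
import numpy as np
from itertools import product

def build_arms(site_sizes):
    """Construct binary feature vectors for all compounds.

    site_sizes: list of k_i, the number of candidate substituents per
    site. Each arm lives in R^{1 + sum(k_i)}: a leading 1 for the base
    molecule (intercept), then one one-hot block per site.
    """
    dim = 1 + sum(site_sizes)
    arms = []
    for choice in product(*[range(k) for k in site_sizes]):
        v = np.zeros(dim)
        v[0] = 1.0                      # base-molecule intercept
        offset = 1
        for site, k in enumerate(site_sizes):
            v[offset + choice[site]] = 1.0  # chosen substituent at this site
            offset += k
        arms.append(v)
    return np.array(arms)

arms = build_arms([4, 5, 4, 3, 4])
print(arms.shape)  # (960, 21): 4*5*4*3*4 compounds, each in R^21
```

Each row sums to 6 (the intercept plus one active substituent at each of the five sites), matching the dimensions reported for the real data.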
We conducted our experiments using the data described in Katz et al. (1977), retaining only the non-zero entries. The base molecule, illustrated in Figure 7, contains five sites where substituents can be attached. The sites offer 4, 5, 4, 3, and 4 candidate substituents, respectively, resulting in $4 \times 5 \times 4 \times 3 \times 4 = 960$ possible compounds. Each arm $\boldsymbol{a}$ is represented as a vector in $\mathbb{R}^{21}$.
We benchmarked our algorithms, LinFACT-G and LinFACT-$\mathcal{XY}$, against LinGIFA and m-LinGapE, evaluating how precision, recall, $F_1$ score, and sample complexity vary with the failure probability $\delta$. In this study, the good set was defined to include 20 $\epsilon$-best arms, which corresponds to $\epsilon = 4.325$. All methods were tested over 10 trials, with $\delta$ ranging from 0.1 to 0.9 in increments of 0.1.
Experiment Results. The experimental results are shown in Figure 8. LinFACT-G and LinFACT-$\mathcal{XY}$ consistently deliver high precision, recall, and $F_1$ scores across various failure probabilities $\delta$, as shown in Figures 8a–8c. In particular, their precision remains close to 1.0, indicating that nearly all selected arms are truly $\epsilon$-best. Their recall is also robust across $\delta$, suggesting strong coverage of the good set with minimal omission. This balance between high precision and recall leads to $F_1$ scores that remain consistently near optimal, even as $\delta$ varies from 0.1 to 0.9. In contrast, LinGIFA and m-LinGapE exhibit larger fluctuations and overall lower values in all three metrics, with especially degraded recall and $F_1$ performance at moderate values of $\delta$.
In addition, as illustrated in Figure 8d, LinFACT-$\mathcal{XY}$ achieves the lowest sample complexity, followed closely by LinFACT-G, with both outperforming all baseline methods. These performance disparities highlight the reliability and robustness of the LinFACT methods. Finally, LinFACT-G and LinFACT-$\mathcal{XY}$ also demonstrate strong computational efficiency: both complete the task within one minute, whereas LinGIFA and m-LinGapE take about four times longer, and Lazy TTS requires over 30 minutes.
(a) Precision (b) Recall (c) $F_1$ Score (d) Sample Complexity. Figure 8: Precision, Recall, and $F_1$ Score for Various Failure Probabilities $\delta$
Note. LinFACT-G and LinFACT-$\mathcal{XY}$ consistently demonstrate the best performance across all metrics. LinFACT-$\mathcal{XY}$ achieves the lowest sample complexity while maintaining high accuracy. BayesGap and KGCB are excluded as they operate under a fixed-budget setting. Lazy TTS is also omitted due to excessive runtime.
8 Conclusion
In this paper, we address the challenge of identifying all $\epsilon$-best arms in linear bandits, motivated by applications such as drug discovery. We establish the first information-theoretic lower bound to characterize the problem's complexity and derive a matching upper bound. Our LinFACT algorithm achieves instance-optimal performance up to a logarithmic factor under the $\mathcal{XY}$-optimal design criterion.
We further extend our analysis to settings with model misspecification and generalized linear models (GLMs), deriving new upper bounds and providing insights into algorithmic behavior under these broader conditions. These results generalize and recover the guarantees from the perfectly linear case as special instances. Our numerical experiments confirm that LinFACT outperforms existing methods in both sample and computational efficiency, while maintaining high accuracy in identifying all $\epsilon$-best arms.
Future Research Directions. First, while our current work focuses on the fixed-confidence setting, many real-world applications operate under a fixed sampling budget, where the objective is to achieve the best outcome within a limited number of trials. Extending the algorithm to this fixed-budget setting and establishing corresponding theoretical guarantees remains an important avenue for exploration. Moreover, deriving fundamental lower bounds in the fixed-budget regime is still an open question. Second, although we have proposed extensions for both misspecified linear bandits and generalized linear models (GLMs), future work could benefit from developing separate algorithms tailored to each setting. Such targeted designs may offer improved performance.
References

- Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems 24.
- Abe N, Long PM (1999) Associative reinforcement learning using linear probabilistic concepts. ICML, 3–11 (Citeseer).
- Abernethy JD, Amin K, Zhu R (2016) Threshold bandits, with and without censored feedback. Advances in Neural Information Processing Systems 29.
- Al Marjani A, Kocak T, Garivier A (2022) On the complexity of all $\epsilon$-best arms identification. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 317–332 (Springer).
- Alaei S, Makhdoumi A, Malekian A, Pekeč S (2022) Revenue-sharing allocation strategies for two-sided media platforms: Pro-rata vs. user-centric. Management Science 68(12):8699–8721.
- Allen-Zhu Z, Li Y, Singh A, Wang Y (2017) Near-optimal discrete optimization for experimental design: A regret minimization approach.
- Allen-Zhu Z, Li Y, Singh A, Wang Y (2021) Near-optimal discrete optimization for experimental design: A regret minimization approach. Mathematical Programming 186:439–478.
- Azizi MJ, Kveton B, Ghavamzadeh M (2021) Fixed-budget best-arm identification in structured bandits. arXiv preprint arXiv:2106.04763.
- Bechhofer RE (1954) A single-sample multiple decision procedure for ranking means of normal populations with known variances. The Annals of Mathematical Statistics 16–39.
- Boyd EA, Bilegan IC (2003) Revenue management and e-commerce. Management Science 49(10):1363–1386.
- Bubeck S, Cesa-Bianchi N, et al. (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5(1):1–122.
- Bubeck S, Wang T, Viswanathan N (2013) Multiple identifications in multi-armed bandits. International Conference on Machine Learning, 258–265 (PMLR).
- Chen CH, Lin J, Yücesan E, Chick SE (2000) Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynamic Systems 10:251–270.
- Das P, Sercu T, Wadhawan K, Padhi I, Gehrmann S, Cipcigan F, Chenthamarakshan V, Strobelt H, Dos Santos C, Chen PY, et al. (2021) Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nature Biomedical Engineering 5(6):613–623.
- Elmaghraby W, Keskinocak P (2003) Dynamic pricing in the presence of inventory considerations: Research overview, current practices, and future directions. Management Science 49(10):1287–1309.
- Even-Dar E, Mannor S, Mansour Y, Mahadevan S (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research 7(6).
- Fan W, Hong LJ, Nelson BL (2016) Indifference-zone-free selection of the best. Operations Research 64(6):1499–1514.
- Feng Q, Ma T, Zhu R (2025) Satisficing regret minimization in bandits. Proceedings of the 13th International Conference on Learning Representations.
- Feng Y, Caldentey R, Ryan CT (2022) Robust learning of consumer preferences. Operations Research 70(2):918–962.
- Fiez T, Jain L, Jamieson KG, Ratliff L (2019) Sequential experimental design for transductive linear bandits. Advances in Neural Information Processing Systems 32.
- Frazier P, Powell W, Dayanik S (2009) The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing 21(4):599–613.
- Frazier PI, Powell WB, Dayanik S (2008) A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization 47(5):2410–2439.
- Free SM, Wilson JW (1964) A mathematical contribution to structure-activity studies. Journal of Medicinal Chemistry 7(4):395–399.
- Gabillon V, Ghavamzadeh M, Lazaric A (2012) Best arm identification: A unified approach to fixed budget and fixed confidence. Advances in Neural Information Processing Systems 25.
- Ghosh A, Chowdhury SR, Gopalan A (2017) Misspecified linear bandits. Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
- Godinho de Matos M, Ferreira P, Smith MD (2018) The effect of subscription video-on-demand on piracy: Evidence from a household-level randomized experiment. Management Science 64(12):5610–5630.
- Hoffman MW, Shahriari B, de Freitas N (2013) Exploiting correlation and budget constraints in Bayesian multi-armed bandit optimization. arXiv preprint arXiv:1303.6746.
- Hong LJ, Fan W, Luo J (2021) Review on ranking and selection: A new perspective. Frontiers of Engineering Management 8(3):321–343.
- Huang Z, Zeng DD, Chen H (2007) Analyzing consumer-product graphs: Empirical findings and applications in recommender systems. Management Science 53(7):1146–1164.
- Kalyanakrishnan S, Stone P (2010) Efficient selection of multiple bandit arms: Theory and practice. ICML, volume 10, 511–518.
- Kalyanakrishnan S, Tewari A, Auer P, Stone P (2012) PAC subset selection in stochastic multi-armed bandits. ICML, volume 12, 655–662.
- Katz R, Osborne SF, Ionescu F (1977) Application of the Free-Wilson technique to structurally related series of homologs. Quantitative structure-activity relationship studies of narcotic analgetics. Journal of Medicinal Chemistry 20(11):1413–1419.
- Kaufmann E, Cappé O, Garivier A (2016) On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research 17:1–42.
- Kaufmann E, Kalyanakrishnan S (2013) Information complexity in bandit subset selection. Conference on Learning Theory, 228–251 (PMLR).
- Kim SH, Nelson BL (2001) A fully sequential procedure for indifference-zone selection in simulation. ACM Transactions on Modeling and Computer Simulation (TOMACS) 11(3):251–273.
- Kim SH, Nelson BL (2006) Selecting the best system. Handbooks in Operations Research and Management Science 13:501–534.
- Koenig LW, Law AM (1985) A procedure for selecting a subset of size m containing the l best of k independent normal populations, with applications to simulation. Communications in Statistics-Simulation and Computation 14(3):719–734.
- Komiyama J, Ariu K, Kato M, Qin C (2023) Rate-optimal Bayesian simple regret in best arm identification. Mathematics of Operations Research.
- Kveton B, Zaheer M, Szepesvari C, Li L, Ghavamzadeh M, Boutilier C (2023) Randomized exploration in generalized linear bandits.
- Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press).
- Lattimore T, Szepesvari C, Weisz G (2020) Learning with good feature representations in bandits and in RL with a generative model.
- Li L, Lu Y, Zhou D (2017) Provably optimal algorithms for generalized linear contextual bandits. International Conference on Machine Learning, 2071–2080 (PMLR).
- Li Z, Fan W, Hong LJ (2024) The (surprising) sample optimality of greedy procedures for large-scale ranking and selection. Management Science.
- Locatelli A, Gutzeit M, Carpentier A (2016) An optimal algorithm for the thresholding bandit problem. International Conference on Machine Learning, 1690–1698 (PMLR).
- Mannor S, Tsitsiklis JN (2004) The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research 5(Jun):623–648.
- Mason B, Jain L, Tripathy A, Nowak R (2020) Finding all $\epsilon$-good arms in stochastic bandits. Advances in Neural Information Processing Systems 33:20707–20718.
- McCullagh P (2019) Generalized Linear Models (Routledge).
- Negoescu DM, Frazier PI, Powell WB (2011) The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing 23(3):346–363.
- Peukert C, Sen A, Claussen J (2023) The editor and the algorithm: Recommendation technology in online news. Management Science.
- Pukelsheim F (2006) Optimal Design of Experiments (SIAM).
- Qin C, You W (2025) Dual-directed algorithm design for efficient pure exploration. Operations Research.
- Réda C, Kaufmann E, Delahaye-Duriez A (2021a) Top-m identification for linear bandits. International Conference on Artificial Intelligence and Statistics, 1108–1116 (PMLR).
- Réda C, Tirinzoni A, Degenne R (2021b) Dealing with misspecification in fixed-confidence linear top-m identification. Advances in Neural Information Processing Systems 34:25489–25501.
- Rivera EO, Tewari A (2024) Optimal thresholding linear bandit. arXiv preprint arXiv:2402.09467.
- Russo D (2020) Simple Bayesian algorithms for best-arm identification. Operations Research 68(6):1625–1647.
- Ryzhov IO, Powell WB, Frazier PI (2012) The knowledge gradient algorithm for a general class of online learning problems. Operations Research 60(1):180–195.
- Shen H, Hong LJ, Zhang X (2021) Ranking and selection with covariates for personalized decision making. INFORMS Journal on Computing 33(4):1500–1519.
- Shin D, Broadie M, Zeevi A (2018) Tractable sampling strategies for ordinal optimization. Operations Research 66(6):1693–1712.
- Simchi-Levi D, Wang C, Xu J (2024) On experimentation with heterogeneous subgroups: An asymptotic optimal $\delta$-weighted-PAC design. SSRN Electronic Journal.
- Soare M, Lazaric A, Munos R (2014) Best-arm identification in linear bandits. Advances in Neural Information Processing Systems 27.
- Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3-4):285–294.
- Thornton C, Hutter F, Hoos HH, Leyton-Brown K (2013) Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 847–855.
- Wolke R, Schwetlick H (1988) Iteratively reweighted least squares: Algorithms, convergence analysis, and numerical comparisons. SIAM Journal on Scientific and Statistical Computing 9(5):907–921.
- Xu L, Honda J, Sugiyama M (2018) A fully adaptive algorithm for pure exploration in linear bandits. International Conference on Artificial Intelligence and Statistics, 843–851 (PMLR).
- Yang J, Tan V (2022) Minimax optimal fixed-budget best arm identification in linear bandits. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds., Advances in Neural Information Processing Systems, volume 35, 12253–12266 (Curran Associates, Inc.).
- Zanette A, Lazaric A, Kochenderfer M, Brunskill E (2020) Learning near optimal policies with low inherent Bellman error. International Conference on Machine Learning, 10978–10989 (PMLR).
E-Companion: Identifying All $\epsilon$-Best Arms in Linear Bandits with Misspecification
Appendix EC.1 Additional Literature
EC.1.1 Misspecified Linear Bandits.
The linear bandit (LB) problem, introduced by Abe and Long (1999), extends the multi-armed bandits (MABs) framework by incorporating structural relationships among different arms. In the context of best arm identification, Garivier and Kaufmann (2016) established a classical lower bound, which was later extended to linear bandits by Fiez et al. (2019) using transportation inequalities.
The foundational study of linear bandits in the pure exploration framework was conducted by Hoffman et al. (2014), who addressed the best arm identification (BAI) problem in a fixed-budget setting while considering correlations among arm distributions. They proposed BayesGap, a Bayesian variant of the gap-based exploration algorithm (Gabillon et al. 2012). Although BayesGap outperformed methods that ignore correlations and structural relationships, its limitation of ceasing to pull arms deemed sub-optimal hindered its effectiveness in linear bandit pure exploration.
A key distinction between stochastic MABs and linear bandits is that, in MABs, once an arm's sub-optimality is confirmed with high probability, it is no longer pulled. In linear bandits, however, even sub-optimal arms can offer valuable information about the parameter vector, improving confidence in estimates and aiding the discrimination of near-optimal arms. This insight has led to the adoption of optimal linear experiment design as a crucial framework for linear bandit pure exploration (Abbasi-Yadkori et al. 2011, Soare et al. 2014, Fiez et al. 2019, Réda et al. 2021, Yang and Tan 2021, Azizi et al. 2021a).
When applying linear models to real data, misspecification inevitably arises in situations where the data deviates from perfect linearity. The concept of misspecified bandit models was introduced in the context of cumulative regret by Ghosh et al. (2017), who demonstrated a significant limitation: any linear bandit algorithm (e.g., OFUL (Abbasi-Yadkori et al. 2011) or LinUCB (Li et al. 2010)), which achieves optimal regret bounds on perfectly linear instances, can suffer linear regret on certain misspecified models. To address this, they proposed a hypothesis-test-based algorithm that avoids linear regret and achieves UCB-type sublinear regret for models with non-sparse deviations from linearity. Lattimore et al. (2020) further analyzed misspecification, showing that elimination-based algorithms with G-optimal design perform well under misspecification but incur an additional linear regret term proportional to the misspecification magnitude over the horizon.
In the pure exploration setting, misspecified linear models were first studied in the context of identifying the top-$m$ arms by Réda et al. (2021), who introduced the MisLid algorithm, leveraging orthogonal parameterization to address misspecification. Subsequent research examined misspecification in ordinal optimization (Ahn et al. 2024), proposing prospective sampling methods that reduce the impact of misspecification as the sample size increases. Building on the definition of model misspecification and the optimization approach based on orthogonal parameterization from Réda et al. (2021), we develop new algorithms for identifying all $\epsilon$-best arms in misspecified linear bandits and establish new upper bounds.
EC.1.2 Generalized Linear Bandits.
The generalized linear bandit (GLB) model (Filippi et al. 2010, Ahn and Shin 2020, Kveton et al. 2023) extends the multi-armed bandit framework by incorporating generalized linear models (GLMs) (McCullagh 2019) to model expected rewards. Specifically, the expected reward of each arm is given by a known link function applied to the inner product of a feature vector and an unknown parameter vector. Most existing algorithms for generalized linear bandits employ the upper confidence bound (UCB) approach, with randomized GLM algorithms (Chapelle and Li 2011, Russo et al. 2018, Kveton et al. 2023) demonstrating superior performance.
In the context of pure exploration, Azizi et al. (2021b) introduced the first practical algorithm for best arm identification in generalized linear bandits, supported by theoretical analysis. Their work extends the best arm identification problem from linear models to more complex settings where the relationship between features and rewards follows generalized linear models (GLMs). Building on this foundation, we extend the pure exploration setting from best arm identification (BAI) to identifying all $\epsilon$-best arms, providing analogous analyses and theoretical results for GLMs.
Appendix EC.2 General Pure Exploration Model
In this section, we present a brief discussion about the general pure exploration problem. A more comprehensive explanation can be found in Qin and You (2025).
The decision-maker seeks to answer a query concerning the mean parameters $\boldsymbol{\mu}$ by adaptively allocating the sampling budget across the available arms. This query typically involves identifying a subset of arms that satisfy certain criteria, and the goal is to determine the correct answer with high probability. Let $\mathcal{I}^\star(\boldsymbol{\mu})$ denote the correct answer, which in our setting corresponds to the set of all $\epsilon$-best arms. Define $\tilde{\mathcal{M}}$ as the set of parameters that yield a unique answer. Let $\mathbb{I}$ represent the collection of all possible answers, and for each $\mathcal{I}' \in \mathbb{I}$, define $\mathcal{M}_{\mathcal{I}'} \triangleq \{\boldsymbol{\mu} \in \tilde{\mathcal{M}} : \mathcal{I}^\star(\boldsymbol{\mu}) = \mathcal{I}'\}$ as the set of parameters for which $\mathcal{I}'$ is the correct answer. The overall parameter space of interest is then given by $\mathcal{M} \triangleq \cup_{\mathcal{I}' \in \mathbb{I}} \mathcal{M}_{\mathcal{I}'}$.
Recall that an algorithm is defined as a triplet $\mathfrak{A} = (A_t, \tau_\delta, \hat{\mathcal{I}}_\tau)$, consisting of a sampling rule, a stopping time, and a recommendation rule. The algorithm's sample complexity is quantified by the number of samples, denoted $\tau_\delta$, at the point of termination. The objective is to formulate algorithms that minimize the expected sample complexity $\mathbb{E}_{\boldsymbol{\mu}}[\tau_\delta]$ across the set $\mathcal{M}$. As stated in Kaufmann et al. (2016), when $\delta \in (0, 1)$, the non-asymptotic problem complexity of an instance $\boldsymbol{\mu}$ can be defined as

$$\Gamma^\star(\boldsymbol{\mu}) \triangleq \inf_{\mathfrak{A} \text{ is } \delta\text{-PAC}} \frac{\mathbb{E}_{\boldsymbol{\mu}}[\tau_\delta]}{\log(1/2.4\delta)}. \tag{EC.2.1}$$

This instance-dependent complexity indicates the smallest possible constant such that the expected sample complexity $\mathbb{E}_{\boldsymbol{\mu}}[\tau_\delta]$ scales in alignment with $\log(1/2.4\delta)$. The problem complexity $\Gamma^\star(\boldsymbol{\mu})$ is subject to an information-theoretic lower bound. This lower bound can be expressed as the optimal solution of an allocation problem, which we present in Proposition 3.1. To build this framework, we next introduce three important concepts: culprits, alternative sets, and the $C_x$ function.
EC.2.1 Culprits and Alternative Sets
Let $\mathcal{X}(\boldsymbol{\mu})$ denote the set of culprits under the true mean vector $\boldsymbol{\mu}$. These culprits are responsible for deviations from the correct answer $\mathcal{I}^\star(\boldsymbol{\mu})$. The structure of $\mathcal{X}(\boldsymbol{\mu})$ varies depending on the specific exploration task, and identifying these culprits is essential for characterizing the problem's complexity and guiding the design of effective algorithms.
To identify the correct answer, an algorithm must distinguish among different instances within the parameter space $\mathcal{M}$. Accordingly, for any instance $\boldsymbol{\mu} \in \mathcal{M}$, the instance-dependent problem complexity $\Gamma^\star(\boldsymbol{\mu})$ is determined by the structure of the corresponding alternative set

$$\mathrm{Alt}(\boldsymbol{\mu}) \triangleq \{\boldsymbol{\lambda} \in \mathcal{M} : \mathcal{I}^\star(\boldsymbol{\lambda}) \neq \mathcal{I}^\star(\boldsymbol{\mu})\} = \bigcup_{x \in \mathcal{X}(\boldsymbol{\mu})} \mathrm{Alt}_x(\boldsymbol{\mu}), \tag{EC.2.2}$$

which represents the set of parameters that return a solution different from the correct solution $\mathcal{I}^\star(\boldsymbol{\mu})$.
As an example, consider the task of identifying the single best arm. In this case, the culprit set is $\mathcal{X}(\boldsymbol{\mu}) = [K] \setminus \{I^\star(\boldsymbol{\mu})\}$, consisting of all arms except the current best arm $I^\star(\boldsymbol{\mu})$. For each culprit $x \in \mathcal{X}(\boldsymbol{\mu})$, if there exists a parameter $\boldsymbol{\lambda}$ under which arm $x$ has a higher mean than $I^\star(\boldsymbol{\mu})$, then $\boldsymbol{\lambda}$ leads to an incorrect identification caused by $x$. Each such culprit is associated with an alternative set, namely the set of parameters that yield a wrong answer due to $x$, given by $\mathrm{Alt}_x(\boldsymbol{\mu}) = \{\boldsymbol{\lambda} \in \mathcal{M} : \lambda_x \geq \lambda_{I^\star(\boldsymbol{\mu})}\}$ for $x \in \mathcal{X}(\boldsymbol{\mu})$.
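For Gaussian rewards with known variance, projecting $\boldsymbol{\mu}$ onto such an alternative set, that is, minimizing the allocation-weighted KL divergence over $\mathrm{Alt}_x$ (the quantity formalized as $C_x$ in Section EC.2.2), admits a closed form: only the means of arm $x$ and the best arm move, pooling at their weighted average. The helper below is our own illustrative code, not the paper's:

```python
import numpy as np

def C_x(w, mu, x, sigma=1.0):
    """Compute inf over Alt_x of sum_i w_i * KL(N(mu_i, s^2), N(lam_i, s^2)).

    Only lam_x and lam_best differ from mu at the infimum; the constraint
    lam_x >= lam_best binds, so the two coordinates pool at their
    w-weighted mean.
    """
    best = int(np.argmax(mu))
    if x == best:
        return 0.0                      # mu itself already lies in Alt_x
    pooled = (w[best] * mu[best] + w[x] * mu[x]) / (w[best] + w[x])
    return (w[best] * (mu[best] - pooled) ** 2
            + w[x] * (mu[x] - pooled) ** 2) / (2.0 * sigma ** 2)
```

One can check that, for unit variance, this equals $\frac{w_{I^\star} w_x}{2(w_{I^\star} + w_x)}(\mu_{I^\star} - \mu_x)^2$, the familiar transportation cost for best-arm identification.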
EC.2.2 The $C_x$ Function
The task of identifying the correct answer can be formulated as a sequential hypothesis testing problem, which can be addressed using the Sequential Generalized Likelihood Ratio (SGLR) test (Kaufmann et al. 2016, Kaufmann and Koolen 2021). The SGLR statistic is defined to test a potentially composite null hypothesis $H_0: (\boldsymbol{\mu} \in \Omega_0)$ against a potentially composite alternative hypothesis $H_1: (\boldsymbol{\mu} \in \Omega_1)$, and is given by

$$\mathrm{SGLR}_t = \frac{\sup_{\boldsymbol{\lambda} \in \Omega_0 \cup \Omega_1} L(X_1, X_2, \ldots, X_t; \boldsymbol{\lambda})}{\sup_{\boldsymbol{\lambda} \in \Omega_0} L(X_1, X_2, \ldots, X_t; \boldsymbol{\lambda})}, \tag{EC.2.3}$$

where $X_1, X_2, \ldots, X_t$ are the observed values from arm pulls, and $L(\cdot)$ denotes the likelihood function based on these observations and an unknown parameter $\boldsymbol{\lambda}$. The set $\Omega_0$ corresponds to the restricted parameter space under the null hypothesis, while $\Omega_0 \cup \Omega_1$ defines the full parameter space under consideration, encompassing both the null and alternative hypotheses. These sets correspond to the alternative regions introduced in the previous section. A large value of $\mathrm{SGLR}_t$ indicates stronger evidence against the null hypothesis and supports rejecting it.
We consider distributions from a single-parameter exponential family parameterized by their means, following the formulation in Garivier and Kaufmann (2016). This family includes the Bernoulli, Poisson, and Gamma distributions with known shape parameters, as well as the Gaussian distribution with known variance. For each culprit $x \in \mathcal{X}(\hat{\boldsymbol{\mu}})$, where $\hat{\boldsymbol{\mu}}$ is the empirical mean based on observed data, we test the hypotheses $H_{0,x}: \boldsymbol{\mu} \in \mathrm{Alt}_x(\hat{\boldsymbol{\mu}})$ versus $H_{1,x}: \boldsymbol{\mu} \notin \mathrm{Alt}_x(\hat{\boldsymbol{\mu}})$. When $\hat{\boldsymbol{\mu}}(t) \in \Omega_0 \cup \Omega_1$, the generalized likelihood ratio statistic in equation (EC.2.3) can be expressed in terms of a self-normalized sum. This leads to a formal expression of the SGLR statistic in Proposition EC.1, derived through maximum likelihood estimation and a reformulation of the KL divergence.
Proposition EC.1 (Kaufmann and Koolen (2021))
The generalized likelihood ratio statistic for each culprit $x \in \mathcal{X}(\boldsymbol{\mu})$ at time step $t$ is given by

$$\hat{\Lambda}_{t,x} = \ln(\mathrm{SGLR}_t) = \inf_{\boldsymbol{\lambda} \in \mathrm{Alt}_x(\hat{\boldsymbol{\mu}}(t))} \sum_{i \in [K]} N_i(t)\, \mathrm{KL}\big(\hat{\mu}_i(t), \lambda_i\big), \tag{EC.2.4}$$

where $\mathrm{KL}(\cdot, \cdot)$ represents the KL divergence between the two distributions parameterized by their means, and $N_i(t) = t\, w_i$ is the expected number of observations allocated to arm $i \in [K]$ up to time $t$.
This proposition links the SGLR test to information-theoretic methods. To quantify the information and confidence required to assert that the true mean does not lie in $\mathrm{Alt}_x$ for all $x \in \mathcal{X}$, we define the $C_x$ function as the population version of the SGLR statistic, sharing the same form as equation (EC.2.4):

$$C_x(\boldsymbol{w}) = C_x(\boldsymbol{w}; \boldsymbol{\mu}) \triangleq \inf_{\boldsymbol{\lambda} \in \mathrm{Alt}_x} \sum_{i \in [K]} w_i\, \mathrm{KL}(\mu_i, \lambda_i). \tag{EC.2.5}$$
With the introduction of culprits and the ๐ถ ๐ฅ function, we arrive at the optimal allocation problem that defines the lower bound stated in Proposition 3.1. However, computing the lower bound can still be hard since it requires the solution of the minimax problem in equation (18). While the KL divergence in equation (EC.2.5) is convex for Gaussians, it can be non-convex to minimize the ๐ถ ๐ฅ function over the culprit set ๐ณ โ ( ๐ ) . To solve this problem, based on the following Proposition EC.2, we can write ๐ณ โ ( ๐ ) as a union of several convex sets. The following three equivalent expressions represent different ways of describing the lower bound, making the minimax problem in equation (18) tractable for every ๐ฅ โ ๐ณ .
$$\Gamma^*_{\mu} \coloneqq \max_{\omega \in \mathcal{S}_K} \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_{a \in [K]} \omega_a\,\mathrm{KL}(\mu_a, \lambda_a) = \max_{\omega \in \mathcal{S}_K} \min_{x \in \mathcal{X}} \inf_{\lambda \in \mathrm{Alt}_x} \sum_{a \in [K]} \omega_a\,\mathrm{KL}(\mu_a, \lambda_a) \tag{EC.2.6}$$
$$= \max_{\omega \in \mathcal{S}_K} \min_{x \in \mathcal{X}} \sum_{a \in [K]} \omega_a\,\mathrm{KL}(\mu_a, \lambda^x_a) \tag{EC.2.7}$$
$$= \max_{\omega \in \mathcal{S}_K} \min_{x \in \mathcal{X}} C_x(\omega), \tag{EC.2.8}$$
where we utilized the existence of a finite union decomposition and a unique minimizer $\lambda^x$ from Proposition EC.2.
Proposition EC.2
Assume that the distribution of each arm belongs to a canonical single-parameter exponential family, parameterized by its mean. Then, for each culprit $x \in \mathcal{X}(\mu)$:
1. (Wang et al. 2021) For each problem instance $\mu \in \mathcal{M}$, the alternative set $\mathrm{Alt}(\mu)$ is a finite union of convex sets. Namely, there exists a finite collection of convex sets $\{\mathrm{Alt}_x(\mu) : x \in \mathcal{X}(\mu)\}$ such that $\mathrm{Alt}(\mu) = \bigcup_{x \in \mathcal{X}(\mu)} \mathrm{Alt}_x(\mu)$.
2. Given a specific simplex distribution $\omega$, there exists a unique $\lambda^x \in \mathrm{Alt}_x(\mu)$ that achieves the infimum in equation (EC.2.5).
Proof EC.3
Proof. The proof proceeds in two parts. For the first part, the alternative set for any given culprit $x$ in our setting is given by
$$\mathrm{Alt}_x(\mu) = \mathrm{Alt}_{i,j}(\mu) \cup \mathrm{Alt}_j(\mu), \tag{EC.2.9}$$
where $\mathrm{Alt}_{i,j}(\mu)$ and $\mathrm{Alt}_j(\mu)$ are defined in (EC.4.10) and (EC.4.19), respectively. Each of these two component sets is convex, since each is described by linear constraints on the mean vector: any convex combination of two points in the set satisfies the same constraints and hence remains in the set. Consequently, $\mathrm{Alt}(\mu)$ is a finite union of convex sets.
For the second part, when the reward distribution belongs to a single-parameter exponential family, the KL divergence $\mathrm{KL}(\mu, \mu')$ is continuous and strictly convex. This ensures that the infimum in equation (EC.2.5) over each convex component is achieved by a unique $\lambda$. \Halmos
EC.2.3 Stopping Rule
In this section, we introduce the stopping rule, which determines when the algorithm stops and returns an answer containing all $\epsilon$-best arms with probability at least $1-\delta$. This stopping rule is based on deviation inequalities linked to the generalized likelihood ratio test (Kaufmann and Koolen 2021).
For each $x_m \in \mathcal{X}(\lambda)$ with $m \in [|\mathcal{X}|]$, let $\mathcal{O}_m(\lambda) \coloneqq \mathrm{Alt}_{x_m}(\lambda)$ denote an element of a partition of the realizable parameter space $\mathcal{O}$ introduced in Section EC.2.1, where each $\mathcal{O}_m(\lambda)$ is associated with a distinct culprit in $\mathcal{X}(\lambda)$. This implies that for any $\lambda \in \mathcal{O}$, the parameter space $\mathcal{O}$ can be uniquely partitioned to support a hypothesis testing framework. Let $\mathcal{O}_0(\lambda)$ denote the subset of parameters where $\lambda$ resides. If $\mu \in \mathcal{O}$, define $m^*(\mu)$ as the index of the unique element of the partition to which the true mean vector $\mu$ belongs; here $m^*(\mu) = 0$. In other words, we have $\mu \in \mathcal{O}_0$ and $\mathrm{Alt}(\mu) = \mathcal{O} \setminus \mathcal{O}_0$. Since the ordering among suboptimal arms is irrelevant, the sets $\mathcal{O}_m(\lambda)$ for $m \in \{0, 1, 2, \ldots, |\mathcal{X}|\}$ form a valid partition of $\mathcal{O}$ for each $\lambda$. Accordingly, the alternative set can be further written as
$$\mathrm{Alt}(\mu) = \bigcup_{m \,:\, \mu \notin \mathcal{O}_m(\mu)} \mathcal{O}_m(\mu) = \mathcal{O} \setminus \mathcal{O}_{m^*(\mu)} = \mathcal{O} \setminus \mathcal{O}_0. \tag{EC.2.10}$$
Given a bandit instance $\mu$, we consider a total of $|\mathcal{X}(\mu)| + 1$ hypotheses, defined as
$$H_0 : \big(\mu \in \mathcal{O}_0(\mu)\big),\quad H_1 : \big(\mu \in \mathcal{O}_1(\mu)\big),\ \ldots,\ H_{|\mathcal{X}|} : \big(\mu \in \mathcal{O}_{|\mathcal{X}|}(\mu)\big). \tag{EC.2.11}$$
By substituting the true mean vector $\mu$ with its empirical estimate $\hat{\mu}(t)$, the SGLR test becomes data-dependent, relying on the empirical means at each time step $t$. Consequently, the hypotheses tested at time $t$ are also data-dependent. If $\hat{\mu}(t) \in \mathcal{O}$, we define $\hat{m}(t) \coloneqq m^*(\hat{\mu}(t))$ as the index of the partition element to which $\hat{\mu}(t)$ belongs, that is, $\hat{\mu}(t) \in \mathcal{O}_{\hat{m}(t)}$. If instead $\hat{\mu}(t) \notin \mathcal{O}$, we set $\hat{\Lambda}_{t,x} = 0$ for all culprits $x$, meaning no hypothesis test is conducted at that time step, and the process continues. In practice, when $\hat{\mu}(t) \notin \mathcal{O}$, the algorithm can revert to uniform exploration. Since the true mean vector $\mu \in \mathcal{O}$ and $\mathcal{O}$ is assumed to be an open set, the law of large numbers guarantees that $\hat{\mu}(t)$ will eventually re-enter the parameter space, i.e., $\hat{\mu}(t) \in \mathcal{O}$ after sufficiently many samples.
We run | ๐ณ | time-varying SGLR tests in parallel, each testing ๐ป 0 against ๐ป ๐ for ๐ โ [ | ๐ณ โ ( ๐ ^ โ ( ๐ก ) ) | ] . The procedure stops when any of these tests rejects ๐ป 0 , indicating that the corresponding alternative set is empirically the easiest to reject. At this point, the accepted hypothesis for ๐ ^ โ ( ๐ก ) โ ๐ is identified as the most likely to be correct. Given a sequence of exploration rates ( ๐ฝ ^ ๐ก โ ( ๐ฟ ) ) ๐ก โ โ , the SGLR stopping rule in the pure exploration setting is defined as follows:
$$\tau_\delta \coloneqq \inf\Big\{t \in \mathbb{N} : \min_{x \in \mathcal{X}(\hat{\mu}(t))} \hat{\Lambda}_{t,x} > \hat{\beta}_t(\delta)\Big\} = \inf\Big\{t \in \mathbb{N} : t \cdot \min_{x \in \mathcal{X}(\hat{\mu}(t))} C_x\big(\omega_t; \hat{\mu}(t)\big) > \hat{\beta}_t(\delta)\Big\}, \tag{EC.2.12}$$
where the SGLR statistic ฮ ^ ๐ก , ๐ฅ is defined in equation (EC.2.4). The testing process closely resembles the classical approach, except that the hypotheses are data-dependent and evolve over time.
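The stopping check itself is a one-line comparison once the per-culprit statistics are available. A minimal sketch follows; the threshold function here is a stylized exploration rate of the form $\log((1+\log t)/\delta)$, and the paper's calibrated choice of $\hat{\beta}_t(\delta)$ may differ.

```python
import numpy as np

def threshold(t, delta):
    # A stylized exploration rate beta_t(delta) = log((1 + log t) / delta);
    # an assumption for illustration, not the paper's calibrated constant.
    return np.log((1.0 + np.log(t)) / delta)

def should_stop(glr_by_culprit, beta):
    # Stopping rule (EC.2.12): stop once every culprit's GLR statistic
    # clears the threshold, i.e. once the smallest one does.
    return min(glr_by_culprit) > beta

glr = [4.1, 5.7, 9.3]    # illustrative per-culprit statistics at time t = 100
stop = should_stop(glr, threshold(100, 0.05))
```

With these numbers the smallest statistic has not yet cleared the threshold, so sampling continues.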
We also provide insights from the perspective of the confidence region. Specifically, the event $\{\min_{x \in \mathcal{X}(\hat{\mu}(t))} \hat{\Lambda}_{t,x} > \hat{\beta}_t(\delta)\}$ coincides with $\{\mathcal{D}_{t,\delta} \subseteq \mathcal{O}_{\hat{m}(t)}\}$, where $\mathcal{D}_{t,\delta}$ denotes the confidence region of the mean vector, given by
$$\mathcal{D}_{t,\delta} \coloneqq \Big\{\lambda : \sum_{a=1}^{K} N_a(t)\,\mathrm{KL}\big(\hat{\mu}_a(t), \lambda_a\big) \le \hat{\beta}_t(\delta)\Big\}. \tag{EC.2.13}$$
Notably, although the confidence region ๐ ๐ก , ๐ฟ is defined for the mean vector ๐ , under the assumption of linear structure, it is equivalent to the confidence region for the parameter ๐ฝ , as discussed in Sections 2.2 and EC.4.1. This equivalence allows the stopping rule to be interpreted as follows: the algorithm halts once the confidence region for the mean vector fully lies within a single partition region, aligning with the graphical interpretation of optimal allocation given in Section EC.4.2.
Appendix EC.3 Difference between G-Optimal Design and $\mathcal{XY}$-Optimal Design
Figure EC.3.1: Key Distinction Between G-Optimal and $\mathcal{XY}$-Optimal Designs in Terms of Stopping Criteria
Note. The contraction behavior and rate of the confidence region for the parameter $\hat{\theta}_t$ differ: under G-optimal sampling (left), the region shrinks uniformly in all directions, whereas under $\mathcal{XY}$-optimal design (right), it contracts more strategically along directions critical for classification, allowing the confidence region to enter a decision region more rapidly and trigger earlier stopping.
Returning to the intuition and visual explanation in Section EC.4.1 and Figure EC.4.1, we see that G-optimal sampling inevitably leads to inefficient sampling. Figure EC.3.1 illustrates the advantage of adopting the $\mathcal{XY}$-optimal design from the perspective of the stopping rule: the algorithm terminates once the yellow confidence region for $\hat{\theta}$ enters one of the decision regions $\mathcal{S}_i$ ($i = 1, \ldots, 7$). Unlike the isotropic shrinkage of the confidence region under G-optimal design, the $\mathcal{XY}$-optimal design guides the region to contract more aggressively in directions critical for distinguishing arms. Rather than uniformly estimating $\theta$, it prioritizes reducing uncertainty along directions that matter most for classification, leading to more efficient exploration.
Appendix EC.4 Lower Bound for All $\epsilon$-Best Arms Identification in Linear Bandits
This section provides both geometric insights regarding the stopping condition and formal proofs establishing the lower bound for identifying all ๐ -best arms in linear bandit settings.
EC.4.1 Visual Illustration of the Stopping Condition
Figure EC.4.1: Visual Illustration of Identifying the Best Arm vs. Identifying All $\epsilon$-Best Arms.
Note. (a) Stopping occurs when the confidence region $\mathcal{D}_{t,\delta}$ for the estimated parameter $\hat{\theta}_t$ contracts entirely within one of the three decision regions $\mathcal{C}_i$ at a certain time step $t$. The boundaries between regions are defined by the hyperplanes $\theta^\top(a_i - a_j) = 0$. Each dot represents an arm. (b) In the case of identifying all $\epsilon$-best arms, the regions overlap. (c) Due to these overlaps, the space is partitioned into seven distinct decision regions, increasing the difficulty of identification.
The stopping condition is formulated as a hypothesis test conducted as data is collected, which can be interpreted as the process of the parameter confidence region contracting into one of the decision regions (i.e., the set of parameters that yield the same decision). A more detailed version of Figure 1 is shown in Figure EC.4.1, which illustrates the core idea of the stopping condition for identifying all ๐ -best arms in linear bandits. The key distinction between identifying all ๐ -best arms and identifying the single best arm lies in how the decision regions are partitioned.
Figure EC.4.1(a) illustrates the best arm identification process in linear bandits, where $i^* = i^*(\mu)$ represents the arm with the largest mean value for each bandit instance $\mu$. Let $\mathcal{C}_i \coloneqq \{\theta \in \mathbb{R}^d \mid a_i = a^*\}$ be the set of parameters $\theta$ for which $a_i$ ($i = 1, 2, 3$) is the optimal arm. Each $\mathcal{C}_i$ forms a cone defined by the intersection of half-spaces.
Figure EC.4.1(b) represents an intermediate step, demonstrating the transition from best arm identification to identifying all $\epsilon$-best arms. Let $\mathcal{C}_i^\epsilon \coloneqq \{\theta \in \mathbb{R}^d \mid i \in G_\epsilon(\theta)\}$ be the set of parameters $\theta$ that include arm $i$ in the set $G_\epsilon(\theta)$; $\mathcal{C}_i^\epsilon$ is similarly defined by an intersection of half-spaces. The overlap of these three regions forms the decision regions $\mathcal{S}_i$ ($i = 1, 2, \ldots, 7$) in Figure EC.4.1(c), which correspond to the seven distinct types of $\epsilon$-best arm sets. Moreover, the BAI process in (a) is a special case of the $\epsilon$-best arms identification in (c), occurring when the gap $\epsilon$ approaches 0. The following statement provides a detailed explanation of how the decision regions in Figure EC.4.1(c) are constructed.
In a $d$-dimensional Euclidean space $\mathbb{R}^d$, hyperplanes $P_{i,j}$ can be defined for any pair of arms, partitioning the space into the following half-spaces:
$$H_{i,j}^{+} \coloneqq \{\theta \in \mathbb{R}^d \mid (a_i - a_j)^\top \theta > \epsilon\}, \tag{EC.4.1}$$
$$H_{i,j}^{-} \coloneqq \{\theta \in \mathbb{R}^d \mid (a_i - a_j)^\top \theta \le \epsilon\}. \tag{EC.4.2}$$
The hyperplane $P_{i,j}$, which separates the half-spaces $H_{i,j}^{+}$ and $H_{i,j}^{-}$, is perpendicular to the direction vector $a_i - a_j$. The intersections of these half-spaces over $i, j \in [K]$, $i \ne j$, partition the space into distinct regions. Each region corresponds to a solution set, representing all $\epsilon$-best arms if the true parameter $\theta$ lies within that region. As the gap $\epsilon$ approaches 0, the hyperplanes $P_{i,j}$ on both sides of the decision boundaries move closer together, causing some decision regions in Figure EC.4.1(c) to shrink until they vanish. The relationship between the true parameter $\theta$ and the half-spaces determines which arms belong to the good set $G_\epsilon(\theta)$.
For the case of three arms, the space is divided into three overlapping regions, as shown in Figure EC.4.1(b). These regions further generate seven ($2^3 - 1 = 7$) decision regions, denoted $\mathcal{S}_1$ through $\mathcal{S}_7$, which are summarized in Table EC.4.1.
Table EC.4.1: Three Overlapping Regions and Seven Decision Regions in the Case of Three Arms

| Decision Region | Set Expression | $\epsilon$-Best Arms |
| --- | --- | --- |
| $\mathcal{S}_1$ | $\mathcal{C}_1^\epsilon \cap (\mathcal{C}_2^\epsilon)^c \cap (\mathcal{C}_3^\epsilon)^c$ | $\{1\}$ |
| $\mathcal{S}_2$ | $\mathcal{C}_1^\epsilon \cap (\mathcal{C}_2^\epsilon)^c \cap \mathcal{C}_3^\epsilon$ | $\{1, 3\}$ |
| $\mathcal{S}_3$ | $(\mathcal{C}_1^\epsilon)^c \cap (\mathcal{C}_2^\epsilon)^c \cap \mathcal{C}_3^\epsilon$ | $\{3\}$ |
| $\mathcal{S}_4$ | $(\mathcal{C}_1^\epsilon)^c \cap \mathcal{C}_2^\epsilon \cap \mathcal{C}_3^\epsilon$ | $\{2, 3\}$ |
| $\mathcal{S}_5$ | $(\mathcal{C}_1^\epsilon)^c \cap \mathcal{C}_2^\epsilon \cap (\mathcal{C}_3^\epsilon)^c$ | $\{2\}$ |
| $\mathcal{S}_6$ | $\mathcal{C}_1^\epsilon \cap \mathcal{C}_2^\epsilon \cap (\mathcal{C}_3^\epsilon)^c$ | $\{1, 2\}$ |
| $\mathcal{S}_7$ | $\mathcal{C}_1^\epsilon \cap \mathcal{C}_2^\epsilon \cap \mathcal{C}_3^\epsilon$ | $\{1, 2, 3\}$ |
The stopping condition verifies whether the confidence region ๐ ๐ก , ๐ฟ is entirely contained within a specific decision region ๐ ๐ . It is important to note that, due to the definition of the confidence region and the property that ๐ฝ ^ ๐ก โ ๐ฝ as ๐ก โ โ , any algorithm that continually samples all arms will eventually meet the stopping condition.
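Which decision region $\mathcal{S}_m$ of Table EC.4.1 the parameter falls in is determined entirely by which arms end up in $G_\epsilon(\theta)$. A small sketch (the arm vectors and $\theta$ are illustrative choices of ours):

```python
import numpy as np

def eps_best_set(theta, arms, eps):
    # G_eps(theta): indices (1-based, to match the table) of arms whose
    # mean a_i^T theta is within eps of the best mean.
    means = [a @ theta for a in arms]
    best = max(means)
    return {i + 1 for i, m in enumerate(means) if m >= best - eps}

# Three arms in R^2; the region of Table EC.4.1 follows from the set below.
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
theta = np.array([1.0, 0.95])
G = eps_best_set(theta, arms, eps=0.5)   # all three arms: region S_7
```

Shrinking $\epsilon$ moves the same $\theta$ into a smaller region, mirroring how the overlapping regions in Figure EC.4.1(c) collapse as $\epsilon \to 0$.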
EC.4.2 Optimal Allocation
The goal of the sampling policy is to construct an allocation sequence that drives the confidence set $\mathcal{D}_{t,\delta}$ into the optimal region $\mathcal{S}^*$ as efficiently as possible. Geometrically, this entails selecting arms that cause $\mathcal{D}_{t,\delta}$ to contract into the optimal cone $\mathcal{S}^*$ with minimal sampling effort. The condition $\mathcal{D}_{t,\delta} \subseteq \mathcal{S}^*$ can be expressed as
$$\text{For all } i \in G_\epsilon(\theta),\ j \ne i,\ j \notin G_\epsilon(\theta),\ \text{and all } \lambda \in \mathcal{D}_{t,\delta},\ \text{we have } \lambda \in H_{j,i}^{-}\ \text{and}\ \lambda \in H_{1,j}^{+}. \tag{EC.4.3}$$
In words, every parameter vector ๐ that remains plausible must preserve all required pairwise orderings: no ๐ -optimal arm ๐ can be overtaken by any rival ๐ , and the best arm 1 must stay ahead of every suboptimal arm ๐ . Equivalently, no ๐ โ ๐ ๐ก , ๐ฟ is allowed to flip these comparisons.
The relationships $\lambda \in H_{j,i}^{-}$ and $\lambda \in H_{1,j}^{+}$ are equivalent to the following inequalities, obtained by adding terms to both sides of the inequalities defining (EC.4.1) and (EC.4.2) and reorganizing:
$$\begin{cases} (a_i - a_j)^\top(\theta - \lambda) \le (a_i - a_j)^\top \theta + \epsilon \\ (a_1 - a_j)^\top(\theta - \lambda) < (a_1 - a_j)^\top \theta - \epsilon. \end{cases} \tag{EC.4.4}$$
Now, we focus on the confidence region, which can be constructed following the Cauchy–Schwarz inequality and the definition of the confidence ellipse for the parameter $\theta$ in equation (10):
$$\mathcal{D}_{t,\delta} = \Big\{\lambda \in \mathbb{R}^d \;\Big|\; \forall i \in G_\epsilon(\theta),\ j \ne i,\ j \notin G_\epsilon(\theta):\ \begin{cases}(a_i - a_j)^\top(\theta - \lambda) \le \|a_i - a_j\|_{J_t^{-1}} B_{t,\delta}\\ (a_1 - a_j)^\top(\theta - \lambda) \le \|a_1 - a_j\|_{J_t^{-1}} B_{t,\delta}\end{cases}\Big\}, \tag{EC.4.5}$$
where $J_t$ is the information matrix as defined in equation (9), and the confidence bound for the parameter $\theta$, i.e., $B_{t,\delta}$, can either be the fixed confidence bound shown in Proposition 2.3 or the looser adaptive confidence bound introduced in Abbasi-Yadkori et al. (2011). The stopping condition $\mathcal{D}_{t,\delta} \subseteq \mathcal{S}^*$ can thus be reformulated: for each $i \in G_\epsilon(\theta)$, $j \ne i$, $j \notin G_\epsilon(\theta)$, we have
$$\begin{cases} \|a_i - a_j\|_{J_t^{-1}} B_{t,\delta} \le (a_i - a_j)^\top \theta + \epsilon \\ \|a_1 - a_j\|_{J_t^{-1}} B_{t,\delta} \le (a_1 - a_j)^\top \theta - \epsilon. \end{cases} \tag{EC.4.6}$$
If equation (EC.4.6) holds, then for any $\lambda \in \mathcal{D}_{t,\delta}$, equation (EC.4.4) also holds, implying that $\mathcal{D}_{t,\delta} \subseteq \mathcal{S}^*$. Given that $(a_i - a_j)^\top \theta + \epsilon > 0$ and $(a_1 - a_j)^\top \theta - \epsilon > 0$, and rearranging (EC.4.6), the oracle allocation strategy is determined as follows:
$$\{X_{A_n}\}^* \in \arg\min_{\{X_{A_n}\}} \max_{i \in G_\epsilon(\theta),\, j \ne i,\, j \notin G_\epsilon(\theta)} \max\left\{ \frac{2\|a_i - a_j\|^2_{J_t^{-1}}}{\big(a_i^\top \theta - a_j^\top \theta + \epsilon\big)^2},\ \frac{2\|a_1 - a_j\|^2_{J_t^{-1}}}{\big(a_1^\top \theta - a_j^\top \theta - \epsilon\big)^2} \right\}, \tag{EC.4.7}$$
where $\{X_{A_n}\} = (X_{A_1}, X_{A_2}, \ldots, X_{A_n}) \in \mathcal{A}^n$ is a sequence of sampled arms and $\{X_{A_n}\}^*$ is the oracle allocation strategy. However, it is more convenient to express the sample complexity of the problem in terms of the continuous allocation proportion $\omega$ instead of the discrete allocation sequence $\{X_{A_n}\}$. We then have the following optimal allocation proportion:
$$\omega^* \in \arg\min_{\omega \in \mathcal{S}_K} \max_{i \in G_\epsilon(\theta),\, j \ne i,\, j \notin G_\epsilon(\theta)} \max\left\{ \frac{2\|a_i - a_j\|^2_{J_\omega^{-1}}}{\big(a_i^\top \theta - a_j^\top \theta + \epsilon\big)^2},\ \frac{2\|a_1 - a_j\|^2_{J_\omega^{-1}}}{\big(a_1^\top \theta - a_j^\top \theta - \epsilon\big)^2} \right\}, \tag{EC.4.8}$$
where $\mathcal{S}_K$ denotes the $K$-dimensional probability simplex, and $J_\omega = \sum_{a=1}^{K} \omega_a\, a a^\top$ is the weighted information matrix, analogous to the Fisher information matrix (Chaloner and Verdinelli 1995). The intuition behind this optimal allocation strategy is as follows: to satisfy the inequality in (EC.4.6) as quickly as possible, the ratio of the left-hand side to the right-hand side should be minimized, leading to the outer minimization over the allocation probability $\omega$. The middle maximization accounts for the fact that the inequalities in (EC.4.6) must hold for each possible triple $(i, j, k)$, which specifies the culprit set $\mathcal{X} = \mathcal{X}(\mu)$ (see Proposition 3.1 for a formal definition). Furthermore, since both inequalities in (EC.4.6) need to be satisfied, we take the inner maximization to enforce this requirement.
EC.4.3 Proof of the Lower Bound in Theorem 3.2
Proof EC.1
Proof. Deriving the lower bound requires constructing the largest possible alternative set and identifying the most challenging instance for distinguishing the arms. Specifically, to construct an alternative problem instance that possesses a distinct set of $\epsilon$-best arms from the original problem instance $\mu$, we systematically perturb the expected rewards of specific arms. This can be achieved through two constructions, as further detailed in Figure EC.4.2:
- Lowering an $\epsilon$-best arm: This approach decreases the expected reward of a currently $\epsilon$-best arm $i$, while simultaneously increasing the expected reward of another arm $j$.
- Elevating a non-$\epsilon$-best arm: This approach increases the expected reward of a non-$\epsilon$-best arm $k$, while concurrently reducing the expected rewards of the $\ell$ top-performing arms.
Figure EC.4.2: Illustration of Constructing the Alternative Set for the All $\epsilon$-Best Arms Identification Lower Bound
Note. (a) Transforming an arm within $G_\epsilon(\mu)$ into a non-$\epsilon$-best arm by decreasing the mean of an $\epsilon$-best arm $i$ while increasing the mean of another arm $j$. (b) Transforming an arm outside $G_\epsilon(\mu)$ into an $\epsilon$-best arm by increasing the mean of a non-$\epsilon$-best arm $k$ and simultaneously decreasing the means of higher-performing arms.
For all $\epsilon$-best arms identification in linear bandits, we establish the correct answer, the culprit set, and the alternative set as follows. The general idea of the proof is to build these alternatives explicitly and show which one is the hardest to distinguish. Recall that $\mu_i = \langle \theta, a_i \rangle$. For notational simplicity, we use $\mu$ to represent the mean vector induced by the true parameter $\theta$.
- The correct answer $G_\epsilon(\theta) \coloneqq \{i : \langle \theta, a_i \rangle \ge \max_j \langle \theta, a_j \rangle - \epsilon\}$ for all $i \in [K]$. These are the $\epsilon$-best arms under the true parameter.
- The culprit set $\mathcal{X}(\mu) \coloneqq \{(i, j, k, \ell) : i \in G_\epsilon(\mu),\ j \ne i,\ k \notin G_\epsilon(\mu),\ \ell \in \{1, 2, \ldots, K-1\}\}$. The culprit set identifies potential sources of error. Intuitively, each tuple corresponds to the two types of mistakes that can affect the output, as mentioned above.
- The alternative set $\mathrm{Alt}(\mu) \coloneqq \bigcup_{x \in \mathcal{X}(\mu)} \mathrm{Alt}_x(\mu)$, where the form of $\mathrm{Alt}_x(\mu)$ is given by
$$\mathrm{Alt}_x(\mu) = \mathrm{Alt}_{i,j}(\mu) \cup \mathrm{Alt}_{k,\ell}(\mu) \quad \text{for all } x \in \mathcal{X}(\mu). \tag{EC.4.9}$$
So for each culprit $x$, we build two kinds of alternative instances, each capable of flipping the answer. The two parts of the alternative set can be expressed as
$$\mathrm{Alt}_{i,j}(\mu) \coloneqq \{\lambda : \langle \lambda, a_i - a_j \rangle < -\epsilon\} \quad \text{for all } x \in \mathcal{X}(\mu) \tag{EC.4.10}$$
and
$$\mathrm{Alt}_{k,\ell}(\mu) \coloneqq \{\lambda : \lambda^\top a_1 = \cdots = \lambda^\top a_\ell = \lambda^\top a_k + \epsilon \ge \lambda^\top a_{\ell+1}\} \quad \text{for all } x \in \mathcal{X}(\mu), \tag{EC.4.11}$$
which are the same as shown in the figure.
The first part $\mathrm{Alt}_{i,j}(\mu)$ forces a good arm $i$ to become at least $\epsilon$ worse than arm $j$, causing $i$ to fall out of the $\epsilon$-best set. In contrast, the second part ensures that a suboptimal arm is no more than $\epsilon$ worse than the top $\ell$ arms with identical mean values, thereby including it in the $\epsilon$-best set.
For a given culprit $x = (i, j, k, \ell)$, the first component of the alternative set can be written as
$$\mathrm{Alt}_{i,j}(\mu) = \Big\{\lambda_{i,j}(\theta, \omega, \alpha) \;\Big|\; \lambda_{i,j}(\theta, \omega, \alpha) = \theta - \frac{y_{i,j}^\top \theta + \epsilon + \alpha}{\|y_{i,j}\|^2_{J_\omega^{-1}}}\, J_\omega^{-1} y_{i,j} \Big\}, \tag{EC.4.12}$$
where $J_\omega = \sum_{a=1}^{K} \omega_a\, a a^\top$ is related to the Fisher information matrix (Chaloner and Verdinelli 1995), $y_{i,j} = a_i - a_j$, and $\alpha > 0$ is a perturbation parameter. The form of this alternative set is derived from the solution to the following optimization problem:
$$\arg\min_{\lambda \in \mathbb{R}^d} \ \|\lambda - \theta\|^2_{J_\omega} \tag{EC.4.13}$$
$$\text{s.t.}\quad y_{i,j}^\top \lambda = -\epsilon - \alpha. \tag{EC.4.14}$$
Here, $\alpha$ is introduced to construct the specific alternative set, providing an explicit expression for the alternative parameter $\lambda$. By letting $\alpha \to 0$, we realize the infimum in equations (EC.2.5) and (EC.2.6). Then, we have
$$y_{i,j}^\top \lambda_{i,j}(\theta, \omega, \alpha) = -\epsilon - \alpha < -\epsilon, \tag{EC.4.15}$$
which satisfies the condition for being a parameter in the alternative set defined in equation (EC.4.10). Under the Gaussian distribution assumption, i.e., $r_a \sim \mathcal{N}(a^\top \theta, 1)$, the KL divergence between the mean value associated with the true parameter $\theta$ and that associated with the alternative parameter $\lambda_{i,j}$ is
$$\mathrm{KL}\big(a^\top \theta,\ a^\top \lambda_{i,j}\big) = \frac{\big(a^\top(\theta - \lambda_{i,j}(\theta, \omega, \alpha))\big)^2}{2} = \frac{\big(y_{i,j}^\top \theta + \epsilon + \alpha\big)^2}{2\,\big(\|y_{i,j}\|^2_{J_\omega^{-1}}\big)^2}\; y_{i,j}^\top J_\omega^{-1}\, a a^\top\, J_\omega^{-1} y_{i,j}. \tag{EC.4.16}$$
The last equation follows from substituting the expression for ๐ ๐ , ๐ given in (EC.4.12). Then, by Proposition 3.1 and the definition of the ๐ถ ๐ฅ function in equation (EC.2.5), the lower bound can be expressed as
$$\frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/2.4\delta)} \ge \min_{\omega \in \mathcal{S}_K} \max_{\lambda \in \mathrm{Alt}(\mu)} \frac{1}{\sum_{a=1}^{K} \omega_a\,\mathrm{KL}\big(a^\top \theta,\ a^\top \lambda\big)} \ge \min_{\omega \in \mathcal{S}_K} \max_{x \in \mathcal{X}} \sup_{\alpha > 0} \frac{1}{\sum_{a=1}^{K} \omega_a\,\mathrm{KL}\big(a^\top \theta,\ a^\top \lambda_{i,j}(\theta, \omega, \alpha)\big)} = \min_{\omega \in \mathcal{S}_K} \max_{x \in \mathcal{X}} \frac{2\|y_{i,j}\|^2_{J_\omega^{-1}}}{\big(y_{i,j}^\top \theta + \epsilon\big)^2}. \tag{EC.4.17}$$
Note that letting $\alpha \to 0$ establishes the result by realizing the supremum over $\alpha > 0$. From this lower bound, we can also define the $C_x$ function for all $\epsilon$-best arms identification in linear bandits, which is
$$C_{i,j}(\omega) = \frac{\big(y_{i,j}^\top \theta + \epsilon\big)^2}{2\|y_{i,j}\|^2_{J_\omega^{-1}}}. \tag{EC.4.18}$$
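The closed form (EC.4.18) is straightforward to evaluate numerically. A small sketch with illustrative two-dimensional arms follows; the tiny ridge added so $J_\omega$ is invertible is our own device.

```python
import numpy as np

def c_ij(omega, arms, theta, i, j, eps):
    # Transportation cost (EC.4.18): the information needed to push the
    # good arm i at least eps below arm j under allocation omega.
    d = arms.shape[1]
    J = sum(w * np.outer(a, a) for w, a in zip(omega, arms)) + 1e-10 * np.eye(d)
    y = arms[i] - arms[j]
    norm_sq = y @ np.linalg.solve(J, y)   # ||y||^2 in the J^{-1} norm
    return (y @ theta + eps) ** 2 / (2 * norm_sq)

arms = np.array([[1.0, 0.0], [0.0, 1.0]])
omega = np.array([0.5, 0.5])
theta = np.array([1.0, 0.2])
cost = c_ij(omega, arms, theta, i=0, j=1, eps=0.1)
```

Larger values of $C_{i,j}(\omega)$ mean the culprit is easier to rule out under that allocation.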
For the second part, i.e., Alt ๐ , โ โ ( ๐ ) , we assume that the mean value of arm 1 (i.e., the best arm) remains fixed. This assumption partially sacrifices the completeness of the alternative set but allows the alternative set to be expressed in an explicit and symmetric form. This form remains tight.
Consequently, the culprit set is redefined as $\mathcal{X}(\mu) \coloneqq \{(i, j, k) : i \in G_\epsilon(\mu),\ j \ne i,\ k \notin G_\epsilon(\mu)\}$, and the alternative set can be decomposed as $\mathrm{Alt}_x(\mu) = \mathrm{Alt}_{i,j}(\mu) \cup \mathrm{Alt}_k(\mu)$ for a given culprit $x = (i, j, k)$. Different from equation (EC.4.11), the second part of the alternative set becomes
$$\mathrm{Alt}_k(\mu) \coloneqq \{\lambda : \langle \lambda, a_1 - a_k \rangle < \epsilon\} \quad \text{for all } x \in \mathcal{X}(\mu). \tag{EC.4.19}$$
Similarly, $\mathrm{Alt}_k(\mu)$ can be constructed as the set in equation (EC.4.12), given by
$$\mathrm{Alt}_k(\mu) = \Big\{\lambda_k(\theta, \omega, \alpha) \;\Big|\; \lambda_k(\theta, \omega, \alpha) = \theta - \frac{y_k^\top \theta - \epsilon + \alpha}{\|y_k\|^2_{J_\omega^{-1}}}\, J_\omega^{-1} y_k \Big\}, \tag{EC.4.20}$$
where $y_k = a_1 - a_k$ and $\alpha > 0$. Then we have
$$y_k^\top \lambda_k(\theta, \omega, \alpha) = \epsilon - \alpha < \epsilon. \tag{EC.4.21}$$
Hence, by following a derivation analogous to that of the first part, we obtain
$$C_k(\omega) = \frac{\big(y_k^\top \theta - \epsilon\big)^2}{2\|y_k\|^2_{J_\omega^{-1}}}. \tag{EC.4.22}$$
Deriving the tightest lower bound focuses on constructing alternative bandit instances that thoroughly challenge each $\epsilon$-best arm configuration. The minimum of the information function $C_x$ over all culprits $x \in \mathcal{X}(\mu)$ is then given by
$$\min\Big\{\min_{x \in \mathcal{X}} C_{i,j}(\omega),\ \min_{x \in \mathcal{X}} C_k(\omega)\Big\}. \tag{EC.4.23}$$
Then, by Proposition 3.1 and the definition of the $C_x$ function, combining equations (EC.4.18) and (EC.4.22), the final lower bound can be expressed as
$$\frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/2.4\delta)} \ge \min_{\omega \in \mathcal{S}_K} \max_{(i,j,k) \in \mathcal{X}} \max\Big\{ \frac{2\|y_{i,j}\|^2_{J_\omega^{-1}}}{\big(y_{i,j}^\top \theta + \epsilon\big)^2},\ \frac{2\|y_k\|^2_{J_\omega^{-1}}}{\big(y_k^\top \theta - \epsilon\big)^2} \Big\} \tag{EC.4.24}$$
$$= \min_{\omega \in \mathcal{S}_K} \max_{(i,j,k) \in \mathcal{X}} \max\Big\{ \frac{2\|a_i - a_j\|^2_{J_\omega^{-1}}}{\big(a_i^\top \theta - a_j^\top \theta + \epsilon\big)^2},\ \frac{2\|a_1 - a_k\|^2_{J_\omega^{-1}}}{\big(a_1^\top \theta - a_k^\top \theta - \epsilon\big)^2} \Big\}. \tag{EC.4.25}$$
\Halmos

Appendix EC.5 Proof of Theorem 4.1
The proof of Theorem 4.1 consists of two parts. First, we establish the upper bound given in equation (EC.5.4). Then, we derive the final refined bound in equation (26), removing the summation in our first result.
To establish the result involving the summation in equation (EC.5.4), we define $r_{\max} \coloneqq \min\{r : G_r = G_\epsilon\}$ as the round in which the last $\epsilon$-best arm is added to $G_r$. We divide the total number of samples into two phases: samples collected up to round $r_{\max}$ and those collected from round $r_{\max} + 1$ until termination (if the algorithm does not stop at $r_{\max}$). The proof proceeds in eight steps, as outlined below.
EC.5.1 Preliminary: Clean Events $\mathcal{E}_1$ and $\mathcal{E}_2$
We begin by defining two high-probability events, denoted $\mathcal{E}_1$ and $\mathcal{E}_2$, which we refer to as clean events:
$$\mathcal{E}_1 \coloneqq \Big\{\forall r \in \mathbb{N},\ \forall i \in \mathcal{A}_I(r-1):\ |\hat{\mu}_i(r) - \mu_i| \le C_{\delta/K}(r)\Big\}. \tag{EC.5.1}$$
This event captures the condition that, for each round ๐ , the estimated means ๐ ^ ๐ โ ( ๐ ) of all arms ๐ in the active set ๐ ๐ผ โ ( ๐ โ 1 ) lie within a confidence bound ๐ถ ๐ฟ / ๐พ โ ( ๐ ) of their true means ๐ ๐ . It ensures uniform control over estimation errors across all arms and rounds with a prescribed confidence level.
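The confidence schedule behind $\mathcal{E}_1$ spends failure probability $\delta/(Kr(r+1))$ per arm and round, which telescopes to at most $\delta$ in total. The check below verifies this with exact rational arithmetic; the function name is ours.

```python
from fractions import Fraction

def failure_budget(K, delta, R):
    # Total probability mass spent by the schedule delta/(K r (r+1)) over
    # K arms and rounds 1..R; sum_{r} 1/(r(r+1)) telescopes to R/(R+1).
    total = Fraction(0)
    d = Fraction(delta)
    for r in range(1, R + 1):
        total += K * d / (K * r * (r + 1))
    return total

# The budget never exceeds delta, and approaches it as R grows.
spent = failure_budget(K=10, delta=Fraction(1, 20), R=1000)
```

This is exactly the telescoping step used in the union bound (EC.5.6) below.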
Then, we introduce the following lemma to provide the confidence region for the estimated parameter $\hat{\theta}$.
Lemma EC.1 (Lattimore and Szepesvári (2020))
Let the confidence level $\delta \in (0, 1)$. For each arm $a \in \mathcal{A}$, we have
$$\mathbb{P}\Big\{|\langle \hat{\theta} - \theta, a \rangle| \ge \sqrt{2\|a\|^2_{J_t^{-1}} \log(2/\delta)}\Big\} \le \delta, \tag{EC.5.2}$$
where $J_t$ represents the information matrix defined in Section 2.2.
Let $\omega_r$ denote the optimal allocation proportion computed for round $r$. The corresponding sampling budget is then determined according to equation (21). Thus, we obtain the information matrix in each round, denoted $J_r$:
$$J_r = \sum_{a \in \mathrm{Supp}(\omega_r)} N_a(r)\, a a^\top \succeq \frac{2}{\epsilon_r^2} \log\Big(\frac{2Kr(r+1)}{\delta}\Big)\, J(\omega_r). \tag{EC.5.3}$$
Then, by applying Lemma EC.1 with $\delta$ replaced by $\delta/(Kr(r+1))$, we obtain that for any arm $a \in \mathcal{A}(r-1)$, with probability at least $1 - \delta/(Kr(r+1))$,
$$|\langle \hat{\theta}_r - \theta, a \rangle| \le \sqrt{2\|a\|^2_{J_r^{-1}} \log\Big(\frac{2Kr(r+1)}{\delta}\Big)} = \sqrt{2\, a^\top J_r^{-1} a\, \log\Big(\frac{2Kr(r+1)}{\delta}\Big)} \le \sqrt{2\, a^\top \Big(\frac{\epsilon_r^2}{2\log(2Kr(r+1)/\delta)}\, J(\omega_r)^{-1}\Big) a\, \log\Big(\frac{2Kr(r+1)}{\delta}\Big)} \le \epsilon_r, \tag{EC.5.4}$$
where the third inequality follows from the matrix inequality in (EC.5.3), using the auxiliary result in Lemma EC.1, and the final inequality is derived from Lemma EC.5.
Thus, by applying the standard result of the G-optimal design in equation (EC.5.1), we define the confidence radius associated with the clean event $\mathcal{E}_1$ as
$$C_{\delta/K}(r) \coloneqq \epsilon_r. \tag{EC.5.5}$$
Then, we have
$$\mathbb{P}(\mathcal{E}_1^c) = \mathbb{P}\Big\{\exists r \in \mathbb{N},\ \exists i \in \mathcal{A}_I(r-1):\ |\hat{\mu}_i(r) - \mu_i| > C_{\delta/K}(r)\Big\} \le \sum_{r=1}^{\infty} \mathbb{P}\Big\{\exists i \in \mathcal{A}_I(r-1):\ |\hat{\mu}_i(r) - \mu_i| > \epsilon_r\Big\} \le \sum_{r=1}^{\infty} \sum_{i=1}^{K} \frac{\delta}{Kr(r+1)} = \delta, \tag{EC.5.6}$$
where the second inequality follows from the union bound, and the third combines the union bound with equation (EC.5.1). Therefore, we obtain
$$\mathbb{P}(\mathcal{E}_1) \ge 1 - \delta. \tag{EC.5.7}$$
Now consider another event, $\mathcal{E}_2$, which characterizes the gaps between different arms, defined as
$$\mathcal{E}_2 \coloneqq \Big\{\forall r \in \mathbb{N},\ \forall i \in G_\epsilon,\ \forall j \in \mathcal{A}_I(r-1):\ |(\hat{\mu}_i(r) - \hat{\mu}_j(r)) - (\mu_i - \mu_j)| \le 2\epsilon_r\Big\}. \tag{EC.5.8}$$
This event ensures the gap between the estimated mean rewards of arms ๐ and ๐ is uniformly close to their true gap for all rounds ๐ , arms ๐ โ ๐บ ๐ , and arms ๐ in the active set ๐ ๐ผ โ ( ๐ โ 1 ) .
By (EC.5.1), for $i, j \in \mathcal{A}_I(r-1)$, we have
$$\mathbb{P}\big\{|(\hat{\mu}_i - \hat{\mu}_j) - (\mu_i - \mu_j)| > 2\epsilon_r \mid \mathcal{E}_1\big\} \le \mathbb{P}\big\{|\hat{\mu}_i - \mu_i| + |\hat{\mu}_j - \mu_j| > 2\epsilon_r \mid \mathcal{E}_1\big\} \le \mathbb{P}\big\{|\hat{\mu}_i - \mu_i| > \epsilon_r \mid \mathcal{E}_1\big\} + \mathbb{P}\big\{|\hat{\mu}_j - \mu_j| > \epsilon_r \mid \mathcal{E}_1\big\} = 0, \tag{EC.5.9}$$
which means
$$\mathbb{P}(\mathcal{E}_2 \mid \mathcal{E}_1) = 1. \tag{EC.5.10}$$

EC.5.2 Step 1: Correctness
Recall that $G_\epsilon$ denotes the true set of $\epsilon$-best arms, and $G_r$ is the empirical good set identified by LinFACT-G in round $r$. Under event $\mathcal{E}_1$, we first show that if there exists a round $r$ such that $G_r \cup B_r = [K]$, then it must be that $G_r = G_\epsilon$. This implies that under the clean event $\mathcal{E}_1$, the stopping condition of LinFACT-G ensures correct identification of $G_\epsilon$.
Lemma EC.2
Under event $\mathcal{E}_1$, we have $G_r \subseteq G_\epsilon$ for all rounds $r \in \mathbb{N}$.
Proof EC.3
Proof. We first show that $1 \in \mathcal{A}_I(r)$ for all $r \in \mathbb{N}$; that is, the best arm is never eliminated from the active set $\mathcal{A}(r-1)$ in any round $r$ on the event $\mathcal{E}_1$. For any arm $i$, we have
$$\hat{\mu}_1(r) + \epsilon_r \ge \mu_1 \ge \mu_i \ge \hat{\mu}_i(r) - \epsilon_r > \hat{\mu}_i(r) - \epsilon_r - \epsilon, \tag{EC.5.11}$$
which implies that $\hat{\mu}_1(r) + \epsilon_r > \max_{j \in \mathcal{A}_I(r-1)} \hat{\mu}_j - \epsilon_r - \epsilon = L_r$ and $\hat{\mu}_1(r) + \epsilon_r \ge \max_{j \in \mathcal{A}_I(r-1)} \hat{\mu}_j(r) - \epsilon_r$. These inequalities confirm that arm 1 will not be removed from the active set $\mathcal{A}_I(r-1)$.
Secondly, we show that at all rounds $r$, $\mu_1 - \epsilon \in [L_r, U_r]$. Since arm 1 never exits $\mathcal{A}_I(r-1)$,
$$U_r = \max_{j \in \mathcal{A}_I(r-1)} \hat{\mu}_j + \epsilon_r - \epsilon \ge \hat{\mu}_1(r) + \epsilon_r - \epsilon \ge \mu_1 - \epsilon, \tag{EC.5.12}$$
and for any arm $j$,
$$\mu_1 - \epsilon \ge \mu_j - \epsilon \ge \hat{\mu}_j - \epsilon_r - \epsilon. \tag{EC.5.13}$$
Hence, taking the maximum over $j \in \mathcal{A}_I(r-1)$, we obtain
$$\mu_1 - \epsilon \ge \max_{j \in \mathcal{A}_I(r-1)} \hat{\mu}_j - \epsilon_r - \epsilon = L_r. \tag{EC.5.14}$$
Next, we show that $G_r \subseteq G_\epsilon$ for all $r \ge 1$. By contradiction, if $G_r \not\subseteq G_\epsilon$, then there exist $r \in \mathbb{N}$ and $i \in G_\epsilon^c \cap G_r$ such that
$$\mu_i \ge \hat{\mu}_i - \epsilon_r \ge U_r \ge \mu_1 - \epsilon > \mu_i, \tag{EC.5.15}$$
which forms a contradiction. \Halmos
Lemma EC.4
Under event $\mathcal{E}_1$, we have $B_r \subseteq G_\epsilon^c$ for all rounds $r \in \mathbb{N}$.
Proof EC.5
Proof. Similarly, we proceed by contradiction. Consider the case that a good arm from $G_\epsilon$ is added to $B_r$ at some round $r$. By definition, $B_0 = \emptyset$ and $B_{r-1} \subseteq B_r$ for all $r$. Then there must exist some $r \in \mathbb{N}$ and an $i \in G_\epsilon$ such that $i \in B_r$ and $i \notin B_{r-1}$. Following line 6 of Algorithm 4.1, this occurs if and only if
$$\max_{j \in \mathcal{A}_I(r-1)} \hat{\mu}_j - \hat{\mu}_i > 2\epsilon_r + \epsilon. \tag{EC.5.16}$$
On the clean event $\mathcal{E}_1$, the above implies that there exists $j \in \mathcal{A}_I(r-1)$ such that
$$\mu_j - \mu_i + 2\epsilon_r \ge \hat{\mu}_j - \hat{\mu}_i > 2\epsilon_r + \epsilon, \tag{EC.5.17}$$
which yields $\mu_j - \mu_i > \epsilon$, contradicting that $i \in G_\epsilon$. \Halmos
Lemma EC.2 and Lemma EC.4 above show that under $\mathcal{E}_1$, $G_r \cup B_r = [K]$ leads to $G_r = G_\epsilon$ and $B_r = G_\epsilon^c$. Since $\mathbb{P}\{\mathcal{E}_1\} \ge 1 - \delta$, if LinFACT terminates, it returns the correct answer with probability at least $1 - \delta$. Up to now, we have demonstrated the correctness of the algorithm's stopping rule. We now focus on bounding the sample complexity in the following parts.
EC.5.3 Step 2: Total Sample Count
To bound the expected sampling budget, we decompose the total number of samples into two parts: the budget used before the round when the last arm is added to the good set $G_r$, and the budget used after this round until termination. The total number of samples drawn by the algorithm can thus be represented as
$$N \le \sum_{r=1}^{\infty} \mathbb{1}\big[G_{r-1} \cup B_{r-1} \ne [K]\big] \sum_{a \in \mathcal{A}(r-1)} N_a(r) = \sum_{r=1}^{\infty} \mathbb{1}\big[G_{r-1} \ne G_\epsilon\big]\, \mathbb{1}\big[G_{r-1} \cup B_{r-1} \ne [K]\big] \sum_{a \in \mathcal{A}(r-1)} N_a(r) \tag{EC.5.18}$$
$$+ \sum_{r=1}^{\infty} \mathbb{1}\big[G_{r-1} = G_\epsilon\big]\, \mathbb{1}\big[G_{r-1} \cup B_{r-1} \ne [K]\big] \sum_{a \in \mathcal{A}(r-1)} N_a(r). \tag{EC.5.19}$$
In the following parts, we will bound these two terms separately. We begin by analyzing a single arm ๐ โ ๐บ ๐ , tracking which sets it belongs to as the rounds progress.
For $i \in G_\epsilon$, let $R_i$ denote the number of rounds in which arm $i$ is sampled before being added to $G_r$ in line 8 of Algorithm 4.1. For $i \in G_\epsilon^c$, let $R_i$ denote the number of rounds in which arm $i$ is sampled before being removed from $\mathcal{A}_I(r-1)$ and added to $B_r$ in line 6 of Algorithm 4.1. Then, by definition, $R_i$ can be expressed as
$$R_i = \min\Big\{r : \begin{cases} i \in G_r & \text{if } i \in G_\epsilon \\ i \notin \mathcal{A}_I(r) & \text{if } i \in G_\epsilon^c \end{cases}\Big\} = \min\Big\{r : \begin{cases} \hat{\mu}_i - \epsilon_r \ge U_r & \text{if } i \in G_\epsilon \\ \hat{\mu}_i + \epsilon_r \le L_r & \text{if } i \in G_\epsilon^c \end{cases}\Big\}. \tag{EC.5.20}$$

Bound $R_i$.
We define a helper function $h(x) \coloneqq \log_2(1/|x|)$ to facilitate the proof. Observe that in round $r$, if $r \ge h(x)$, then $\epsilon_r = 2^{-r} \le |x|$.
Lemma EC.6
For any $i \in G_\epsilon$, we have $R_i \le \lceil h(0.25(\epsilon - \Delta_i)) \rceil$, where $\Delta_i \coloneqq \mu_1 - \mu_i$.
Proof EC.7
Proof. Note that for $i \in G_\epsilon$, the inequality $4\epsilon_r < \epsilon - (\mu_1 - \mu_i)$ holds when $r = \lceil h(0.25(\epsilon - \Delta_i)) \rceil$. This implies that for all $j \in \mathcal{A}_I(r-1)$,
$$\hat{\mu}_i - \epsilon_r \ge \mu_i - 2\epsilon_r > \mu_1 + 2\epsilon_r - \epsilon \ge \mu_j + 2\epsilon_r - \epsilon \ge \hat{\mu}_j + \epsilon_r - \epsilon. \tag{EC.5.21}$$
Thus, in particular, $\hat{\mu}_i - \epsilon_r > \max_{j \in \mathcal{A}_I(r-1)} \hat{\mu}_j + \epsilon_r - \epsilon = U_r$. Therefore, we conclude that $R_i \le \lceil h(0.25(\epsilon - \Delta_i)) \rceil$. \Halmos
After determining the latest round in which any arm $a \in G_\epsilon$ is added to $G_r$, we define $R_{\max}$ as the round by which all good arms have been added to $G_r$, i.e., $R_{\max} := \min\{r : G_r \supseteq G_\epsilon\} = \max_{a \in G_\epsilon} R_a$.
Lemma EC.8
$R_{\max} \le \lceil h(0.25\,\alpha_\epsilon) \rceil$.
Proof EC.9
Proof. Recall that $\alpha_\epsilon = \min_{a \in G_\epsilon} \mu_a - \mu_1 + \epsilon = \min_{a \in G_\epsilon} \epsilon - \Delta_a$. By Lemma EC.6, $R_a \le \lceil h(0.25(\epsilon - \Delta_a)) \rceil$ for $a \in G_\epsilon$. Furthermore, $h(\cdot)$ is monotonically decreasing in $|x|$. Then
$$R_{\max} = \max_{a \in G_\epsilon} R_a \le \max_{a \in G_\epsilon} \lceil h(0.25(\epsilon - \Delta_a)) \rceil = \left\lceil h\left(0.25 \min_{a \in G_\epsilon} (\epsilon - \Delta_a)\right) \right\rceil = \lceil h(0.25\,\alpha_\epsilon) \rceil. \Halmos$$
Bound the Total Samples up to $R_{\max}$.
The total number of samples up to round $R_{\max}$ is $\sum_{r=1}^{R_{\max}} \sum_{a \in \mathcal{A}(r-1)} N_r(a)$. By line 4 of Algorithm 4.1, we have
$$N_r \le \frac{2d}{\epsilon_r^2} \log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}.$$
(EC.5.22)
Hence,
$$\begin{aligned}
\sum_{r=1}^{R_{\max}} N_r = \sum_{r=1}^{R_{\max}} \sum_{a \in \mathcal{A}(r-1)} N_r(a)
&\le \sum_{r=1}^{R_{\max}} \left( \frac{2d}{\epsilon_r^2} \log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2} \right) \\
&\le \sum_{r=1}^{R_{\max}} 2^{2r+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max} \\
&\le c\,2^{2R_{\max}+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max},
\end{aligned}$$
(EC.5.23)
where $c$ is a universal constant, and recall that $R_{\max} \le \lceil h(0.25\,\alpha_\epsilon) \rceil$. The second line follows from equation (EC.5.22), using $2d/\epsilon_r^2 = 2^{2r+1}d$ and enlarging $r(r+1)$ to $R_{\max}(R_{\max}+1)$. The last line replaces each term $2^{2r+1}$ with the largest term $2^{2R_{\max}+1}$ and absorbs the geometric sum into the finite constant $c$.
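The last step of (EC.5.23) absorbs the geometric sum $\sum_{r=1}^{R_{\max}} 2^{2r+1}$ into a constant multiple of its largest term. A small numerical check (illustrative, using $c = 2$; the true constant $4/3$ is even smaller):

```python
# sum_{r=1}^{R} 2^(2r+1) = (2/3)(4^(R+1) - 4) <= (4/3) * 2^(2R+1) < 2 * 2^(2R+1)
for R in range(1, 20):
    total = sum(2 ** (2 * r + 1) for r in range(1, R + 1))
    assert total <= 2 * 2 ** (2 * R + 1)
```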
Next, we bound the two terms in equation (EC.5.18) and equation (EC.5.19) separately. The first term, which represents the samples taken before round $R_{\max}$, can be bounded as follows.
Bound (EC.5.18).
Recall that $R_{\max} \le \lceil h(0.25\,\alpha_\epsilon) \rceil$ is the round by which $G_{R_{\max}} \supseteq G_\epsilon$. We have
$$\begin{aligned}
\sum_{r=1}^{\infty} \mathbb{1}\left[G_{r-1} \not\supseteq G_\epsilon\right]\mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right]\sum_{a \in \mathcal{A}(r-1)} N_r(a)
&= \sum_{r=1}^{R_{\max}} \mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right]\sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&\le c\,2^{2R_{\max}+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max},
\end{aligned}$$
(EC.5.24)
where the first equality follows from the definition of $R_{\max}$, since the indicator $\mathbb{1}[G_{r-1} \not\supseteq G_\epsilon]$ vanishes for all $r > R_{\max}$ and no additional samples are counted in this term after round $R_{\max}$. The second inequality follows directly from equation (EC.5.23).
Bound (EC.5.19).
Next, we have
$$\begin{aligned}
\sum_{r=1}^{\infty} \mathbb{1}\left[G_{r-1} \supseteq G_\epsilon\right]\mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right]\sum_{a \in \mathcal{A}(r-1)} N_r(a)
&= \sum_{r=R_{\max}+1}^{\infty} \mathbb{1}\left[B_{r-1} \not\supseteq G_\epsilon^c\right]\sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&\le \sum_{r=R_{\max}+1}^{\infty} \left|G_\epsilon^c \setminus B_{r-1}\right| \sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&= \sum_{r=R_{\max}+1}^{\infty} \sum_{a \in G_\epsilon^c} \mathbb{1}\left[a \notin B_{r-1}\right] \sum_{b \in \mathcal{A}(r-1)} N_r(b) \\
&= \sum_{a \in G_\epsilon^c} \sum_{r=R_{\max}+1}^{\infty} \mathbb{1}\left[a \notin B_{r-1}\right] \sum_{b \in \mathcal{A}(r-1)} N_r(b) \\
&\le \sum_{a \in G_\epsilon^c} \sum_{r=1}^{\infty} \mathbb{1}\left[a \notin B_{r-1}\right] \sum_{b \in \mathcal{A}(r-1)} N_r(b) \\
&\le \sum_{a \in G_\epsilon^c} \sum_{r=1}^{\infty} \mathbb{1}\left[a \notin B_{r-1}\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right),
\end{aligned}$$
(EC.5.25)
where the first equality follows from the definition of $R_{\max}$, since $G_{r-1} \supseteq G_\epsilon$ holds for all rounds beyond $R_{\max}$. The second line uses the fact that as long as $G_\epsilon^c \setminus B_{r-1}$ is nonempty, the corresponding indicator is 1 and is thus bounded by the cardinality. The third line writes this cardinality as a sum of indicators over the arms in $G_\epsilon^c$. The remaining lines exchange the order of the double summation, enlarge the summation range over the rounds $r$, and apply equation (EC.5.22).
EC.5.4 Step 3: Bound the Expected Total Samples of LinFACT
We now take expectations over the total number of samples drawn for the given bandit instance $\nu$. These expectations are conditioned on the high-probability event $\mathcal{E}_1$:
$$\begin{aligned}
\mathbb{E}_\nu\left[N_G \mid \mathcal{E}_1\right]
&\le \sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right] \mid \mathcal{E}_1\right] \sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&= \sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[G_{r-1} \not\supseteq G_\epsilon\right]\mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right] \mid \mathcal{E}_1\right] \sum_{a \in \mathcal{A}(r-1)} N_r(a)
+ \sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[G_{r-1} \supseteq G_\epsilon\right]\mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right] \mid \mathcal{E}_1\right] \sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&\overset{\text{(EC.5.24)}}{\le} c\,2^{2R_{\max}+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max}
+ \sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[G_{r-1} \supseteq G_\epsilon\right]\mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right] \mid \mathcal{E}_1\right] \sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&\overset{\text{(EC.5.25)}}{\le} c\,2^{2R_{\max}+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max}
+ \sum_{a \in G_\epsilon^c} \sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right).
\end{aligned}$$
(EC.5.26)
The first line follows from the stopping condition of LinFACT-G and the additivity of expectation. The second line applies the decomposition established in Step 2. The subsequent lines use the results from equations (EC.5.24) and (EC.5.25).
Next, we bound the last term. For a given $a \in G_\epsilon^c$ and round $r$, we first bound the probability that $a \notin B_r$. By the Borel–Cantelli lemma, this implies that the probability of $a$ never being added to any $B_r$ is zero.
Lemma EC.10
For $a \in G_\epsilon^c$ and $r \ge \left\lceil \log_2\frac{4}{\Delta_a - \epsilon} \right\rceil$, we have $\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_r\right] \mid \mathcal{E}_1\right] = 0$.
Proof EC.11
Proof.
First, we have for any $a \in G_\epsilon^c$,
$$\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_r\right] \mid \mathcal{E}_1\right]
= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right] \,\Big|\, \mathcal{E}_1,\ a \notin B_j\ (j \in \{1, 2, \dots, r-1\})\right]
\le \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right] \,\Big|\, \mathcal{E}_1\right],$$
(EC.5.27)
where the first line follows from the fact that arm $a \in G_\epsilon^c$ has not been added into $B_j$ in any round from 1 to $r-1$ on the event $\{a \notin B_r\}$, and the second line accounts for the conditional expectation.
If $a \in B_{r-1}$, then $a \in B_r$ by definition. Otherwise, if $a \notin B_{r-1}$, then under event $\mathcal{E}_1$, for $a \in G_\epsilon^c$ and $r \ge \left\lceil \log_2\frac{4}{\Delta_a - \epsilon} \right\rceil$, we have
$$\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge \Delta_a - 2^{-r+1} \ge \epsilon + 2\epsilon_r,$$
(EC.5.28)
which implies that $a \in B_r$ by line 6 of Algorithm 4.1. In other words, we have established the correctness of the algorithm when line 6 is triggered in Step 1. We now specify the exact condition under which line 6 occurs. In particular, under event $\mathcal{E}_1$, if $a \notin B_{r-1}$, then for all $a \in G_\epsilon^c$ and $r \ge \left\lceil \log_2\frac{4}{\Delta_a - \epsilon} \right\rceil$, we have
$$\mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a > 2\epsilon_r + \epsilon\right] \,\Big|\, a \notin B_{r-1}, \mathcal{E}_1\right] = 1.$$
(EC.5.29)
Therefore, for all $a \in G_\epsilon^c$ and $r \ge \left\lceil \log_2\frac{4}{\Delta_a - \epsilon} \right\rceil$, we have
$$\begin{aligned}
&\mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right] \,\Big|\, \mathcal{E}_1\right] \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \in B_{r-1}\right] \,\Big|\, \mathcal{E}_1\right]
+ \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \notin B_{r-1}\right] \,\Big|\, \mathcal{E}_1\right] \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \notin B_{r-1}\right] \,\Big|\, \mathcal{E}_1\right] \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \notin B_{r-1}\right] \,\Big|\, a \notin B_{r-1}, \mathcal{E}_1\right] \mathbb{P}\left(a \notin B_{r-1} \mid \mathcal{E}_1\right) \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right] \,\Big|\, a \notin B_{r-1}, \mathcal{E}_1\right] \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \\
&= 0,
\end{aligned}$$
(EC.5.30)
where the second line follows from the additivity of expectation. The third line uses the deterministic result that $\mathbb{1}[a \notin B_r]\,\mathbb{1}[a \in B_{r-1}] = 0$ (since $B_{r-1} \subseteq B_r$), so the term on the event $\{a \in B_{r-1}\}$ contributes nothing to the quantity of interest. The fourth line is the decomposition based on the conditional expectation, in which the term conditioned on $\{a \in B_{r-1}\}$ vanishes because it contains the factor $\mathbb{1}[a \notin B_{r-1}]$. The fifth line uses the fact that the expectation of an indicator function is simply the corresponding probability. The last line follows from the result in equation (EC.5.29). The lemma then follows by combining this result with equation (EC.5.27). \Halmos
Lemma EC.12
For $a \in G_\epsilon^c$, we have
$$\sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right)
\le \frac{d(d+1)}{2}\log_2\left(\frac{8}{\Delta_a - \epsilon}\right) + \frac{c \cdot 256\,d}{(\Delta_a - \epsilon)^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\Delta_a - \epsilon}\right).$$
(EC.5.31)
Proof EC.13
Proof.
$$\begin{aligned}
&\sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right) \\
&= \sum_{r=1}^{\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right)
+ \sum_{r=\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil + 1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right) \\
&= \sum_{r=1}^{\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right) + 0 \\
&\le \sum_{r=1}^{\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil} \left( 2^{2r+1} d \log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right) \\
&\le \frac{d(d+1)}{2}\log_2\left(\frac{8}{\Delta_a - \epsilon}\right) + 2d\log\left(\frac{2K}{\delta}\right)\sum_{r=1}^{\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil} 2^{2r} + 4d\sum_{r=1}^{\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil} 2^{2r}\log(r+1) \\
&\le \frac{d(d+1)}{2}\log_2\left(\frac{8}{\Delta_a - \epsilon}\right) + \frac{c \cdot 256\,d}{(\Delta_a - \epsilon)^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\Delta_a - \epsilon}\right),
\end{aligned}$$
(EC.5.32)
where $c$ is a sufficiently large universal constant. The second line decomposes the summation across the rounds. The third line uses Lemma EC.10 to eliminate the tail sum. The fourth line bounds the expectation of the indicator by its maximum value of 1 and applies equation (EC.5.22). The remaining lines replace the round $r$ with its maximum value $\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil$ and bound the truncated geometric sums. \Halmos
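The constant 256 in (EC.5.32) comes from bounding the truncated geometric sum: with $R = \lceil \log_2(4/g) \rceil$ for $g = \Delta_a - \epsilon$, one has $4^R \le 64/g^2$, hence $\sum_{r=1}^{R} 2^{2r} \le \tfrac{4}{3} 4^R \le 256/(3g^2)$. A quick numerical check (illustrative values):

```python
import math

# For g = Delta_a - eps in (0, 1], R = ceil(log2(4/g)) satisfies
# 4^R <= 64 / g^2, hence sum_{r=1}^{R} 4^r <= (4/3) * 4^R <= 256 / (3 g^2).
for g in [1.0, 0.5, 0.2, 0.07, 0.01, 0.0003]:
    R = math.ceil(math.log2(4.0 / g))
    assert 4 ** R <= 64.0 / g ** 2 + 1e-6
    assert sum(4 ** r for r in range(1, R + 1)) <= 256.0 / (3 * g ** 2) + 1e-6
```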
Summarizing the aforementioned results, we have
$$\mathbb{E}_\nu\left[N_G \mid \mathcal{E}_1\right] \le c\,2^{2R_{\max}+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max}
+ \sum_{a \in G_\epsilon^c}\sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right]\left(\frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right),$$
(EC.5.33)
where
$$R_{\max} \le \lceil h(0.25\,\alpha_\epsilon) \rceil \le \log_2\frac{8}{\alpha_\epsilon}.$$
(EC.5.34)
Also, we have
$$\sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right]\left(\frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right)
\le \frac{d(d+1)}{2}\log_2\left(\frac{8}{\Delta_a - \epsilon}\right) + \frac{c \cdot 256\,d}{(\Delta_a - \epsilon)^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\Delta_a - \epsilon}\right).$$
(EC.5.35)
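The second inequality in (EC.5.34) is just the elementary fact that a ceiling costs at most one extra power of two: $\lceil \log_2(4/x) \rceil \le \log_2(4/x) + 1 = \log_2(8/x)$. A one-line numerical check (illustrative values):

```python
import math

# ceil(log2(4/x)) <= log2(8/x), since ceil(y) <= y + 1 and log2(8/x) = log2(4/x) + 1
for x in [1.5, 1.0, 0.4, 0.11, 0.006]:
    assert math.ceil(math.log2(4.0 / x)) <= math.log2(8.0 / x) + 1e-9
```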
Then, we arrive at the final result as follows:
$$\begin{aligned}
\mathbb{E}_\nu\left[N_G \mid \mathcal{E}_1\right]
&\le c\,2^{2R_{\max}+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max}
+ \sum_{a \in G_\epsilon^c}\sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right]\left(\frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right) \\
&\le \frac{c \cdot 256\,d}{\alpha_\epsilon^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\alpha_\epsilon}\right) + \frac{d(d+1)}{2}\log_2\frac{8}{\alpha_\epsilon}
+ \sum_{a \in G_\epsilon^c}\left(\frac{d(d+1)}{2}\log_2\left(\frac{8}{\Delta_a - \epsilon}\right) + \frac{c \cdot 256\,d}{(\Delta_a - \epsilon)^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\Delta_a - \epsilon}\right)\right),
\end{aligned}$$
(EC.5.36)
which can be further expressed as
$$\mathbb{E}\left[N_G \mid \mathcal{E}_1\right] = \mathcal{O}\left(d\,\alpha_\epsilon^{-2}\log\left(\frac{K}{\delta}\log\left(\alpha_\epsilon^{-2}\right)\right) + d^2\log\left(\alpha_\epsilon^{-1}\right)\right)
+ \sum_{a \in G_\epsilon^c}\mathcal{O}\left(d^2\log\left((\Delta_a - \epsilon)^{-1}\right) + d\,(\Delta_a - \epsilon)^{-2}\log\left(\frac{K}{\delta}\log\left((\Delta_a - \epsilon)^{-2}\right)\right)\right).$$
(EC.5.37)

EC.5.5 A Refined Bound
The result obtained in the previous steps involves a summation over the set $G_\epsilon^c$, which can be further improved by eliminating this summation. Rather than focusing solely on the round $R_{\max}$, by which all arms in $G_\epsilon$ have been classified into $G_r$, we now consider the round at which all classifications are complete, i.e., $G_r \cup B_r = [K]$.
Lemma EC.14
For $a \in G_\epsilon$ and $r \ge \left\lceil \log_2\frac{4}{\epsilon - \Delta_a} \right\rceil$, we have $\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin G_r\right] \mid \mathcal{E}_1\right] = 0$.
Proof EC.15
Proof. First, for any $a \in G_\epsilon$,
$$\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin G_r\right] \mid \mathcal{E}_1\right]
= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right] \,\Big|\, \mathcal{E}_1,\ a \notin G_j\ (j \in \{1, 2, \dots, r-1\})\right]
\le \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right] \,\Big|\, \mathcal{E}_1\right].$$
(EC.5.38)
If $a \in G_{r-1}$, then $a \in G_r$ by definition. Otherwise, if $a \notin G_{r-1}$, then under event $\mathcal{E}_1$, for $a \in G_\epsilon$ and $r \ge \left\lceil \log_2\frac{4}{\epsilon - \Delta_a} \right\rceil$, we have
$$\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le \mu_{\arg\max_{b \in \mathcal{A}(r-1)} \hat\mu_b} - \mu_a + 2^{-r+1} \le \Delta_a + 2^{-r+1} \le \epsilon - 2\epsilon_r,$$
(EC.5.39)
which implies that $a \in G_r$ by line 8 of Algorithm 4.1. In other words, we now specify the exact condition under which line 8 occurs. In particular, under event $\mathcal{E}_1$, if $a \notin G_{r-1}$, then for all $a \in G_\epsilon$ and $r \ge \left\lceil \log_2\frac{4}{\epsilon - \Delta_a} \right\rceil$, we have
$$\mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le -2\epsilon_r + \epsilon\right] \,\Big|\, a \notin G_{r-1}, \mathcal{E}_1\right] = 1.$$
(EC.5.40)
Deterministically, $\mathbb{1}\left[a \notin G_r\right] \cdot \mathbb{1}\left[a \in G_{r-1}\right] = 0$, since $G_{r-1} \subseteq G_r$. Therefore,
$$\begin{aligned}
&\mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right] \,\Big|\, \mathcal{E}_1\right] \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \in G_{r-1}\right] \,\Big|\, \mathcal{E}_1\right]
+ \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \notin G_{r-1}\right] \,\Big|\, \mathcal{E}_1\right] \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \notin G_{r-1}\right] \,\Big|\, \mathcal{E}_1\right] \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \notin G_{r-1}\right] \,\Big|\, a \notin G_{r-1}, \mathcal{E}_1\right]\mathbb{P}\left(a \notin G_{r-1} \mid \mathcal{E}_1\right) \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right] \,\Big|\, a \notin G_{r-1}, \mathcal{E}_1\right]\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin G_{r-1}\right] \mid \mathcal{E}_1\right] \\
&= 0,
\end{aligned}$$
(EC.5.41)
where the second line comes from the additivity of expectation. The third line follows from the deterministic result that $\mathbb{1}[a \notin G_r]\,\mathbb{1}[a \in G_{r-1}] = 0$, so the term on the event $\{a \in G_{r-1}\}$ contributes nothing. The fourth line applies a decomposition based on the conditional expectation, in which the term conditioned on $\{a \in G_{r-1}\}$ vanishes because it contains the factor $\mathbb{1}[a \notin G_{r-1}]$. The fifth line uses the fact that the expectation of an indicator function equals the corresponding probability. The final line follows from the result in equation (EC.5.40). The lemma then follows by combining this result with equation (EC.5.38). \Halmos
Lemma EC.16
The round at which all classifications are complete and the final answer is returned satisfies $R_{\text{upper}} = \max\left\{\left\lceil \log_2\frac{4}{\alpha_\epsilon} \right\rceil, \left\lceil \log_2\frac{4}{\beta_\epsilon} \right\rceil\right\}$.
Proof EC.17
Proof. Combining the result of Lemma EC.14 with that of Lemma EC.10, we have the following: for $a \in G_\epsilon^c$ and $r \ge \left\lceil \log_2\frac{4}{\Delta_a - \epsilon} \right\rceil$, we have $\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_r\right] \mid \mathcal{E}_1\right] = 0$; for $a \in G_\epsilon$ and $r \ge \left\lceil \log_2\frac{4}{\epsilon - \Delta_a} \right\rceil$, we have $\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin G_r\right] \mid \mathcal{E}_1\right] = 0$. Since $\alpha_\epsilon = \min_{a \in G_\epsilon}(\epsilon - \Delta_a)$ and $\beta_\epsilon = \min_{a \in G_\epsilon^c}(\Delta_a - \epsilon)$, for any round $r \ge R_{\text{upper}}$, all arms have been classified into either $G_r$ or $B_r$, marking the termination of the algorithm. \Halmos
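The termination round simply combines the two classification deadlines. The following sketch (illustrative values) checks that at $r = R_{\text{upper}}$ the radius satisfies $4\epsilon_r \le \min(\alpha_\epsilon, \beta_\epsilon)$, so the conditions of both Lemma EC.10 and Lemma EC.14 hold for every arm:

```python
import math

def r_upper(alpha: float, beta: float) -> int:
    # R_upper = max{ceil(log2(4/alpha)), ceil(log2(4/beta))}
    return max(math.ceil(math.log2(4.0 / alpha)),
               math.ceil(math.log2(4.0 / beta)))

for alpha, beta in [(0.3, 0.1), (0.05, 0.4), (0.02, 0.02)]:
    R = r_upper(alpha, beta)
    # At round R, 4 * eps_R <= min(alpha, beta): every arm is classified.
    assert 4.0 * 2.0 ** (-R) <= min(alpha, beta) + 1e-12
```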
Lemma EC.18
For the expected sample complexity conditioned on the high-probability event $\mathcal{E}_1$, we have
$$\mathbb{E}_\nu\left[N_{\mathcal{X}\mathcal{Y}} \mid \mathcal{E}_1\right] \le c\,\max\left\{\frac{256\,d}{\alpha_\epsilon^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\alpha_\epsilon}\right),\ \frac{256\,d}{\beta_\epsilon^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\beta_\epsilon}\right)\right\} + \frac{d(d+1)}{2}R_{\text{upper}},$$
(EC.5.42)
where $c$ is a universal constant and $R_{\text{upper}} = \max\left\{\left\lceil \log_2\frac{4}{\alpha_\epsilon} \right\rceil, \left\lceil \log_2\frac{4}{\beta_\epsilon} \right\rceil\right\}$.
Proof EC.19
Proof. We have
$$\begin{aligned}
\mathbb{E}_\nu\left[N_{\mathcal{X}\mathcal{Y}} \mid \mathcal{E}_1\right]
&\le \sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[G_r \cup B_r \neq [K]\right] \mid \mathcal{E}_1\right] \sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&\le \sum_{r=1}^{R_{\text{upper}}}\left(c\,2^{2r+1} d \log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right) \\
&\le \frac{d(d+1)}{2}R_{\text{upper}} + 2d\log\left(\frac{2K}{\delta}\right)\sum_{r=1}^{R_{\text{upper}}} 2^{2r} + 4d\sum_{r=1}^{R_{\text{upper}}} 2^{2r}\log(r+1) \\
&\le 4\log\left[\frac{2K}{\delta}\left(R_{\text{upper}} + 1\right)\right]\sum_{r=1}^{R_{\text{upper}}} d\,2^{2r} + \frac{d(d+1)}{2}R_{\text{upper}}
\end{aligned}$$
(EC.5.43)
$$\le c\,\max\left\{\frac{256\,d}{\alpha_\epsilon^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\alpha_\epsilon}\right),\ \frac{256\,d}{\beta_\epsilon^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\beta_\epsilon}\right)\right\} + \frac{d(d+1)}{2}R_{\text{upper}}.$$
(EC.5.44)
The second line follows from Lemma EC.16. Then, we have
$$\mathbb{E}\left[N_G \mid \mathcal{E}\right] = \mathcal{O}\left(d\,\Delta^{-2}\log\left(\frac{K}{\delta}\log_2\left(\Delta^{-2}\right)\right) + d^2\log\left(\Delta^{-1}\right)\right),$$
(EC.5.45)
where $\Delta = \min(\alpha_\epsilon, \beta_\epsilon)/16$ is the minimum of the two gap parameters (suitably scaled), indicating the difficulty of the problem instance.
\Halmos

Appendix EC.6 Additional Insights into the Algorithm Optimality
From another perspective, we now consider the relationship between the lower bound and the upper bound and give some additional insights into the optimality of the algorithm. This relationship serves as the basis for the derivation of a near-optimal upper bound in Theorem 4.2.
For all $a \in (\mathcal{A}(r-1) \cap G_\epsilon)$ and all $b \in \mathcal{A}(r-1)$, $b \neq a$, in round $r$, we have
$$y_{b,a}^\top\left(\hat\beta_r - \beta\right) \le 2\epsilon_r$$
(EC.6.1)
and
$$y_{b,a}^\top\hat\beta_r - \epsilon \le y_{b,a}^\top\beta + 2\epsilon_r - \epsilon.$$
(EC.6.2)
Lemma EC.1
Define $G_r' := \left\{a : \exists\, b \in \mathcal{A}(r-1),\ b \neq a,\ y_{b,a}^\top\beta > \epsilon - 4\epsilon_r\right\}$. We always have $(\mathcal{A}(r-1) \cap G_\epsilon \cap G_r^c) \subseteq G_r'$.
Proof EC.2
Proof. For $r = 1$, the lemma follows directly from the assumption in Theorem 4.2 that $\max_{a \in [K]} |\mu_1 - \mu_a| \le 2$. For $r \ge 2$, we proceed by contradiction. Suppose $a \in (\mathcal{A}(r-1) \cap G_\epsilon \cap G_r^c) \cap (G_r')^c$; then for every $b \in \mathcal{A}(r-1)$ with $b \neq a$, we have
$$y_{b,a}^\top\beta \le -4\epsilon_r + \epsilon.$$
(EC.6.3)
Hence, using equation (EC.6.2), for every $b \in \mathcal{A}(r-1)$ with $b \neq a$, we have
$$y_{b,a}^\top\hat\beta_r - \epsilon \le -2\epsilon_r,$$
(EC.6.4)
which is exactly the condition for the algorithm to add arm $a$ into $G_r$ at line 8 of the algorithm; this yields a contradiction and completes the argument. Moreover, note that when $r \ge \lceil \log_2\frac{4}{\alpha_\epsilon} \rceil$, we have $G_r' \cap G_\epsilon = \emptyset$. Furthermore, considering that $a \in G_\epsilon$, we have $y_{b,a}^\top\beta - \epsilon \le 0$. \Halmos
There is, however, one exceptional case that invalidates the above derivation. Specifically, when $b$ is the index of the arm with the largest mean value among the active arms, i.e., $b = \arg\max_{b' \in \mathcal{A}(r-1)} \mu_{b'}$, the proof no longer holds. For this situation to occur, it must be the case that $\arg\max_{b' \in \mathcal{A}(r-1)} \mu_{b'} \in G_\epsilon^c$, which requires $\epsilon \le 2\epsilon_r$. Since this condition can only hold for a limited number of rounds, its impact is negligible and can therefore be ignored.
On the other hand, for $a \in (\mathcal{A}(r-1) \cap G_\epsilon^c)$, in any round $r$, we have
$$y_{1,a}^\top\left(\beta - \hat\beta_r\right) \le 2\epsilon_r$$
(EC.6.5)
and
$$y_{1,a}^\top\hat\beta_r - \epsilon \ge y_{1,a}^\top\beta - 2\epsilon_r - \epsilon.$$
(EC.6.6)
Lemma EC.3
Define $B_r' := \left\{a : y_{1,a}^\top\beta - \epsilon < 4\epsilon_r\right\}$. We always have $(\mathcal{A}(r-1) \cap G_\epsilon^c) \subseteq B_r'$.
Proof EC.4
Proof. To establish this, first note that when $r = 1$, the lemma directly follows from the assumption in Theorem 4.2 that $\max_{a \in [K]} |\mu_1 - \mu_a| \le 2$. For $r \ge 2$, using the same contradiction argument, assume that $a \in (\mathcal{A}(r-1) \cap G_\epsilon^c) \cap (B_r')^c$. Then we must have
$$y_{1,a}^\top\beta \ge 4\epsilon_r + \epsilon.$$
(EC.6.7)
Consequently, by equation (EC.6.6), it follows that
$$y_{1,a}^\top\hat\beta_r - \epsilon \ge 2\epsilon_r.$$
(EC.6.8)
This is precisely the condition for the algorithm to add arm $a$ into $B_r$ and eliminate it from $\mathcal{A}(r-1)$, as specified in line 6 of the algorithm. This contradiction leads to the desired result. Moreover, when $r \ge \lceil \log_2\frac{4}{\beta_\epsilon} \rceil$, we have $B_r' \cap G_\epsilon^c = \emptyset$. Finally, for $a \in G_\epsilon^c$, it follows that $y_{1,a}^\top\beta - \epsilon > 0$. \Halmos
We now present a critical lemma that establishes the connection between the lower bound in Theorem 3.2 and the upper bound. Define $u_\mathcal{Y}$ as the gauge of $\mathcal{Y}(\mathcal{X})$, where $\mathcal{X}$ denotes the initial set of all arm vectors. The details are provided in Lemma EC.3.
Lemma EC.5
Considering the lower bound $(\Gamma^*)^{-1}$ in Theorem 3.2, we have
$$(\Gamma^*)^{-1} \ge \frac{u_\mathcal{Y}^2\,L_2}{4\,R_{\text{upper}}\,d\,L_1}\sum_{r=1}^{R_{\text{upper}}} 2^{2r-3}\min_{\lambda \in \mathcal{P}_K}\max_{\substack{a, b \in \mathcal{A}(r-1) \\ b \neq a}}\left\|y_{b,a}\right\|^2_{A_\lambda^{-1}},$$
(EC.6.9)
where $R_{\text{upper}} = \max\left\{\left\lceil \log_2\frac{4}{\alpha_\epsilon} \right\rceil, \left\lceil \log_2\frac{4}{\beta_\epsilon} \right\rceil\right\}$, and $L_1$ and $L_2$ are constants.
Proof EC.6
Proof. When round $r$ exceeds $R_{\text{upper}}$, we have $G_r' \cap G_\epsilon = \emptyset$ and $B_r' \cap G_\epsilon^c = \emptyset$, implying that $G_r \cup B_r = [K]$ and the algorithm terminates. From Theorem 3.2, we obtain
$$\begin{aligned}
(\Gamma^*)^{-1} &= \min_{\lambda \in \mathcal{P}_K}\max_{(a,b,c) \in \mathcal{V}}\max\left\{\frac{2\|y_{b,a}\|^2_{A_\lambda^{-1}}}{\left(y_{b,a}^\top\beta + \epsilon\right)^2},\ \frac{2\|y_{1,c}\|^2_{A_\lambda^{-1}}}{\left(y_{1,c}^\top\beta - \epsilon\right)^2}\right\} \\
&= \min_{\lambda \in \mathcal{P}_K}\max_{r \le R_{\text{upper}}}\max_{a \in G_r' \cap G_\epsilon}\max_{\substack{b \in \mathcal{A}(r-1) \\ b \neq a}}\max_{c \in B_r' \cap G_\epsilon^c}\max\left\{\frac{2\|y_{b,a}\|^2_{A_\lambda^{-1}}}{\left(y_{b,a}^\top\beta + \epsilon\right)^2},\ \frac{2\|y_{1,c}\|^2_{A_\lambda^{-1}}}{\left(y_{1,c}^\top\beta - \epsilon\right)^2}\right\} \\
&\ge \min_{\lambda \in \mathcal{P}_K}\max_{r \le R_{\text{upper}}}\max_{a \in G_r' \cap G_\epsilon}\max_{\substack{b \in \mathcal{A}(r-1) \\ b \neq a}}\max_{c \in B_r' \cap G_\epsilon^c}\max\left\{\frac{2\|y_{b,a}\|^2_{A_\lambda^{-1}}}{(4\epsilon_r)^2},\ \frac{2\|y_{1,c}\|^2_{A_\lambda^{-1}}}{(4\epsilon_r)^2}\right\} \\
&\overset{\text{(i)}}{\ge} \frac{1}{R_{\text{upper}}}\min_{\lambda \in \mathcal{P}_K}\sum_{r=1}^{R_{\text{upper}}}\max_{a \in G_r' \cap G_\epsilon}\max_{\substack{b \in \mathcal{A}(r-1) \\ b \neq a}}\max_{c \in B_r' \cap G_\epsilon^c}\max\left\{\frac{2\|y_{b,a}\|^2_{A_\lambda^{-1}}}{(4\epsilon_r)^2},\ \frac{2\|y_{1,c}\|^2_{A_\lambda^{-1}}}{(4\epsilon_r)^2}\right\} \\
&\overset{\text{(ii)}}{\ge} \frac{1}{R_{\text{upper}}}\sum_{r=1}^{R_{\text{upper}}} 2^{2r-3}\min_{\lambda \in \mathcal{P}_K}\max_{a \in G_r' \cap G_\epsilon}\max_{\substack{b \in \mathcal{A}(r-1) \\ b \neq a}}\max_{c \in B_r' \cap G_\epsilon^c}\max\left\{\|y_{b,a}\|^2_{A_\lambda^{-1}},\ \|y_{1,c}\|^2_{A_\lambda^{-1}}\right\} \\
&\overset{\text{(iii)}}{\ge} \frac{1}{R_{\text{upper}}}\sum_{r=1}^{R_{\text{upper}}} 2^{2r-3}\min_{\lambda \in \mathcal{P}_K}\max_{a \in \mathcal{A}(r-1) \cap G_\epsilon \cap G_r^c}\max_{\substack{b \in \mathcal{A}(r-1) \\ b \neq a}}\max_{c \in \mathcal{A}(r-1) \cap G_\epsilon^c}\max\left\{\|y_{b,a}\|^2_{A_\lambda^{-1}},\ \|y_{1,c}\|^2_{A_\lambda^{-1}}\right\} \\
&\overset{\text{(iv)}}{\ge} \frac{u_\mathcal{Y}^2\,L_2}{d\,L_1}\cdot\frac{1}{R_{\text{upper}}}\sum_{r=1}^{R_{\text{upper}}} 2^{2r-3}\min_{\lambda \in \mathcal{P}_K}\max_{a \in \mathcal{A}(r-1) \cap G_\epsilon \cup \{1\}}\max_{c \in \mathcal{A}(r-1) \cap G_\epsilon^c}\max\left\{\|y_{1,a}\|^2_{A_\lambda^{-1}},\ \|y_{1,c}\|^2_{A_\lambda^{-1}}\right\} \\
&\overset{\text{(v)}}{\ge} \frac{u_\mathcal{Y}^2\,L_2}{4\,R_{\text{upper}}\,d\,L_1}\sum_{r=1}^{R_{\text{upper}}} 2^{2r-3}\min_{\lambda \in \mathcal{P}_K}\max_{\substack{a, b \in \mathcal{A}(r-1) \\ b \neq a}}\|y_{b,a}\|^2_{A_\lambda^{-1}} \\
&= \frac{u_\mathcal{Y}^2\,L_2}{32\,R_{\text{upper}}\,d\,L_1}\sum_{r=1}^{R_{\text{upper}}} 2^{2r}\,\rho_{\mathcal{X}\mathcal{Y}}\left(\mathcal{Y}(\mathcal{X}(r-1))\right),
\end{aligned}$$
(EC.6.10)
where (i) follows from the fact that the maximum of positive numbers is always greater than or equal to their average, and (ii) uses the fact that the minimum of a sum is greater than or equal to the sum of the minimums. (iii) arises from the set inclusion relationships established in Lemma EC.1 and Lemma EC.3. (iv) is a direct consequence of Lemma EC.3 with $b = 1$. Finally, for (v), note that for any $a, b \in \mathcal{A}(r-1)$, we have
$$\max_{\substack{a, b \in \mathcal{A}(r-1) \\ b \neq a}}\|y_{b,a}\|^2_{A_\lambda^{-1}} \le 4\max_{b \in \mathcal{A}(r-1) \cup \{1\}}\|y_{1,b}\|^2_{A_\lambda^{-1}}. \Halmos$$
Moreover, to provide additional insights into the G-optimal design, from equation (EC.5.43), we obtain
$$\mathbb{E}_\nu\left[N \mid \mathcal{E}_1\right] \le 4\log\left[\frac{2K}{\delta}\left(R_{\text{upper}} + 1\right)\right]\sum_{r=1}^{R_{\text{upper}}} d\,2^{2r} + \frac{d(d+1)}{2}R_{\text{upper}}.$$
(EC.6.11)
From inequalities (EC.6.10) and (EC.6.11), it follows that to align the upper and lower bounds, one must establish $d \le c\,\min_{\lambda \in \mathcal{P}_K}\max_{a, b \in \mathcal{A}(r-1),\, b \neq a}\|y_{b,a}\|^2_{A_\lambda^{-1}}$ for some universal constant $c$. However, applying the Kiefer–Wolfowitz theorem from Lemma EC.5 yields only the reverse inequality
$$\min_{\lambda \in \mathcal{P}_K}\max_{\substack{a, b \in \mathcal{A}(r-1) \\ b \neq a}}\|y_{b,a}\|^2_{A_\lambda^{-1}} \le 4\min_{\lambda \in \mathcal{P}_K}\max_{a \in \mathcal{A}(r-1)}\|x_a\|^2_{A_\lambda^{-1}} = 4d.$$
(EC.6.12)
This result indicates that LinFACT-G cannot be shown to achieve the lower bound established in Theorem 3.2 via this argument.
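For intuition on the Kiefer–Wolfowitz identity invoked above ($\min_\lambda \max_a \|x_a\|^2_{A_\lambda^{-1}} = d$ whenever the arms span $\mathbb{R}^d$), here is a minimal numerical sketch using the classical Fedorov–Wynn/Frank–Wolfe update for the D-optimal design. The step-size formula, iteration count, and function name are standard illustrative choices, not taken from the paper:

```python
import numpy as np

def g_optimal_value(X: np.ndarray, iters: int = 2000) -> float:
    """Approximate min over designs lambda of max_a ||x_a||^2 in A(lambda)^-1."""
    K, d = X.shape
    lam = np.ones(K) / K
    for _ in range(iters):
        A = X.T @ (lam[:, None] * X)                  # A(lambda) = sum_a lam_a x_a x_a^T
        g = np.einsum("ki,ij,kj->k", X, np.linalg.inv(A), X)
        j = int(np.argmax(g))                          # arm with the largest variance
        step = (g[j] / d - 1.0) / (g[j] - 1.0)         # Fedorov-Wynn step size
        lam *= (1.0 - step)
        lam[j] += step
    A = X.T @ (lam[:, None] * X)
    return float(np.max(np.einsum("ki,ij,kj->k", X, np.linalg.inv(A), X)))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))  # 20 arms spanning R^4
val = g_optimal_value(X)
# Kiefer-Wolfowitz: the optimal value equals d = 4 (max >= d always holds).
assert 4.0 - 1e-6 <= val <= 4.3
```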
Appendix EC.7 Proof of Theorem 4.2
The central idea of this proof is to establish a direct relationship between the lower bound and the key minimax summation terms in the upper bound, thereby enabling the sample complexity to be bounded explicitly in terms of the lower-bound term $(\Gamma^*)^{-1}$. We first show that the good event $\mathcal{E}_3$ defined below holds with probability at least $1 - \delta_r$ in each round $r$, where $\delta_r$ denotes the probability that $\mathcal{E}_3$ fails in round $r$. Taking a union bound across rounds then completes the proof of Lemma EC.1, which shows that $\mathcal{E}_3$ holds in every round with probability at least $1 - \delta$. Consequently, we can sum the per-round sample complexity bounds (conditioned on the good event) to obtain the overall bound on the sample complexity.
First, we define the good event $\mathcal{E}_3$:
$$\mathcal{E}_3 = \bigcap_{a \in \mathcal{A}(r-1)}\ \bigcap_{\substack{b \in \mathcal{A}(r-1) \\ b \neq a}}\ \bigcap_{r \in \mathbb{N}}\left\{\left|\left(\hat\mu_b(r) - \hat\mu_a(r)\right) - \left(\mu_b - \mu_a\right)\right| \le 2\epsilon_r\right\}.$$
(EC.7.1)
Since arms are sampled according to a preset allocation (i.e., a fixed design), we introduce the following lemma to provide a confidence region for the estimated parameter $\beta$.
Lemma EC.1
Let $\delta \in (0, 1)$. Then, it holds that $\mathbb{P}(\mathcal{E}_3) \ge 1 - \delta$.
Proof EC.2
Proof. Since $\hat\beta_r$ is an ordinary least squares estimator of $\beta$ and the noise is i.i.d., it follows that $y^\top(\beta - \hat\beta_r)$ is $\|y\|_{A_r^{-1}}$-sub-Gaussian for all $y \in \mathcal{Y}(\mathcal{X}(r-1))$. Moreover, the guarantees of the rounding procedure ensure that
$$\|y\|^2_{A_r^{-1}} \le (1 + \varepsilon)\,\rho_{\mathcal{X}\mathcal{Y}}\left(\mathcal{Y}(\mathcal{X}(r-1))\right)/N_r \le \frac{2^{-2r-1}}{\log\frac{2K(K-1)r(r+1)}{\delta}}$$
(EC.7.2)
for all $y \in \mathcal{Y}(\mathcal{X}(r-1))$, as ensured by our choice of $N_r$ in equation (22). Since the right-hand side is deterministic and does not depend on the randomness of the arm rewards, we have that for any $\pi > 0$ and $y \in \mathcal{Y}(\mathcal{X}(r-1))$,
$$\mathbb{P}\left\{\left|y^\top\left(\beta - \hat\beta_r\right)\right| > 2^{-r}\sqrt{\frac{\log(2/\pi)}{\log\frac{2K(K-1)r(r+1)}{\delta}}}\right\} \le \pi.$$
(EC.7.3)
Letting $\pi = \frac{\delta}{K(K-1)r(r+1)}$ and applying a union bound over all possible $y \in \mathcal{Y}(\mathcal{X}(r-1))$, where $|\mathcal{Y}(\mathcal{X}(r-1))| \le |\mathcal{Y}(\mathcal{X}(0))| \le K(K-1)$, we obtain the desired probability guarantee:
$$\begin{aligned}
\mathbb{P}\left(\mathcal{E}_3^c\right)
&= \mathbb{P}\left\{\bigcup_{r \in \mathbb{N}}\ \bigcup_{\substack{a, b \in \mathcal{A}(r-1) \\ b \neq a}}\left\{\left|\left(\hat\mu_b(r) - \hat\mu_a(r)\right) - \left(\mu_b - \mu_a\right)\right| > 2\epsilon_r\right\}\right\} \\
&\le \sum_{r=1}^{\infty}\mathbb{P}\left\{\bigcup_{\substack{a, b \in \mathcal{A}(r-1) \\ b \neq a}}\left\{\left|\left(\hat\mu_b(r) - \hat\mu_a(r)\right) - \left(\mu_b - \mu_a\right)\right| > 2\epsilon_r\right\}\right\} \\
&\le \sum_{r=1}^{\infty} K(K-1)\cdot\frac{\delta}{K(K-1)r(r+1)} = \delta\sum_{r=1}^{\infty}\frac{1}{r(r+1)} = \delta.
\end{aligned}$$
(EC.7.4)
Taking the union bound over all rounds $r \in \mathbb{N}$ completes the proof. \Halmos
Thus, by the standard result of the $\mathcal{X}\mathcal{Y}$-optimal design, we have
$$C_{\delta/K}(r) \le \epsilon_r,$$
(EC.7.5)
which matches the expression in equation (EC.5.5) for the G-optimal design.
Lemma EC.3
On the event $\mathcal{E}_3$, the best arm $1 \in \mathcal{A}(r)$ for all $r \in \mathbb{N}$.
Proof EC.4
Proof. If the event $\mathcal{E}_3$ holds, then for any arm $a \in \mathcal{A}(r-1)$, we have
$$\hat\mu_a(r) - \hat\mu_1(r) \le \mu_a - \mu_1 + 2\epsilon_r \le 2\epsilon_r < 2\epsilon_r + \epsilon,$$
(EC.7.6)
which implies that $\hat\mu_1(r) + \epsilon_r > \max_{a \in \mathcal{A}(r-1)}\hat\mu_a - \epsilon_r - \epsilon = L_r$ and $\hat\mu_1(r) + \epsilon_r \ge \max_{a \in \mathcal{A}(r-1)}\hat\mu_a(r) - C_{\delta/K}(r)$.
These inequalities ensure that arm 1 is not eliminated from $\mathcal{A}(r)$ in LinFACT. \Halmos
Lemma EC.5
With probability at least $1 - \delta$, and employing an $\varepsilon$-efficient rounding procedure, LinFACT-$\mathcal{X}\mathcal{Y}$ successfully identifies all $\epsilon$-best arms and achieves instance-optimal sample complexity up to logarithmic factors, as given by
$$N \le c\left[d\,R_{\text{upper}}\log\left(\frac{2K\left(R_{\text{upper}} + 1\right)}{\delta}\right)\right](\Gamma^*)^{-1} + p(d)\,R_{\text{upper}},$$
(EC.7.7)
where $c$ is a universal constant and $R_{\text{upper}} = \max\left\{\left\lceil \log_2\frac{4}{\alpha_\epsilon} \right\rceil, \left\lceil \log_2\frac{4}{\beta_\epsilon} \right\rceil\right\}$.
Proof EC.6
Proof. Combining the result of Lemma EC.1, we conclude that with probability at least $1 - \delta$,
$$\begin{aligned}
N &\le \sum_{r=1}^{R_{\text{upper}}}\max\left\{\left\lceil 2\,\rho_{\mathcal{X}\mathcal{Y}}\left(\mathcal{Y}(\mathcal{X}(r-1))\right)\frac{(1 + \varepsilon)}{\epsilon_r^2}\log\left(\frac{2K(K-1)r(r+1)}{\delta}\right)\right\rceil,\ p(d)\right\} \\
&\le \sum_{r=1}^{R_{\text{upper}}} 2\cdot 2^{2r}\,\rho_{\mathcal{X}\mathcal{Y}}\left(\mathcal{Y}(\mathcal{X}(r-1))\right)(1 + \varepsilon)\log\left(\frac{2K(K-1)r(r+1)}{\delta}\right) + \left(1 + p(d)\right)R_{\text{upper}} \\
&\le \left[64(1 + \varepsilon)\log\left(\frac{2K(K-1)R_{\text{upper}}\left(R_{\text{upper}} + 1\right)}{\delta}\right)\frac{R_{\text{upper}}\,d\,L_1}{u_\mathcal{Y}^2\,L_2}\right](\Gamma^*)^{-1} + \left(1 + p(d)\right)R_{\text{upper}} \\
&\le \left[128(1 + \varepsilon)\log\left(\frac{2K\left(R_{\text{upper}} + 1\right)}{\delta}\right)\frac{R_{\text{upper}}\,d\,L_1}{u_\mathcal{Y}^2\,L_2}\right](\Gamma^*)^{-1} + \left(1 + p(d)\right)R_{\text{upper}} \\
&\le c\left[d\,R_{\text{upper}}\log\left(\frac{2K\left(R_{\text{upper}} + 1\right)}{\delta}\right)\right](\Gamma^*)^{-1} + p(d)\,R_{\text{upper}},
\end{aligned}$$
(EC.7.8)
where $c$ is a universal constant, $R_{\text{upper}} = \max\left\{\left\lceil \log_2\frac{4}{\alpha_\epsilon} \right\rceil, \left\lceil \log_2\frac{4}{\beta_\epsilon} \right\rceil\right\}$, and the third inequality follows from equation (EC.6.10).
Then, let $\Delta = \min(\alpha_\epsilon, \beta_\epsilon)/16$ be the minimum gap of the problem instance. Considering that the approximation term $p(d)$ is of the form $\mathcal{O}(d/\varepsilon^2)$ (Allen-Zhu et al. 2021, Fiez et al. 2019), we have
$$\mathbb{E}\left[N_{\mathcal{X}\mathcal{Y}} \mid \mathcal{E}\right] = \mathcal{O}\left(d\,(\Gamma^*)^{-1}\log\left(\Delta^{-1}\right)\log\left(\frac{K}{\delta}\log\left(\Delta^{-2}\right)\right) + \frac{d}{\varepsilon^2}\log\left(\Delta^{-1}\right)\right).$$
(EC.7.9)
\Halmos

Appendix EC.8 Proof of Theorem 5.1
EC.8.1 Step 1: Define the Clean Events
The core of the proof lies in similarly defining the round at which all classifications are completed under the misspecified model. To this end, we reconstruct the anytime confidence radius for each arm in round $r$ and redefine the high-probability event over the entire execution of the algorithm. We denote this clean event as $\mathcal{E}_{1\text{-}m}$:
$$\mathcal{E}_{1\text{-}m} = \left\{\forall a \in \mathcal{A}(r-1),\ \forall r \in \mathbb{N},\ \left|\hat\mu_a(r) - \mu_a\right| \le \epsilon_r + \delta_m\sqrt{d}\right\}.$$
(EC.8.1)
Similarly, since arms are sampled according to a preset allocation (i.e., the optimal design criterion), we invoke Lemma EC.1 to derive an adjusted confidence region for the estimated parameter $\hat\beta_r$. When following the G-optimal sampling rule, as specified in lines 2 and 4 of the pseudocode, we obtain the following result for each round $r$:
$$A_r = \sum_{x \in \text{Supp}(\lambda_r)} N_r(x)\,x x^\top \succeq \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right)A(\lambda_r).$$
(EC.8.2)
To give the confidence radius, we have the following decomposition:
$$\left|\left\langle\hat\beta_r - \beta, x\right\rangle\right| = \left|x^\top A_r^{-1}\sum_{t=1}^{N_r}\Delta_m\left(x_{A_t}\right)x_{A_t} + x^\top A_r^{-1}\sum_{t=1}^{N_r}\eta_t\,x_{A_t}\right|
\le \left|x^\top A_r^{-1}\sum_{t=1}^{N_r}\Delta_m\left(x_{A_t}\right)x_{A_t}\right| + \left|x^\top A_r^{-1}\sum_{t=1}^{N_r}\eta_t\,x_{A_t}\right|,$$
(EC.8.3)
where the first term is bounded by
$$\begin{aligned}
\left|x^\top A_r^{-1}\sum_{t=1}^{N_r}\Delta_m\left(x_{A_t}\right)x_{A_t}\right|
&= \left|x^\top A_r^{-1}\sum_{a \in \mathcal{A}(r-1)} N_r(a)\,\Delta_m(x_a)\,x_a\right| \\
&\le \delta_m\sum_{a \in \mathcal{A}(r-1)} N_r(a)\left|x^\top A_r^{-1}x_a\right| \\
&\le \delta_m\sqrt{\left(\sum_{a \in \mathcal{A}(r-1)} N_r(a)\right)x^\top\left(\sum_{a \in \mathcal{A}(r-1)} N_r(a)\,A_r^{-1}x_a x_a^\top A_r^{-1}\right)x} \\
&= \delta_m\sqrt{\sum_{a \in \mathcal{A}(r-1)} N_r(a)}\cdot\|x\|_{A_r^{-1}} \le \delta_m\sqrt{d},
\end{aligned}$$
(EC.8.4)
where the first inequality follows from Hölder's inequality, the second from Jensen's inequality, and the last from the guarantee of the G-optimal exploration policy, which ensures that $\|x\|^2_{A_r^{-1}} \le d/N_r$.
The second term is also bounded using Lemma EC.1 and the result in Lemma EC.1. For any arm $x \in \mathcal{X}(r-1)$, with probability at least $1 - \frac{\delta}{Kr(r+1)}$, we have
$$\begin{aligned}
\left|x^\top A_r^{-1}\sum_{t=1}^{N_r}\eta_t\,x_{A_t}\right|
&\le \sqrt{2\,\|x\|^2_{A_r^{-1}}\log\left(\frac{2Kr(r+1)}{\delta}\right)}
= \sqrt{2\,x^\top A_r^{-1}x\,\log\left(\frac{2Kr(r+1)}{\delta}\right)} \\
&\le \sqrt{2\,x^\top\left(\frac{\epsilon_r^2}{2d}\frac{1}{\log\frac{2Kr(r+1)}{\delta}}A(\lambda_r)^{-1}\right)x\,\log\left(\frac{2Kr(r+1)}{\delta}\right)} \le \epsilon_r.
\end{aligned}$$
(EC.8.5)
Thus, with the standard result of the G-optimal design, we also have
$$C_{\delta/K}(r) \le \epsilon_r.$$
(EC.8.6)
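The cancellation in (EC.8.5) is exact: with the per-round budget $N_r = \frac{2d}{\epsilon_r^2}L$, where $L = \log\frac{2Kr(r+1)}{\delta}$, and the G-optimal guarantee $\|x\|^2_{A_r^{-1}} \le d/N_r$, the deviation radius $\sqrt{2\,\|x\|^2_{A_r^{-1}}L}$ collapses to exactly $\epsilon_r$. A quick numerical check (illustrative values of $d$, $K$, $\delta$):

```python
import math

# With N_r = (2 d / eps_r^2) * L samples and ||x||^2_{A_r^{-1}} <= d / N_r,
# the radius sqrt(2 * (d / N_r) * L) equals eps_r exactly.
d, K, delta = 5, 30, 0.05
for r in range(1, 12):
    eps_r = 2.0 ** (-r)
    L = math.log(2 * K * r * (r + 1) / delta)
    N_r = 2 * d / eps_r ** 2 * L
    radius = math.sqrt(2 * (d / N_r) * L)
    assert radius <= eps_r + 1e-12
```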
To establish the probability guarantee for event $\mathcal{E}_{1r}$, we combine the results from equations (EC.8.4) and (EC.8.5), yielding
$$\begin{aligned}
\mathbb{P}\big( \mathcal{E}_{1r}^{c} \big)
&= \mathbb{P}\Big\{ \exists\, a \in \mathcal{X}_I(r-1),\ \exists\, r \in \mathbb{N}:\ |\hat\mu_r(a) - \mu_a| > \varepsilon_r + \delta_m\sqrt{d} \Big\} \\
&\le \sum_{r=1}^{\infty} \mathbb{P}\Big\{ \exists\, a \in \mathcal{X}_I(r-1):\ |\hat\mu_r(a) - \mu_a| > \varepsilon_r + \delta_m\sqrt{d} \Big\} \\
&\le \sum_{r=1}^{\infty} \sum_{a=1}^{K} \frac{\delta}{K\, r(r+1)} = \delta.
\end{aligned} \tag{EC.8.7}$$
Therefore, taking the union bound over rounds $r \in \mathbb{N}$, we have
$$\mathbb{P}(\mathcal{E}_{1r}) \ge 1 - \delta. \tag{EC.8.8}$$
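The union-bound accounting behind (EC.8.7) and (EC.8.8) can be checked directly: the per-round, per-arm failure mass $\delta/(Kr(r+1))$ telescopes to exactly $\delta$. A minimal sketch (ours):

```python
# Each round r assigns failure probability delta / (K r (r+1)) to each of the
# K arms; summing over arms and all rounds telescopes, since
# 1/(r(r+1)) = 1/r - 1/(r+1), so the total mass never exceeds delta.
delta, K, R = 0.05, 10, 10**6
total = sum(K * delta / (K * r * (r + 1)) for r in range(1, R + 1))
print(total)  # delta * (1 - 1/(R+1)), just below 0.05
```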
Considering an additional event that characterizes the gaps between different arms, defined as
$$\mathcal{E}_{2r} \triangleq \Big\{ \forall\, i \in G_{\varepsilon},\ \forall\, j \in \mathcal{X}_I(r-1),\ \forall\, r \in \mathbb{N}:\ \big| \big( \hat\mu_r(i) - \hat\mu_r(j) \big) - ( \mu_i - \mu_j ) \big| \le 2\varepsilon_r + 2\delta_m\sqrt{d} \Big\}. \tag{EC.8.9}$$
By equation (EC.8.1), for $i, j \in \mathcal{X}_I(r-1)$, we have
$$\begin{aligned}
&\mathbb{P}\Big\{ \big| ( \hat\mu_i - \hat\mu_j ) - ( \mu_i - \mu_j ) \big| > 2\varepsilon_r + 2\delta_m\sqrt{d} \;\Big|\; \mathcal{E}_{1r} \Big\} \\
&\quad\le \mathbb{P}\Big\{ |\hat\mu_i - \mu_i| + |\hat\mu_j - \mu_j| > 2\varepsilon_r + 2\delta_m\sqrt{d} \;\Big|\; \mathcal{E}_{1r} \Big\} \\
&\quad\le \mathbb{P}\Big\{ |\hat\mu_i - \mu_i| > \varepsilon_r + \delta_m\sqrt{d} \;\Big|\; \mathcal{E}_{1r} \Big\} + \mathbb{P}\Big\{ |\hat\mu_j - \mu_j| > \varepsilon_r + \delta_m\sqrt{d} \;\Big|\; \mathcal{E}_{1r} \Big\} \\
&\quad= 0,
\end{aligned} \tag{EC.8.10}$$
which implies
$$\mathbb{P}( \mathcal{E}_{2r} \mid \mathcal{E}_{1r} ) = 1. \tag{EC.8.11}$$

EC.8.2 Step 2: Bound the Expected Sample Complexity
To bound the expected sample complexity under model misspecification, we aim to identify the round in which all arms ๐ โ ๐บ ๐ have been included in ๐บ ๐ , and the round in which all arms ๐ โ ๐บ ๐ ๐ have been added to ๐ต ๐ .
Lemma EC.1
For $i \in G_{\varepsilon}$ and $\delta_m < \frac{\alpha_{\varepsilon}}{2\sqrt{d}}$, if $r \ge \big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i - 2\delta_m\sqrt{d}} \big\rceil$, then we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin G_r] \mid \mathcal{E}_{1r} \big] = 0$.
Proof EC.2
Proof. First, for any $i \in G_{\varepsilon}$,
$$\begin{aligned}
\mathbb{E}_M\big[ \mathbb{1}[i \notin G_r] \mid \mathcal{E}_{1r} \big]
&= \mathbb{E}_M\Big[ \mathbb{1}\Big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \ge -2\varepsilon_r + \varepsilon \Big] \;\Big|\; \mathcal{E}_{1r},\ i \notin G_s\ \big( s \in \{1, 2, \ldots, r-1\} \big) \Big] \\
&\le \mathbb{E}_M\Big[ \mathbb{1}\Big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \ge -2\varepsilon_r + \varepsilon \Big] \;\Big|\; \mathcal{E}_{1r} \Big].
\end{aligned} \tag{EC.8.12}$$
If $i \in G_{r-1}$, then $i \in G_r$ by definition. Otherwise, if $i \notin G_{r-1}$, then under event $\mathcal{E}_{1r}$, for $i \in G_{\varepsilon}$ and $r \ge \big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i - 2\delta_m\sqrt{d}} \big\rceil$, we have
$$\max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \;\le\; \mu_{\arg\max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j} - \mu_i + 2^{-r+1} + 2\delta_m\sqrt{d} \;\le\; \Delta_i + 2^{-r+1} + 2\delta_m\sqrt{d} \;\le\; \varepsilon - 2\varepsilon_r, \tag{EC.8.13}$$
which implies that $i \in G_r$ by line 8 of the algorithm. In particular, under event $\mathcal{E}_{1r}$, if $i \notin G_{r-1}$, for all $i \in G_{\varepsilon}$ and $r \ge \big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i - 2\delta_m\sqrt{d}} \big\rceil$, we have
$$\mathbb{E}_M\Big[ \mathbb{1}\Big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \le -2\varepsilon_r + \varepsilon \Big] \;\Big|\; i \notin G_{r-1},\ \mathcal{E}_{1r} \Big] = 1. \tag{EC.8.14}$$
Consequently, $\mathbb{1}[i \in G_r] - \mathbb{1}[i \in G_{r-1}] \ge 0$. Therefore, writing $\mathbb{1}[\,\cdot\,]$ for $\mathbb{1}\big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \ge -2\varepsilon_r + \varepsilon \big]$,
$$\begin{aligned}
\mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,] \mid \mathcal{E}_{1r} \big]
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \in G_{r-1}] \mid \mathcal{E}_{1r} \big] + \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin G_{r-1}] \mid \mathcal{E}_{1r} \big] \\
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin G_{r-1}] \mid \mathcal{E}_{1r} \big] \\
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin G_{r-1}] \mid i \in G_{r-1},\ \mathcal{E}_{1r} \big]\, \mathbb{P}( i \in G_{r-1} \mid \mathcal{E}_{1r} ) \\
&\qquad + \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin G_{r-1}] \mid i \notin G_{r-1},\ \mathcal{E}_{1r} \big]\, \mathbb{P}( i \notin G_{r-1} \mid \mathcal{E}_{1r} ) \\
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,] \mid i \notin G_{r-1},\ \mathcal{E}_{1r} \big] \cdot \mathbb{E}_M\big[ \mathbb{1}[i \notin G_{r-1}] \mid \mathcal{E}_{1r} \big] \\
&= 0,
\end{aligned} \tag{EC.8.15}$$
where the splitting step uses the additivity of expectation, the first term is then dropped by the deterministic fact that $\mathbb{1}[\,\cdot\,]\,\mathbb{1}[i \in G_{r-1}] = 0$, the conditional decomposition follows from the law of total expectation, the expectation of an indicator function equals the corresponding probability, and the last equality follows from the result in equation (EC.8.14). The lemma can thus be concluded together with equation (EC.8.12).
In the perfectly linear model, $\varepsilon - \Delta_i > 0$ always holds for any $i \in G_{\varepsilon}$. However, under model misspecification, the sign of this term within the logarithm must be verified. To ensure positivity for all $i \in G_{\varepsilon}$, it is necessary that $\alpha_{\varepsilon} > 2\delta_m\sqrt{d}$. As the misspecification magnitude $\delta_m$ approaches $\alpha_{\varepsilon}/(2\sqrt{d})$, the upper bound on the expected sample complexity increases sharply, since the misspecification significantly impairs the identification of $\varepsilon$-best arms. Moreover, when $\delta_m \ge \alpha_{\varepsilon}/(2\sqrt{d})$, the sample complexity can no longer be bounded in this form, which is intuitive and consistent with the general insights discussed earlier in Section 5.3. \Halmos
Lemma EC.3
For $i \in G_{\varepsilon}^{c}$ and $\delta_m < \frac{\beta_{\varepsilon}}{2\sqrt{d}}$, if $r \ge \big\lceil \log_2 \frac{4}{\Delta_i - \varepsilon - 2\delta_m\sqrt{d}} \big\rceil$, then we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin B_r] \mid \mathcal{E}_{1r} \big] = 0$.
Proof EC.4
Proof. First, for any $i \in G_{\varepsilon}^{c}$,
$$\begin{aligned}
\mathbb{E}_M\big[ \mathbb{1}[i \notin B_r] \mid \mathcal{E}_{1r} \big]
&= \mathbb{E}_M\Big[ \mathbb{1}\Big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \le 2\varepsilon_r + \varepsilon \Big] \;\Big|\; \mathcal{E}_{1r},\ i \notin B_s\ \big( s \in \{1, 2, \ldots, r-1\} \big) \Big] \\
&\le \mathbb{E}_M\Big[ \mathbb{1}\Big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \le 2\varepsilon_r + \varepsilon \Big] \;\Big|\; \mathcal{E}_{1r} \Big].
\end{aligned} \tag{EC.8.16}$$
If $i \in B_{r-1}$, then $i \in B_r$ by definition. Otherwise, if $i \notin B_{r-1}$, then under event $\mathcal{E}_{1r}$, for $i \in G_{\varepsilon}^{c}$ and $r \ge \big\lceil \log_2 \frac{4}{\Delta_i - \varepsilon - 2\delta_m\sqrt{d}} \big\rceil$, we have
$$\max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \;\ge\; \Delta_i - 2^{-r+1} - 2\delta_m\sqrt{d} \;\ge\; \varepsilon + 2\varepsilon_r, \tag{EC.8.17}$$
which implies that $i \in B_r$ by line 6 of the algorithm. In particular, under event $\mathcal{E}_{1r}$, if $i \notin B_{r-1}$, for all $i \in G_{\varepsilon}^{c}$ and $r \ge \big\lceil \log_2 \frac{4}{\Delta_i - \varepsilon - 2\delta_m\sqrt{d}} \big\rceil$, we have
$$\mathbb{E}_M\Big[ \mathbb{1}\Big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i > 2\varepsilon_r + \varepsilon \Big] \;\Big|\; i \notin B_{r-1},\ \mathcal{E}_{1r} \Big] = 1. \tag{EC.8.18}$$
Deterministically, $\mathbb{1}[i \in B_r] - \mathbb{1}[i \in B_{r-1}] \ge 0$. Therefore, writing $\mathbb{1}[\,\cdot\,]$ for $\mathbb{1}\big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \le 2\varepsilon_r + \varepsilon \big]$,
$$\begin{aligned}
\mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,] \mid \mathcal{E}_{1r} \big]
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \in B_{r-1}] \mid \mathcal{E}_{1r} \big] + \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin B_{r-1}] \mid \mathcal{E}_{1r} \big] \\
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin B_{r-1}] \mid \mathcal{E}_{1r} \big] \\
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin B_{r-1}] \mid i \in B_{r-1},\ \mathcal{E}_{1r} \big]\, \mathbb{P}( i \in B_{r-1} \mid \mathcal{E}_{1r} ) \\
&\qquad + \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin B_{r-1}] \mid i \notin B_{r-1},\ \mathcal{E}_{1r} \big]\, \mathbb{P}( i \notin B_{r-1} \mid \mathcal{E}_{1r} ) \\
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,] \mid i \notin B_{r-1},\ \mathcal{E}_{1r} \big] \cdot \mathbb{E}_M\big[ \mathbb{1}[i \notin B_{r-1}] \mid \mathcal{E}_{1r} \big] \\
&= 0.
\end{aligned} \tag{EC.8.19}$$
The splitting step follows from the linearity of expectation; the first term is dropped using the deterministic fact that $\mathbb{1}[\,\cdot\,]\,\mathbb{1}[i \in B_{r-1}] = 0$; the conditional decomposition applies the law of total expectation; the expectation of an indicator function equals the corresponding probability; and the final equality follows from equation (EC.8.18). The lemma then follows by combining this result with equation (EC.8.16).
Similarly, we must ensure the positivity of the term inside the logarithm. To guarantee that $\Delta_i - \varepsilon - 2\delta_m\sqrt{d} > 0$ for every $i \in G_{\varepsilon}^{c}$, it is necessary that $\beta_{\varepsilon} > 2\delta_m\sqrt{d}$. \Halmos
Lemma EC.5
Suppose $\delta_m < \min\big\{ \frac{\alpha_{\varepsilon}}{2\sqrt{d}},\ \frac{\beta_{\varepsilon}}{2\sqrt{d}} \big\}$; then, under model misspecification, the round by which all arms are classified is
$$r'_{\text{upper}} = \max\Big\{ \Big\lceil \log_2 \frac{4}{\alpha_{\varepsilon} - 2\delta_m\sqrt{d}} \Big\rceil,\ \Big\lceil \log_2 \frac{4}{\beta_{\varepsilon} - 2\delta_m\sqrt{d}} \Big\rceil \Big\}.$$
Proof EC.6
Proof. Combining the results of Lemmas EC.1 and EC.3, we define an auxiliary round $r_a$ to facilitate the derivation of the upper bound. Specifically, for $i \in G_{\varepsilon}^{c}$ and $r \ge \big\lceil \log_2 \frac{4}{\Delta_i - \varepsilon - 2\delta_m\sqrt{d}} \big\rceil$, we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin B_r] \mid \mathcal{E}_{1r} \big] = 0$. Similarly, for $i \in G_{\varepsilon}$ and $r \ge \big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i - 2\delta_m\sqrt{d}} \big\rceil$, we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin G_r] \mid \mathcal{E}_{1r} \big] = 0$.
Noting that $\alpha_{\varepsilon} = \min_{i \in G_{\varepsilon}} ( \varepsilon - \Delta_i )$ and $\beta_{\varepsilon} = \min_{i \in G_{\varepsilon}^{c}} ( \Delta_i - \varepsilon )$, it follows that for any round $r \ge r'_{\text{upper}}$, all arms have been included in either $G_r$ or $B_r$, marking the termination of the algorithm.
\Halmos
Lemma EC.7
Under model misspecification, for the expected sample complexity conditioned on the high-probability event $\mathcal{E}_{1r}$, we have
$$\begin{aligned}
\mathbb{E}_M[N_G^{\text{mis}} \mid \mathcal{E}_{1r}] \le c \max\Big\{ & \frac{256\, d}{( \alpha_{\varepsilon} - 2\delta_m\sqrt{d} )^2} \log\Big( \frac{2K}{\delta} \log_2 \frac{16}{\alpha_{\varepsilon} - 2\delta_m\sqrt{d}} \Big), \\
& \frac{256\, d}{( \beta_{\varepsilon} - 2\delta_m\sqrt{d} )^2} \log\Big( \frac{2K}{\delta} \log_2 \frac{16}{\beta_{\varepsilon} - 2\delta_m\sqrt{d}} \Big) \Big\} + \frac{d(d+1)}{2}\, r'_{\text{upper}},
\end{aligned} \tag{EC.8.20}$$
where $c$ is a universal constant and $r'_{\text{upper}} = \max\big\{ \big\lceil \log_2 \frac{4}{\alpha_{\varepsilon} - 2\delta_m\sqrt{d}} \big\rceil,\ \big\lceil \log_2 \frac{4}{\beta_{\varepsilon} - 2\delta_m\sqrt{d}} \big\rceil \big\}$.
Proof EC.8
Proof. We can also decompose $N$ as in equation (EC.5.26), where all expectations are conditioned on the high-probability event $\mathcal{E}_{1r}$, as given by
$$\begin{aligned}
\mathbb{E}_M[N_G^{\text{mis}} \mid \mathcal{E}_{1r}]
&\le \sum_{r=1}^{\infty} \mathbb{E}_M\big[ \mathbb{1}\big[ G_r \cup B_r \neq [K] \big] \mid \mathcal{E}_{1r} \big] \sum_{a \in \mathcal{X}(r-1)} T_a(r) \\
&\le \sum_{r=1}^{r'_{\text{upper}}} \Big( d\, 2^{2r+1} \log\Big( \frac{2Kr(r+1)}{\delta} \Big) + \frac{d(d+1)}{2} \Big) \\
&\le \frac{d(d+1)}{2}\, r'_{\text{upper}} + 2d \log\Big( \frac{2K}{\delta} \Big) \sum_{r=1}^{r'_{\text{upper}}} 2^{2r} + 4d \sum_{r=1}^{r'_{\text{upper}}} 2^{2r} \log(r+1) \\
&\le 4 \log\Big[ \frac{2K}{\delta} \big( r'_{\text{upper}} + 1 \big) \Big] \sum_{r=1}^{r'_{\text{upper}}} d\, 2^{2r} + \frac{d(d+1)}{2}\, r'_{\text{upper}} \\
&\le c \max\Big\{ \frac{256\, d}{( \alpha_{\varepsilon} - 2\delta_m\sqrt{d} )^2} \log\Big( \frac{2K}{\delta} \log_2 \frac{16}{\alpha_{\varepsilon} - 2\delta_m\sqrt{d}} \Big),\ \frac{256\, d}{( \beta_{\varepsilon} - 2\delta_m\sqrt{d} )^2} \log\Big( \frac{2K}{\delta} \log_2 \frac{16}{\beta_{\varepsilon} - 2\delta_m\sqrt{d}} \Big) \Big\} + \frac{d(d+1)}{2}\, r'_{\text{upper}}.
\end{aligned} \tag{EC.8.21}$$
Then, letting $H = \min\big( \alpha_{\varepsilon} - 2\delta_m\sqrt{d},\ \beta_{\varepsilon} - 2\delta_m\sqrt{d} \big)/16$, we have
$$\mathbb{E}_M[N_G^{\text{mis}} \mid \mathcal{E}_{1r}] = \mathcal{O}\Big( d\, H^{-2} \log\Big( \frac{K}{\delta} \log\big( H^{-2} \big) \Big) + d^2 \log\big( H^{-1} \big) \Big). \tag{EC.8.22}$$
\Halmos

Appendix EC.9 Proof of Theorem 5.2

EC.9.1 Step 1: Confidence Radius
In equation (EC.8.3), the first term, arising from the unknown model misspecification, is unavoidable without prior knowledge. However, rather than focusing on the distance between the true parameter $\beta$ and its estimator $\hat\beta_r$, we instead compare the orthogonal parameter $\beta_o$ with $\hat\beta_r$ in the direction of $x \in \mathbb{R}^d$.
Building on the definition of the empirically optimal vector $\hat\mu_o(r)$, with $\big( \hat\beta_o(r), \hat{\boldsymbol{\Delta}}_o(r) \big)$ as its associated solution, and the orthogonal parameterization $\big( \beta_o, \boldsymbol{\Delta}_o(\cdot) \big)$, we derive the confidence radius for each arm via the following decomposition. For any $a \in \mathcal{X}_I(r-1)$, let $\hat\mu_o(a) = \hat\mu_{o,a}(r)$ denote the value of the optimal estimator on arm $a$ in round $r$; then we have
$$\begin{aligned}
|\hat\mu_o(a) - \mu_a|
&= \big| \langle \hat\beta_o(r) - \beta,\, x_a \rangle + \hat\Delta_{o,a}(r) - \Delta_{m,a} \big| \\
&\le \big| \langle \hat\beta_o(r) - \beta_o,\, x_a \rangle \big| + \big| \langle \beta_o - \beta,\, x_a \rangle \big| + \big| \hat\Delta_{o,a}(r) - \Delta_{m,a} \big|,
\end{aligned} \tag{EC.9.1}$$
where the third term is bounded by definition as $\big| \hat\Delta_{o,a}(r) - \Delta_{m,a} \big| \le 2\delta_m$, while the first two terms can be bounded using the auxiliary lemmas provided below.
Lemma EC.1
Let $r$ be any round such that $\boldsymbol{V}_r$ is invertible. Consider the orthogonal parameterization $\big( \beta_o, \boldsymbol{\Delta}_o(\cdot) \big)$ for $\mu = X\beta + \boldsymbol{\Delta}$ with $\| \boldsymbol{\Delta} \|_{\infty} \le \delta_m$. Then
$$\|\beta_o - \beta\|_{\boldsymbol{V}_r} \le \delta_m \sqrt{n_r}, \tag{EC.9.2}$$
where $n_r$ is defined in equation (33).
Proof EC.2
Proof. We use the expression $\beta_o = \beta + \boldsymbol{V}_r^{-1} X^{\top} \boldsymbol{\Delta}$ derived above, where $X$ stacks the $n_r$ pulled contexts and $\boldsymbol{\Delta}$ the corresponding misspecification terms. Let $P_X = X \big( X^{\top} X \big)^{\dagger} X^{\top}$ be a projection; we have
$$\|\beta_o - \beta\|_{\boldsymbol{V}_r} = \big\| \boldsymbol{V}_r^{-1} X^{\top} \boldsymbol{\Delta} \big\|_{\boldsymbol{V}_r} = \sqrt{ \boldsymbol{\Delta}^{\top} X \boldsymbol{V}_r^{-1} X^{\top} \boldsymbol{\Delta} } = \| P_X \boldsymbol{\Delta} \|_2 \le \| \boldsymbol{\Delta} \|_2 \le \delta_m \sqrt{n_r}. \tag{EC.9.3}$$
\Halmos
Lemma EC.3
(Réda et al. (2021), Lemma 10). Let $\hat\mu_o(r) = X \hat\beta_o(r) + \hat{\boldsymbol{\Delta}}_o(r)$ in round $r$, where $\big( \hat\beta_o(r), \hat{\boldsymbol{\Delta}}_o(r) \big)$ is the solution of (30). Then the following relationship holds:
$$\|\hat\beta_o(r) - \beta_o\|_{\boldsymbol{V}_r}^2 \le \|\hat\beta_r - \beta_o\|_{\boldsymbol{V}_r}^2. \tag{EC.9.4}$$
Lemma EC.4
(Lattimore and Szepesvári (2020), Section 20). Let $\delta \in (0,1)$. Then, with probability at least $1 - \delta$, it holds that for all $t \in \mathbb{N}$,
$$\|\hat\beta_t - \beta\|_{\boldsymbol{V}_t} < 2\sqrt{2\Big( d \log(6) + \log\Big( \frac{1}{\delta} \Big) \Big)}. \tag{EC.9.5}$$
Equation (EC.9.5) is not directly applicable in our setting due to the deviation term introduced by model misspecification, as shown in equation (EC.8.3). However, by leveraging the orthogonal parameterization, we obtain
$$\|\hat\beta_r - \beta_o\|_{\boldsymbol{V}_r}^2 = \big\| \boldsymbol{V}_r^{-1} S_r \big\|_{\boldsymbol{V}_r}^2 = \| S_r \|_{\boldsymbol{V}_r^{-1}}^2, \tag{EC.9.6}$$
where $S_r = \sum_{i=1}^{n_r} \eta_{A_i}\, x_{A_i}$ is the standard self-normalized quantity in the linear bandit literature, allowing existing techniques, such as various concentration inequalities, to be applied directly in the presence of model misspecification. This observation leads to the conclusion that, under misspecification, for any round $r \in \mathbb{N}$, it holds with probability at least $1 - \delta$ that
$$\|\hat\beta_r - \beta_o\|_{\boldsymbol{V}_r} < 2\sqrt{2\Big( d \log(6) + \log\Big( \frac{1}{\delta} \Big) \Big)}. \tag{EC.9.7}$$
This result serves as the basis for designing the sampling budget and conducting theoretical analyses.
Together with Lemmas EC.1, EC.3, and EC.4, the distance $|\hat\mu_o(a) - \mu_a|$ can be bounded as follows: with probability at least $1 - \delta/\big( Kr(r+1) \big)$, we have
$$\begin{aligned}
|\hat\mu_o(a) - \mu_a|
&\le \big| \langle \hat\beta_o(r) - \beta_o,\, x_a \rangle \big| + \big| \langle \beta_o - \beta,\, x_a \rangle \big| + \big| \hat\Delta_{o,a}(r) - \Delta_{m,a} \big| \\
&\le \|\hat\beta_o(r) - \beta_o\|_{\boldsymbol{V}_r}\, \|x_a\|_{\boldsymbol{V}_r^{-1}} + \|\beta_o - \beta\|_{\boldsymbol{V}_r}\, \|x_a\|_{\boldsymbol{V}_r^{-1}} + 2\delta_m \\
&\le \|\hat\beta_r - \beta_o\|_{\boldsymbol{V}_r}\, \|x_a\|_{\boldsymbol{V}_r^{-1}} + \|\beta_o - \beta\|_{\boldsymbol{V}_r}\, \|x_a\|_{\boldsymbol{V}_r^{-1}} + 2\delta_m \\
&\le \Big( 2\sqrt{2\Big( d \log(6) + \log\Big( \frac{Kr(r+1)}{\delta} \Big) \Big)} + \delta_m \sqrt{n_r} \Big) \sqrt{\frac{d}{n_r}} + 2\delta_m \\
&\le \varepsilon_r + \big( \sqrt{d} + 2 \big)\, \delta_m.
\end{aligned} \tag{EC.9.8}$$
Then, we define a new clean event
$$\mathcal{E}_{1r}' \triangleq \Big\{ \forall\, a \in \mathcal{X}_I(r-1),\ \forall\, r \in \mathbb{N}:\ |\hat\mu_o(a) - \mu_a| \le \varepsilon_r + \big( \sqrt{d} + 2 \big)\, \delta_m \Big\}. \tag{EC.9.9}$$
This mirrors the event defined in equation (EC.8.1), allowing us to derive all corresponding results in Section EC.8.
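The projection bound of Lemma EC.1, $\|\beta_o - \beta\|_{\boldsymbol{V}_r} \le \delta_m\sqrt{n_r}$, admits a quick numerical sanity check. The sketch below is ours (not the paper's code) and assumes the notation $X$ for the $n_r \times d$ matrix of pulled contexts, so that $\boldsymbol{V}_r = X^{\top}X$ and $\beta_o - \beta = \boldsymbol{V}_r^{-1}X^{\top}\boldsymbol{\Delta}$:

```python
import numpy as np

# The deviation of the orthogonal parameter is a projection of the
# misspecification vector, so its V_r-norm is at most delta_m * sqrt(n_r).
rng = np.random.default_rng(0)
n_r, d, delta_m = 200, 6, 0.1
X = rng.standard_normal((n_r, d))                 # pulled contexts
Delta = rng.uniform(-delta_m, delta_m, size=n_r)  # |Delta_i| <= delta_m
V_r = X.T @ X
shift = np.linalg.solve(V_r, X.T @ Delta)         # beta_o - beta
dev = float(np.sqrt(shift @ V_r @ shift))         # ||beta_o - beta||_{V_r}
print(dev, delta_m * np.sqrt(n_r))                # dev is the smaller value
```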
EC.9.2 Step 2: Bound the Expected Sample Complexity
Lemma EC.5
For $i \in G_{\varepsilon}$ and $\delta_m < \frac{\alpha_{\varepsilon}}{2(\sqrt{d}+2)}$, if $r \ge \big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i - 2\delta_m(\sqrt{d}+2)} \big\rceil$, then we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin G_r] \mid \mathcal{E}_{1r}' \big] = 0$.
Lemma EC.6
For $i \in G_{\varepsilon}^{c}$ and $\delta_m < \frac{\beta_{\varepsilon}}{2(\sqrt{d}+2)}$, if $r \ge \big\lceil \log_2 \frac{4}{\Delta_i - \varepsilon - 2\delta_m(\sqrt{d}+2)} \big\rceil$, then we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin B_r] \mid \mathcal{E}_{1r}' \big] = 0$.
Lemma EC.7
For the misspecification magnitude $\delta_m < \min\big\{ \frac{\alpha_{\varepsilon}}{2(\sqrt{d}+2)},\ \frac{\beta_{\varepsilon}}{2(\sqrt{d}+2)} \big\}$, the round $r''_{\text{upper}} = \max\big\{ \big\lceil \log_2 \frac{4}{\alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \big\rceil,\ \big\lceil \log_2 \frac{4}{\beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \big\rceil \big\}$ marks the point at which all classifications are completed and the algorithm terminates under model misspecification when using an estimation procedure based on orthogonal parameterization.
The proofs of Lemmas EC.5, EC.6, and EC.7 follow similar arguments to those presented in Section EC.8 and are therefore omitted for brevity.
Lemma EC.8
For the expected sample complexity given the high-probability event $\mathcal{E}_{1r}'$, we have
$$\begin{aligned}
\mathbb{E}_M[N_G^{\text{op}} \mid \mathcal{E}_{1r}'] \le c \max\Big\{ & \frac{256\, d}{\big( \alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2) \big)^2} \log\Big( \frac{K \cdot 6^d}{\delta} \log_2 \frac{16}{\alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \Big), \\
& \frac{256\, d}{\big( \beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2) \big)^2} \log\Big( \frac{K \cdot 6^d}{\delta} \log_2 \frac{16}{\beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \Big) \Big\} + \frac{d(d+1)}{2}\, r''_{\text{upper}},
\end{aligned} \tag{EC.9.10}$$
where $c$ is a universal constant and $r''_{\text{upper}} = \max\big\{ \big\lceil \log_2 \frac{4}{\alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \big\rceil,\ \big\lceil \log_2 \frac{4}{\beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \big\rceil \big\}$ denotes the round in which all classifications are completed and the algorithm terminates under model misspecification with orthogonal parameterization.
Proof EC.9
Proof. The decomposition of $N$ in equation (EC.5.26) can be reformulated, where the expectations are conditioned on the high-probability event $\mathcal{E}_{1r}'$, given by
$$\begin{aligned}
\mathbb{E}_M[N_G^{\text{op}} \mid \mathcal{E}_{1r}']
&\le \sum_{r=1}^{\infty} \mathbb{E}_M\big[ \mathbb{1}\big[ G_r \cup B_r \neq [K] \big] \mid \mathcal{E}_{1r}' \big] \sum_{a \in \mathcal{X}(r-1)} T_a(r) \\
&\le \sum_{r=1}^{r''_{\text{upper}}} \Big( d\, 2^{2r+3} \Big( d \log(6) + \log\Big( \frac{Kr(r+1)}{\delta} \Big) \Big) + \frac{d(d+1)}{2} \Big) \\
&\le \frac{d(d+1)}{2}\, r''_{\text{upper}} + 8d \log\Big( \frac{K \cdot 6^d}{\delta} \Big) \sum_{r=1}^{r''_{\text{upper}}} 2^{2r} + 16d \sum_{r=1}^{r''_{\text{upper}}} 2^{2r} \log(r+1) \\
&\le 16 \log\Big[ \frac{K \cdot 6^d}{\delta} \big( r''_{\text{upper}} + 1 \big) \Big] \sum_{r=1}^{r''_{\text{upper}}} d\, 2^{2r} + \frac{d(d+1)}{2}\, r''_{\text{upper}} \quad \text{(EC.9.11)} \\
&\le c \max\Big\{ \frac{256\, d}{\big( \alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2) \big)^2} \log\Big( \frac{K \cdot 6^d}{\delta} \log_2 \frac{16}{\alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \Big), \\
&\qquad\quad \frac{256\, d}{\big( \beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2) \big)^2} \log\Big( \frac{K \cdot 6^d}{\delta} \log_2 \frac{16}{\beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \Big) \Big\} + \frac{d(d+1)}{2}\, r''_{\text{upper}}, \quad \text{(EC.9.12)}
\end{aligned}$$
where $c$ is a universal constant. Then, letting $H = \min\big( \alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2),\ \beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2) \big)/16$, we have
$$\mathbb{E}_M[N_G^{\text{op}} \mid \mathcal{E}] = \mathcal{O}\Big( d\, H^{-2} \log\Big( \frac{K \cdot 6^d}{\delta} \log\big( H^{-1} \big) \Big) + d^2 \log\big( H^{-1} \big) \Big). \tag{EC.9.13}$$
\Halmos

Appendix EC.10 Proof of Proposition 5.4
The proof of this proposition closely follows that of Theorem 5.1 in Section EC.8, with the only modification being the definition of $C_{\delta/K}(a)$, which is now set to $\varepsilon_r + \delta_m\sqrt{d}$ given that $\delta_m$ is known in advance. The conclusions in Section EC.8.1 remain valid. Additionally, the marginal round in Lemma EC.1 is updated from $\big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i - 2\delta_m\sqrt{d}} \big\rceil$ to $\big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i} \big\rceil$. The remainder of the proof follows directly by applying the same reasoning steps.
Appendix EC.11 Proof of Theorem 6.1

EC.11.1 Step 1: Rearrange the Clean Event
Following the derivation approach in Section EC.5.5, the core of the proof is to similarly identify the round in which all classifications are completed under the GLM setting. To this end, we reconstruct the anytime confidence radius for the arms in each round ๐ and define a high-probability event that holds throughout the execution of the algorithm.
Let $\dot{\boldsymbol{V}}_r = \sum_{i=1}^{n_r} \dot g_{\text{link}}\big( x_{A_i}^{\top} \tilde\beta_r \big)\, x_{A_i} x_{A_i}^{\top}$, where $\tilde\beta_r$ is some convex combination of the true parameter $\beta$ and the parameter $\hat\beta_r$ based on maximum likelihood estimation (MLE). It can be checked that the unweighted matrix $\boldsymbol{V}_r = \sum_{i=1}^{n_r} x_{A_i} x_{A_i}^{\top}$ in the standard linear model is the special case of this newly defined matrix $\dot{\boldsymbol{V}}_r$ when the inverse link function is the identity, $g_{\text{link}}(x) = x$.
For each arm $a$, we define the following auxiliary vector
$$w_a = \big( w_{a,1}, w_{a,2}, \ldots, w_{a,n_r} \big) = x_a^{\top} \dot{\boldsymbol{V}}_r^{-1} \big( x_{A_1}, x_{A_2}, \ldots, x_{A_{n_r}} \big) \in \mathbb{R}^{n_r}, \tag{EC.11.1}$$
and thus we have
$$\|w_a\|_2^2 = w_a w_a^{\top} = x_a^{\top} \dot{\boldsymbol{V}}_r^{-1} \boldsymbol{V}_r \dot{\boldsymbol{V}}_r^{-1} x_a. \tag{EC.11.2}$$
To give the confidence radius under the GLM, we have the following statement for any arm $a \in \mathcal{X}_I(r-1)$ in round $r$:
$$|\hat\mu_r(a) - \mu_a| = \big| x_a^{\top} \big( \hat\beta_r - \beta \big) \big| = \Big| x_a^{\top} \dot{\boldsymbol{V}}_r^{-1} \sum_{i=1}^{n_r} x_{A_i}\, \eta_i \Big| = \Big| \sum_{i=1}^{n_r} w_{a,i}\, \eta_{A_i} \Big|, \tag{EC.11.3}$$
where the second equality is established with Lemma 1 of Kveton et al. (2023), and $n_r$, defined in equation (40), is the adjusted sampling budget in each round $r$ for the GLM. Since $(\eta_{A_i})_{i \le n_r}$ are independent, mean-zero, 1-sub-Gaussian random variables, $\sum_{i=1}^{n_r} w_{a,i}\, \eta_{A_i}$ is a $\|w_a\|_2$-sub-Gaussian variable for each arm $a$, so we have
$$\mathbb{P}\big( |\hat\mu_r(a) - \mu_a| \ge \varepsilon_r \big) \le 2\exp\Big( -\frac{\varepsilon_r^2}{2\, \|w_a\|_2^2} \Big). \tag{EC.11.4}$$
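The sub-Gaussian tail in (EC.11.4) can be illustrated with a short Monte Carlo sketch (ours, not the paper's code; Gaussian noise stands in for the 1-sub-Gaussian $\eta_{A_i}$):

```python
import numpy as np

# For fixed weights w, S = sum_i w_i * eta_i with standard normal eta_i
# satisfies P(|S| > t) <= 2 exp(-t^2 / (2 ||w||_2^2)); the empirical tail
# over many draws sits well below the bound.
rng = np.random.default_rng(1)
w = rng.uniform(-1.0, 1.0, size=50)          # plays the role of (w_{a,i})_i
sigma2 = float(w @ w)                        # ||w||_2^2
t = 2.0 * np.sqrt(sigma2)
S = rng.standard_normal((50_000, 50)) @ w    # 50,000 independent copies of S
empirical = float(np.mean(np.abs(S) > t))
bound = 2 * np.exp(-t**2 / (2 * sigma2))
print(empirical, bound)                      # empirical tail below the bound
```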
Since $\dot{\boldsymbol{V}}_r$ is not known in the process, we need to find another way to represent this term. By assumption, we know $\dot g_{\text{link}} \ge \kappa_{\min}$ for some $\kappa_{\min} \in \mathbb{R}_+$ and for all $a \in \mathcal{X}_I(r-1)$. Therefore $\kappa_{\min}^{-1} \boldsymbol{V}_r^{-1} \succeq \dot{\boldsymbol{V}}_r^{-1}$ by the definition of $\dot{\boldsymbol{V}}_r$, and then we have
$$\|w_a\|_2^2 = x_a^{\top} \dot{\boldsymbol{V}}_r^{-1} \boldsymbol{V}_r \dot{\boldsymbol{V}}_r^{-1} x_a \;\le\; x_a^{\top}\, \kappa_{\min}^{-1} \boldsymbol{V}_r^{-1}\, \boldsymbol{V}_r\, \kappa_{\min}^{-1} \boldsymbol{V}_r^{-1}\, x_a \;=\; \kappa_{\min}^{-2}\, \|x_a\|_{\boldsymbol{V}_r^{-1}}^2. \tag{EC.11.5}$$
Furthermore, if the G-optimal design is considered, we have $\|x_a\|_{\boldsymbol{V}_r^{-1}}^2 \le \frac{d}{n_r}$. Together with equation (EC.11.4), we have
$$\mathbb{P}\big( |\hat\mu_r(a) - \mu_a| \ge \varepsilon_r \big) \le 2\exp\Big( -\frac{\varepsilon_r^2}{2\, \kappa_{\min}^{-2}\, \|x_a\|_{\boldsymbol{V}_r^{-1}}^2} \Big) \le 2\exp\Big( -\frac{\varepsilon_r^2\, \kappa_{\min}^2\, n_r}{2d} \Big). \tag{EC.11.6}$$
Finally, considering the definition of $n_r$ in equation (40), with probability at least $1 - \delta/\big( Kr(r+1) \big)$, we have
$$|\hat\mu_r(a) - \mu_a| \le \varepsilon_r. \tag{EC.11.7}$$
Thus, with the standard result of the G-optimal design, we still have
$$C_{\delta/K}(a) = \varepsilon_r, \tag{EC.11.8}$$
with which the events โฐ 1 and โฐ 2 in Section EC.5.1 hold with a probability of at least 1 โ ๐ฟ .
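The matrix step in (EC.11.5) relies on $\dot{\boldsymbol{V}}_r \succeq \kappa_{\min}\boldsymbol{V}_r$ implying $\dot{\boldsymbol{V}}_r^{-1}\boldsymbol{V}_r\dot{\boldsymbol{V}}_r^{-1} \preceq \kappa_{\min}^{-2}\boldsymbol{V}_r^{-1}$, which a short numerical sketch (ours) confirms for weighted design matrices:

```python
import numpy as np

# If Vdot = sum_i g_i x_i x_i^T with every derivative g_i >= kappa_min and
# V = sum_i x_i x_i^T, then a^T Vdot^{-1} V Vdot^{-1} a <= kappa_min^{-2} a^T V^{-1} a.
rng = np.random.default_rng(2)
n, d, kappa_min = 100, 4, 0.2
X = rng.standard_normal((n, d))
g = rng.uniform(kappa_min, 1.0, size=n)      # link-derivative weights
V = X.T @ X
Vdot = (X * g[:, None]).T @ X                # weighted information matrix
a = rng.standard_normal(d)
lhs = float(a @ np.linalg.solve(Vdot, V @ np.linalg.solve(Vdot, a)))
rhs = float(a @ np.linalg.solve(V, a)) / kappa_min**2
print(lhs, rhs)                              # lhs <= rhs
```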
EC.11.2 Step 2: Bound the Expected Sample Complexity
Lemma EC.1
For $i \in G_{\varepsilon}$, if $r \ge \big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i} \big\rceil$, then we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin G_r] \mid \mathcal{E}_1 \big] = 0$.
Lemma EC.2
For $i \in G_{\varepsilon}^{c}$, if $r \ge \big\lceil \log_2 \frac{4}{\Delta_i - \varepsilon} \big\rceil$, then we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin B_r] \mid \mathcal{E}_1 \big] = 0$.
Lemma EC.3
$R_{\text{GLM}} = r_{\text{upper}} = \max\big\{ \big\lceil \log_2 \frac{4}{\alpha_{\varepsilon}} \big\rceil,\ \big\lceil \log_2 \frac{4}{\beta_{\varepsilon}} \big\rceil \big\}$ is the round by which all the classifications have been finished and the answer is returned under the GLM.
The proofs of Lemmas EC.1, EC.2, and EC.3 closely follow those in Section EC.5.5.
Lemma EC.4
For the expected sample complexity with high probability event โฐ 1 , we have
๐ผ ๐ โ [ ๐ GLM โฃ โฐ 1 ]
โค ๐ โ max โก { 256 โ ๐ ๐ผ ๐ 2 โ ๐ min 2 โ log โก ( 2 โ ๐พ ๐ฟ โ log 2 โก 16 ๐ผ ๐ ) , 256 โ ๐ ๐ฝ ๐ 2 โ ๐ min 2 โ log โก ( 2 โ ๐พ ๐ฟ โ log 2 โก 16 ๐ฝ ๐ ) } + ๐ โ ( ๐ + 1 ) 2 โ ๐ GLM ,
(EC.11.9)
where ๐ is a universal constant, ๐ min is the known constant controlling the first-order derivative of the inverse link function, and ๐ GLM
๐ upper
max โก { โ log 2 โก 4 ๐ผ ๐ โ , โ log 2 โก 4 ๐ฝ ๐ โ } is the round where all the classifications have been finished and the answer is returned under GLM.
Proof EC.5
Proof. We also consider the decomposition of $N$ in equation (EC.5.26), where all expectations are conditioned on the high-probability event $\mathcal{E}_1$, given by
$$\begin{aligned}
\mathbb{E}_M[N_{\text{GLM}} \mid \mathcal{E}_1]
&\le \sum_{r=1}^{\infty} \mathbb{E}_M\big[ \mathbb{1}\big[ G_r \cup B_r \neq [K] \big] \mid \mathcal{E}_1 \big] \sum_{a \in \mathcal{X}(r-1)} T_a(r) \\
&\le \sum_{r=1}^{R_{\text{GLM}}} \Big( \frac{d\, 2^{2r+1}}{\kappa_{\min}^2} \log\Big( \frac{2Kr(r+1)}{\delta} \Big) + \frac{d(d+1)}{2} \Big) \\
&\le \frac{d(d+1)}{2}\, R_{\text{GLM}} + \frac{2d}{\kappa_{\min}^2} \log\Big( \frac{2K}{\delta} \Big) \sum_{r=1}^{R_{\text{GLM}}} 2^{2r} + \frac{4d}{\kappa_{\min}^2} \sum_{r=1}^{R_{\text{GLM}}} 2^{2r} \log(r+1) \\
&\le \frac{4}{\kappa_{\min}^2} \log\Big[ \frac{2K}{\delta} \big( R_{\text{GLM}} + 1 \big) \Big] \sum_{r=1}^{R_{\text{GLM}}} d\, 2^{2r} + \frac{d(d+1)}{2}\, R_{\text{GLM}} \quad \text{(EC.11.10)} \\
&\le c \max\Big\{ \frac{256\, d}{\alpha_{\varepsilon}^2\, \kappa_{\min}^2} \log\Big( \frac{2K}{\delta} \log_2 \frac{16}{\alpha_{\varepsilon}} \Big),\ \frac{256\, d}{\beta_{\varepsilon}^2\, \kappa_{\min}^2} \log\Big( \frac{2K}{\delta} \log_2 \frac{16}{\beta_{\varepsilon}} \Big) \Big\} + \frac{d(d+1)}{2}\, R_{\text{GLM}}, \quad \text{(EC.11.11)}
\end{aligned}$$
where $c$ is a universal constant. Then, letting $H = \min( \alpha_{\varepsilon}, \beta_{\varepsilon} )/16$ denote the minimum gap of the problem instance, we have
$$\mathbb{E}[N_{\text{GLM}} \mid \mathcal{E}] = \mathcal{O}\Big( \frac{d}{\kappa_{\min}^2}\, H^{-2} \log\Big( \frac{K}{\delta} \log_2\big( H^{-2} \big) \Big) + d^2 \log\big( H^{-1} \big) \Big). \tag{EC.11.12}$$
\Halmos

Appendix EC.12 Detailed Settings for Synthetic Experiments
We recap the figure here for better clarity.
[Figure EC.12.1: Illustration on Synthetic Settings. (a) Synthetic I - Adaptive Setting; (b) Synthetic II - Static Setting.]

EC.12.1 Synthetic I - Adaptive Setting.
First, we randomly sample $\tilde m$, representing the number of $\epsilon$-best arms, from a distribution with an expected value of $m$ (used as input for a top-$m$ algorithm). Next, we randomly sample $\tilde\epsilon$, representing the best-arm reward minus $\epsilon$, from a distribution with an expected value of $s$ (used as input for a threshold bandit algorithm). We then assign $\tilde m$ $\epsilon$-best arms with expected rewards uniformly distributed between $\tilde\epsilon + \epsilon$ and $\tilde\epsilon$. Additionally, we assign $( 1.5d - \tilde m )$ arms that are not $\epsilon$-best with expected rewards uniformly distributed between $\tilde\epsilon$ and $\tilde\epsilon - \epsilon$, as illustrated in Figure EC.12.1.
Based on these designed arm rewards, we define the linear model parameter as
$$\beta = \big( \tilde\epsilon + \epsilon,\ \tilde\epsilon + (\tilde m - 1)\epsilon/\tilde m,\ \ldots,\ \tilde\epsilon + \epsilon/\tilde m,\ 0,\ \ldots,\ 0 \big)^{\top}.$$
Arms are the $d$-dimensional canonical basis $e_1, e_2, \ldots, e_d$ and $( 1.5d - \tilde m )$ additional disturbing arms
$$x_k = \Bigg( \frac{\tilde\epsilon - ( 1.5d - \tilde m - k )\, \epsilon / ( 1.5d - \tilde m )}{\tilde\epsilon + \epsilon},\ 0,\ \cdots,\ 0,\ \sqrt{1 - \Big( \frac{\tilde\epsilon - ( 1.5d - \tilde m - k )\, \epsilon / ( 1.5d - \tilde m )}{\tilde\epsilon + \epsilon} \Big)^2}\, \Bigg)^{\top}$$
with $k \in [\, 1.5d - \tilde m \,]$.
In the adaptive setting, pulling one arm can provide information about the distributions of other arms. The optimal policy in this setting should adaptively refine its sampling and stopping strategy based on historical data. This allows the algorithm to focus more on the disturbing arms, making adaptive strategies particularly effective as the algorithm progresses. In our experiments, we set $m = 4$, $s = 1$, with $d = 10$ and $\epsilon \in \{0.1, 0.2, 0.3\}$. A total of six different problem instances are evaluated to compare the performance of the algorithms.
EC.12.2 Synthetic II - Static Setting.
We consider a static synthetic setting, similar to the one proposed by Xu et al. (2018), where arms are represented as $d$-dimensional canonical basis vectors $e_1, e_2, \ldots, e_d$. We set the parameter vector $\beta = ( \Delta, \ldots, \Delta, 0, \ldots, 0 )^{\top}$, where $\tilde m$ elements are $\Delta$ and $d - \tilde m$ elements are $0$. In this setting, $\mathbb{E}[\tilde m] = m$, and only the value $m$ is provided as input to top-$m$ algorithms. Consequently, the true mean values consist of some $\Delta$'s and some $0$'s.
If we set $\epsilon = \Delta/2$, then as $\Delta$ approaches $0$, it becomes difficult to distinguish between the $\epsilon$-best arms and the arms that are not $\epsilon$-best. In the static setting, knowledge of the rewards does not alter the sampling strategy, as all arms must be estimated with equal accuracy to effectively differentiate between them. Therefore, a static policy is optimal in this case, and the goal of this setting is to assess the ability of our algorithm to adapt to such static conditions. In our experiment, we set $m = 4$ with $d \in \{8, 12, 16\}$ and $\Delta = 1$. A total of three different problem instances are evaluated to compare the algorithms.
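The static instance above can be written out in a few lines (a sketch consistent with the description; variable names are ours), which also confirms that exactly $\tilde m$ arms are $\epsilon$-best when $\epsilon = \Delta/2$:

```python
import numpy as np

# Synthetic II: canonical-basis arms, beta = (Delta,...,Delta, 0,...,0);
# with eps = Delta/2 the eps-best set is exactly the m_tilde Delta-arms.
d, m_tilde, Delta = 8, 4, 1.0
eps = Delta / 2
beta = np.concatenate([np.full(m_tilde, Delta), np.zeros(d - m_tilde)])
arms = np.eye(d)
rewards = arms @ beta
eps_best = np.flatnonzero(rewards >= rewards.max() - eps)
print(eps_best)  # [0 1 2 3]
```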
Appendix EC.13 Auxiliary Results
The following lemma shows that matrix inversion reverses the order relation.
Lemma EC.1
(Inversion Reverses Loewner Orders) Let $\boldsymbol{A}, \boldsymbol{B} \in \mathbb{R}^{d \times d}$. Suppose that $\boldsymbol{A} \succeq \boldsymbol{B}$ and $\boldsymbol{B}$ is positive definite; then we have
$$\boldsymbol{A}^{-1} \preceq \boldsymbol{B}^{-1}. \tag{EC.13.1}$$
Proof EC.2
Proof. By definition, to show that $\boldsymbol{B}^{-1} - \boldsymbol{A}^{-1}$ is a positive semi-definite matrix, it suffices to show that $\|\boldsymbol{x}\|_{\boldsymbol{B}^{-1}}^2 - \|\boldsymbol{x}\|_{\boldsymbol{A}^{-1}}^2 \ge 0$ for any $\boldsymbol{x} \in \mathbb{R}^d$. Then, by the Cauchy-Schwarz inequality,
$$\|\boldsymbol{x}\|_{\boldsymbol{A}^{-1}}^2 = \langle \boldsymbol{x},\, \boldsymbol{A}^{-1}\boldsymbol{x} \rangle \le \|\boldsymbol{x}\|_{\boldsymbol{B}^{-1}}\, \|\boldsymbol{A}^{-1}\boldsymbol{x}\|_{\boldsymbol{B}} \le \|\boldsymbol{x}\|_{\boldsymbol{B}^{-1}}\, \|\boldsymbol{A}^{-1}\boldsymbol{x}\|_{\boldsymbol{A}} = \|\boldsymbol{x}\|_{\boldsymbol{B}^{-1}}\, \|\boldsymbol{x}\|_{\boldsymbol{A}^{-1}}. \tag{EC.13.2}$$
Hence $\|\boldsymbol{x}\|_{\boldsymbol{A}^{-1}} \le \|\boldsymbol{x}\|_{\boldsymbol{B}^{-1}}$ for all $\boldsymbol{x}$, which completes the lemma. \Halmos
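A quick numerical check of the lemma (ours): build $\boldsymbol{B} \succ 0$, add a PSD perturbation to get $\boldsymbol{A} \succeq \boldsymbol{B}$, and verify that $\boldsymbol{B}^{-1} - \boldsymbol{A}^{-1}$ is positive semi-definite:

```python
import numpy as np

# A >= B > 0 in the Loewner order implies B^{-1} - A^{-1} is PSD.
rng = np.random.default_rng(3)
d = 5
M = rng.standard_normal((d, d))
B = M @ M.T + np.eye(d)              # positive definite
N = rng.standard_normal((d, d))
A = B + N @ N.T                      # A = B + PSD term, so A >= B
diff = np.linalg.inv(B) - np.linalg.inv(A)
min_eig = float(np.linalg.eigvalsh(diff).min())
print(min_eig)                       # nonnegative up to numerical error
```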
The following lemma establishes an upper bound on the ratio between two optimization problems that incorporate instance-specific information from the bandit setting.
Lemma EC.3
Suppose $\mathcal{X}_I = [K]$, i.e., the entire set of arms is under consideration. For any arm $a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}$, with $y_{1,a} \triangleq x_1 - x_a$, we have
$$\frac{ \min_{\lambda \in \mathcal{P}_K} \max_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_{\boldsymbol{V}(\lambda)^{-1}}^2 }{ \min_{\lambda \in \mathcal{P}_K} \min_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_{\boldsymbol{V}(\lambda)^{-1}}^2 } \le \frac{d\, \delta_1}{u_{\mathcal{A}}^2\, \delta_2}. \tag{EC.13.3}$$
Proof EC.4
Proof. For any arm $a \in [K]$, from the perspective of a geometric quantity, let $\operatorname{conv}(\mathcal{X} \cup -\mathcal{X})$ denote the convex hull of the symmetric set $\mathcal{X} \cup -\mathcal{X}$. Then, for any set $\mathcal{A} \subseteq \mathbb{R}^d$, define the gauge of $\mathcal{A}$ as
$$u_{\mathcal{A}} \triangleq \max\big\{ u > 0 :\ u\mathcal{A} \subseteq \operatorname{conv}(\mathcal{X} \cup -\mathcal{X}) \big\}. \tag{EC.13.4}$$
We then provide a natural upper bound for $\min_{\lambda \in \mathcal{P}_K} \max_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_{\boldsymbol{V}(\lambda)^{-1}}^2$, given by
$$\begin{aligned}
\min_{\lambda \in \mathcal{P}_K} \max_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_{\boldsymbol{V}(\lambda)^{-1}}^2
&\le \min_{\lambda \in \mathcal{P}_K} \max_{y \in \mathcal{A}(\mathcal{X}_I)} \| y \|_{\boldsymbol{V}(\lambda)^{-1}}^2 \\
&= \frac{1}{u_{\mathcal{A}}^2} \min_{\lambda \in \mathcal{P}_K} \max_{y \in \mathcal{A}(\mathcal{X}_I)} \| u_{\mathcal{A}}\, y \|_{\boldsymbol{V}(\lambda)^{-1}}^2 \\
&\le \frac{1}{u_{\mathcal{A}}^2} \min_{\lambda \in \mathcal{P}_K} \max_{x \in \operatorname{conv}(\mathcal{X} \cup -\mathcal{X})} \| x \|_{\boldsymbol{V}(\lambda)^{-1}}^2 \\
&= \frac{1}{u_{\mathcal{A}}^2} \min_{\lambda \in \mathcal{P}_K} \max_{x \in \mathcal{X}_I} \| x \|_{\boldsymbol{V}(\lambda)^{-1}}^2 \;\le\; \frac{d}{u_{\mathcal{A}}^2},
\end{aligned} \tag{EC.13.5}$$
where the third line follows from the fact that the maximum value of a convex function on a convex set must occur at a vertex. With the Kiefer-Wolfowitz theorem for the G-optimal design, the last inequality is achieved.
Furthermore, for any arm $a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}$, we have
$$\begin{aligned}
\min_{\lambda \in \mathcal{P}_K} \min_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_{\boldsymbol{V}(\lambda)^{-1}}^2
&\ge \min_{\lambda \in \mathcal{P}_K} \min_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \operatorname{eig}_{\min}\big( \boldsymbol{V}(\lambda)^{-1} \big)\, \| y_{1,a} \|_2^2 \\
&= \min_{\lambda \in \mathcal{P}_K} \min_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \frac{1}{\operatorname{eig}_{\max}\big( \boldsymbol{V}(\lambda) \big)}\, \| y_{1,a} \|_2^2 \\
&\ge \frac{1}{\max_{x \in \mathcal{X}_I} \| x \|_2^2}\; \min_{\lambda \in \mathcal{P}_K} \min_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_2^2,
\end{aligned} \tag{EC.13.6}$$
where $\operatorname{eig}_{\max}(\cdot)$ and $\operatorname{eig}_{\min}(\cdot)$ are respectively the largest and smallest eigenvalues of a matrix. The first line follows from the Rayleigh quotient and the Rayleigh theorem. The last line is derived from the relationship $\operatorname{eig}_{\max}\big( \boldsymbol{V}(\lambda) \big) \le \max_{x \in \mathcal{X}_I} \| x \|_2^2$. Recalling the assumption in Theorem 4.2 that $\min_{a \in G_{\varepsilon} \setminus \{1\}} \| x_1 - x_a \|_2^2 \ge \delta_2$ and the assumption in Section 2 that $\| x_a \|_2^2 \le \delta_1$ for all $a \in [K]$, we have
$$\min_{\lambda \in \mathcal{P}_K} \min_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_{\boldsymbol{V}(\lambda)^{-1}}^2 \ge \frac{\delta_2}{\delta_1}. \tag{EC.13.7}$$
Finally, combining inequalities (EC.13.5) and (EC.13.7) completes the lemma. \Halmos
Lemma EC.5 (Kiefer and Wolfowitz (1960))
If the arm vectors $x \in \mathcal{X}$ span $\mathbb{R}^d$, then for a probability distribution $\pi^* \in \mathcal{P}(\mathcal{X})$, the following statements are equivalent:
1. $\pi^*$ minimizes the function $g(\pi) = \max_{a \in \mathcal{X}} \| a \|_{\boldsymbol{V}(\pi)^{-1}}^2$.
2. $\pi^*$ maximizes the function $f(\pi) = \log\det \boldsymbol{V}(\pi)$.
3. $g(\pi^*) = d$.
Additionally, there exists a minimizer $\pi^*$ of $g(\pi)$ such that the size of its support, $|\operatorname{Supp}(\pi^*)|$, is at most $d(d+1)/2$.
Appendix References
Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems 24.
Abe N, Long PM (1999) Associative reinforcement learning using linear probabilistic concepts. ICML, 3-11 (Citeseer).
Ahn D, Shin D (2020) Ordinal optimization with generalized linear model. 2020 Winter Simulation Conference (WSC), 3008-3019 (IEEE).
Ahn D, Shin D, Zeevi A (2024) Feature misspecification in sequential learning problems. Management Science.
Azizi MJ, Kveton B, Ghavamzadeh M (2021a) Fixed-budget best-arm identification in contextual bandits: A static-adaptive algorithm. CoRR abs/2106.04763.
Azizi MJ, Kveton B, Ghavamzadeh M (2021b) Fixed-budget best-arm identification in structured bandits. arXiv preprint arXiv:2106.04763.
Chaloner K, Verdinelli I (1995) Bayesian experimental design: A review. Statistical Science 273-304.
Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems 24.
Fiez T, Jain L, Jamieson KG, Ratliff L (2019) Sequential experimental design for transductive linear bandits. Advances in Neural Information Processing Systems 32.
Filippi S, Cappé O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems 23.
Gabillon V, Ghavamzadeh M, Lazaric A (2012) Best arm identification: A unified approach to fixed budget and fixed confidence. Advances in Neural Information Processing Systems 25.
Garivier A, Kaufmann E (2016) Optimal best arm identification with fixed confidence. Conference on Learning Theory, 998-1027 (PMLR).
Ghosh A, Chowdhury SR, Gopalan A (2017) Misspecified linear bandits. Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
Hoffman M, Shahriari B, Freitas N (2014) On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. Artificial Intelligence and Statistics, 365-374 (PMLR).
Kaufmann E, Cappé O, Garivier A (2016) On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research 17:1-42.
Kaufmann E, Koolen WM (2021) Mixture martingales revisited with applications to sequential tests and confidence intervals. Journal of Machine Learning Research 22(1):11140-11183.
Kiefer J, Wolfowitz J (1960) The equivalence of two extremum problems. Canadian Journal of Mathematics 12:363-366.
Kveton B, Zaheer M, Szepesvari C, Li L, Ghavamzadeh M, Boutilier C (2023) Randomized exploration in generalized linear bandits.
Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press).
Lattimore T, Szepesvari C, Weisz G (2020) Learning with good feature representations in bandits and in RL with a generative model.
Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. Proceedings of the 19th International Conference on World Wide Web, 661-670, WWW '10 (New York, NY, USA: Association for Computing Machinery).
McCullagh P (2019) Generalized Linear Models (Routledge).
Qin C, You W (2025) Dual-directed algorithm design for efficient pure exploration. Operations Research.
Réda C, Tirinzoni A, Degenne R (2021) Dealing with misspecification in fixed-confidence linear top-m identification. Advances in Neural Information Processing Systems 34:25489-25501.
Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z, et al. (2018) A tutorial on Thompson sampling. Foundations and Trends in Machine Learning 11(1):1-96.
Soare M, Lazaric A, Munos R (2014) Best-arm identification in linear bandits. Advances in Neural Information Processing Systems 27.
Wang PA, Tzeng RC, Proutiere A (2021) Fast pure exploration via Frank-Wolfe. Advances in Neural Information Processing Systems 34:5810-5821.
Xu L, Honda J, Sugiyama M (2018) A fully adaptive algorithm for pure exploration in linear bandits. International Conference on Artificial Intelligence and Statistics, 843-851 (PMLR).
Yang J, Tan VY (2021) Minimax optimal fixed-budget best arm identification in linear bandits. arXiv preprint arXiv:2105.13017.