Title: Identifying All ε-Best Arms in (Misspecified) Linear Bandits
URL Source: https://arxiv.org/html/2510.00073
License: arXiv.org perpetual non-exclusive license. arXiv:2510.00073v1 [stat.ML] 29 Sep 2025

Identifying All ε-Best Arms in (Misspecified) Linear Bandits
Zhekai Li, Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, zhekaili@mit.edu
Tianyi Ma, School of Operations Research and Information Engineering, Cornell University, tm693@cornell.edu
Cheng Hua, Antai College of Economics and Management, Shanghai Jiao Tong University, cheng.hua@sjtu.edu.cn
Ruihao Zhu, SC Johnson College of Business, Cornell University, ruihao.zhu@cornell.edu
Abstract

Motivated by the need to efficiently identify multiple candidates in high trial-and-error cost tasks such as drug discovery, we propose a near-optimal algorithm to identify all ε-best arms (i.e., those at most ε worse than the optimum). Specifically, we introduce LinFACT, an algorithm designed to optimize the identification of all ε-best arms in linear bandits. We establish a novel information-theoretic lower bound on the sample complexity of this problem and demonstrate that LinFACT achieves instance optimality by matching this lower bound up to a logarithmic factor. A key ingredient of our proof is to integrate the lower bound directly into the scaling process for upper bound derivation, determining the termination round and thus the sample complexity. We also extend our analysis to settings with model misspecification and generalized linear models. Numerical experiments, including synthetic and real drug discovery data, demonstrate that LinFACT identifies more promising candidates with reduced sample complexity, offering significant computational efficiency and accelerating early-stage exploratory experiments.
Keywords: ranking and selection; sequential decision making; simulation; adaptive experiment; model misspecification
1 Introduction
This paper addresses the problem of identifying the best set of options from a finite pool of candidates. A decision-maker sequentially selects candidates for evaluation, observing independent noisy rewards that reflect their quality. The goal is to strategically allocate measurement efforts to identify the desired candidates. This problem belongs to the class of pure exploration problems, which fall under the bandit framework but differ from traditional multi-armed bandits (MABs) that balance exploration and exploitation to minimize cumulative regret. Instead, pure exploration focuses on efficient information gathering to confidently apply the chosen best or best set of options. This approach is particularly relevant in applications such as drug discovery and product testing, where identifying the most promising candidates is followed by utilizing them under high-cost conditions, such as clinical trials or large-scale manufacturing tests.
Conventional pure exploration focuses on identifying the optimal candidate, often referred to as the best arm in the bandit setting. However, in many real-world scenarios, candidates with rewards falling slightly below the optimum may later demonstrate advantageous traits, such as fewer side effects, simpler manufacturing processes, or lower resistance during implementation. Motivated by this insight, this paper aims to identify all ε-best candidates (i.e., those whose performance is at most ε worse than the optimum). This approach is especially valuable when exploring a range of nearly optimal options is necessary. Promoting multiple promising candidates not only mitigates risk but also increases the chances that at least one will prove successful.
This setting captures many applications in real-world scenarios. For example:
- Drug Discovery: In drug discovery, pharmaceutical companies aim to identify as many promising drug candidates as possible during preclinical stages. These candidates, known as preclinical candidate compounds (PCCs), are optimized compounds prepared for preclinical testing to assess efficacy, safety, and pharmacokinetics before advancing to clinical trials. Given the inherent risks and high failure rates in subsequent drug development (Das et al. 2021), starting with a larger pool of potential candidates increases the chances of identifying at least one successful, marketable drug.
- Assortment Design: In e-commerce (Boyd and Bilegan 2003, Elmaghraby and Keskinocak 2003, Feng et al. 2022), recommender systems (Huang et al. 2007, Peukert et al. 2023), and streaming services (Alaei et al. 2022, Godinho de Matos et al. 2018), expanding the consideration set (e.g., products, movies, or songs) can improve user satisfaction and increase revenues. Offering a diverse range of recommendations helps cater to varying tastes and preferences.
- Automatic Machine Learning: In automatic machine learning (AutoML) (Thornton et al. 2013), the goal is to automate the process of selecting algorithms and tuning hyperparameters by providing multiple promising choices for predictive tasks. Due to the randomness of limited data, the best out-of-sample model may not always be optimal. Therefore, providing users with a diverse set of models that yield good results is critical to assisting them in selecting the best algorithm and hyperparameters.
1.1 Main Contributions
This paper focuses on identifying all ε-best arms in the linear bandit setting and presents contributions in both the algorithmic and theoretical dimensions.
- δ-Probably Approximately Correct (PAC) Algorithm: On the algorithmic front, we introduce LinFACT (Linear Fast Arm Classification with Threshold estimation), a δ-PAC (see Section 2.1 for a formal definition) phase-based algorithm for identifying all ε-best arms in linear bandit problems. LinFACT demonstrates superior effectiveness compared to existing pure exploration algorithms.
- Matching Bound of Sample Complexity: We make two key technical contributions. First, we derive an information-theoretic lower bound on the problem complexity for identifying all ε-best arms in linear bandits. To the best of our knowledge, this is the first such result in the literature. Second, we establish two distinct upper bounds on the expected sample complexity of LinFACT, illustrating the differences between various optimal design criteria used in its implementation. Notably, we demonstrate that LinFACT achieves instance optimality when using the 𝒳𝒴-optimal design criterion, matching the lower bound up to a logarithmic factor. The 𝒳𝒴-optimal design focuses on contrasting pairs of arms rather than evaluating each arm individually. Our analysis leverages the lower bound directly in defining the classification termination round and in scaling the upper bound.
- Accounting for Misspecified Models and GLMs: We extend our framework beyond linear models to handle misspecified linear bandits and generalized linear models (GLMs). For both cases, we provide theoretical upper bounds on the expected sample complexity. Furthermore, we analyze how prior knowledge of model misspecification impacts the algorithmic upper bounds and performance, and how the incorporation of GLMs influences the sample complexity.
- Numerical Studies with Real-World Datasets: We conduct extensive numerical experiments to demonstrate that our LinFACT algorithm outperforms existing methods in terms of sample complexity, computational efficiency, and reliable identification of all ε-best arms. In experiments with synthetic data, LinFACT outperforms other baselines in both adaptive and static settings. Using a real-world drug discovery dataset (Free and Wilson 1964), we further show that LinFACT achieves superior performance compared to previous algorithms. Notably, LinFACT is computationally efficient with time complexity O(K d²), which is lower than the O(M K d²) of Lazy TTS (M is a non-negligible number) (Rivera and Tewari 2024), the O(K d³) of top-m algorithms (Réda et al. 2021a), and the O(K² log K) of KGCB (Negoescu et al. 2011).
1.2 Related Literature
Pure Exploration. The multi-armed bandits (MABs) model has been a critical paradigm for addressing the exploration-exploitation trade-off since its introduction by Thompson (1933) in the context of medical trials. While much of the research focuses on minimizing cumulative regret (Bubeck et al. 2012, Lattimore and Szepesvári 2020), our work focuses on the pure exploration setting (Koenig and Law 1985), where the goal is to select a subset of arms and evaluation is based on the final outcome. This distinction highlights the context-specific benefits of each approach: MABs are suited for tasks where the goal is to optimize rewards in real-time, balancing exploration and exploitation, whereas pure exploration is focused on identifying a set of satisfactory arms, without the immediate concern for reward maximization. In pure exploration, the algorithm prioritizes information gathering over reward collection, transforming the objective from reward-centric to information-centric. The focus is on efficiently acquiring sufficient information about all arms for confident identification.
The origins of pure exploration problems date back to the 1950s in the context of stochastic simulation, specifically within ordinal optimization (Shin et al. 2018, Shen et al. 2021) or the Ranking and Selection (R&S) problem, first addressed by Bechhofer (1954). Various methodologies have since been developed to solve the canonical R&S problem, including elimination-type algorithms (Kim and Nelson 2001, Bubeck et al. 2013, Fan et al. 2016), Optimal Computing Budget Allocation (OCBA) (Chen et al. 2000), knowledge-gradient algorithms (Frazier et al. 2008, 2009, Ryzhov et al. 2012), UCB-type algorithms (Kaufmann and Kalyanakrishnan 2013), and the unified gap-based exploration (UGapE) algorithm (Gabillon et al. 2012). Comprehensive reviews of the R&S problem can be found in Kim and Nelson (2006), Hong et al. (2021), with the most recent overview provided by Li et al. (2024).
The general framework of pure exploration encompasses various exploration tasks (Qin and You 2025), including best arm identification (BAI) (Mannor and Tsitsiklis 2004, Even-Dar et al. 2006, Russo 2020, Komiyama et al. 2023, Simchi-Levi et al. 2024), top-m identification (Kalyanakrishnan and Stone 2010, Kalyanakrishnan et al. 2012), threshold bandits (Locatelli et al. 2016, Abernethy et al. 2016), and satisficing bandits (Feng et al. 2025). In applications such as drug discovery, pharmacologists aim to identify a set of highly potent drug candidates from potentially millions of compounds, with only the selected candidates advancing to more extensive testing. Given the uncertainty of final outcomes and the high cost of trial-and-error, identifying multiple promising candidates simultaneously is crucial. To minimize the cost of early-stage exploration, adaptive, sequential experimental designs are necessary, as they require fewer experiments compared to fixed designs.
All ε-Best Arms Identification. Conventional objectives, such as identifying the top-m best arms or all arms above a certain threshold, often face significant challenges. In the top-m task, selecting a small m may exclude promising candidates, while choosing a large m may include ineffective options, requiring an impractically large number of experiments. Similarly, setting a threshold too high can exclude viable candidates. Both approaches depend on prior knowledge of the problem to achieve good performance, which may not be available in real-world applications.
In contrast, identifying all ε-best arms (those within ε of the best arm) overcomes these limitations. This approach promotes broader exploration while providing a robust guarantee: no significantly suboptimal arms will be selected, thereby improving the reliability of downstream tasks (Mason et al. 2020). The all ε-best arms identification problem generalizes both the top-m and threshold bandit problems. It reduces to the top-m problem if the number of ε-best arms is known in advance, and to a threshold bandit problem if the value of the best arm is known.
Mason et al. (2020) introduced the problem complexity for identifying all ε-best arms and derived a lower bound in the low-confidence regime. However, their lower bound involves a summation that may be unnecessary, indicating room for improvement. Building on Mason's work, Al Marjani et al. (2022) derived tighter lower bounds by fully characterizing the alternative bandit instances that an optimal sampling strategy must distinguish and eliminate. They also proposed the asymptotically optimal Track-and-Stop algorithm. However, both Mason et al. (2020) and Al Marjani et al. (2022) consider stochastic bandits without structure. In contrast, we study this problem in the linear bandit setting (Abe and Long 1999), which leverages structural relationships among arms. This presents new challenges but also allows us to handle more complex scenarios. As a result, our work establishes the first information-theoretic lower bound for identifying all ε-best arms in linear bandits, applicable to any δ-PAC algorithm.
An extended literature review of misspecified linear bandits and generalized linear bandit models can be found in Section EC.1 of the online appendix.
2 Problem Formulation
Notations. In this paper, we denote the set of positive integers up to n by [n] := {1, …, n}. Vectors and matrices are represented using boldface notation. The inner product of two vectors is denoted by ⟨·, ·⟩. We define the weighted matrix norm ‖x‖_A := √(x^⊤ A x), where A is a positive semi-definite matrix that weights and scales the norm. For two probability measures P and Q over a common measurable space, the Kullback-Leibler divergence between P and Q is defined as

KL(P, Q) := ∫ log(dP/dQ) dP if P ≪ Q, and KL(P, Q) := ∞ otherwise, (1)

where dP/dQ is the Radon-Nikodym derivative of P with respect to Q, and P ≪ Q indicates that P is absolutely continuous with respect to Q.
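Since the lower-bound analysis later specializes to unit-variance Gaussian rewards, the divergence in (1) then admits a closed form; a minimal Python sketch (the function name is ours):

```python
def kl_gaussian_unit_var(mu_p: float, mu_q: float) -> float:
    """KL divergence between N(mu_p, 1) and N(mu_q, 1).

    For unit-variance Gaussians, the integral in equation (1)
    evaluates to (mu_p - mu_q)^2 / 2.
    """
    return 0.5 * (mu_p - mu_q) ** 2

# Arms whose means differ by 0.4 are hard to distinguish: small divergence.
print(kl_gaussian_unit_var(1.0, 0.6))  # ~0.08
```

The quadratic dependence on the mean gap is what makes close-to-threshold arms expensive to classify.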
Setting. We address the problem of identifying all ε-best arms from a finite set of K arms, where K is a (possibly large) positive integer. Each arm i ∈ [K] has an associated reward distribution with an unknown fixed mean μ_i. Let the mean vector of all arms be denoted as μ := (μ_1, μ_2, …, μ_K), which can only be estimated through bandit feedback from the selected arms. Without loss of generality, we assume μ_1 > μ_2 ≥ … ≥ μ_K. The gap Δ_i := μ_1 − μ_i (for i ≠ 1) represents the difference in expected rewards between the optimal arm and arm i. To this end, we give a formal definition of the notion of ε-best arm.
Definition 2.1 (ε-Best Arm). Given ε ≥ 0, an arm i is called ε-best if μ_i ≥ μ_1 − ε.
Here, we adopt an additive framework to define ε-best arms. There also exists a multiplicative counterpart, where an arm i is considered ε-best if μ_i ≥ (1 − ε) μ_1. While our study focuses on the additive model, the analysis for the multiplicative model follows similar reasoning. We denote the set of all ε-best arms for a mean vector μ as

G_ε(μ) := {i : μ_i ≥ μ_1 − ε}. (2)
Define α_ε := min_{i ∈ G_ε(μ)} (μ_i − (μ_1 − ε)) as the distance from the smallest ε-best arm to the threshold μ_1 − ε. Furthermore, if the complement of G_ε(μ), denoted G_ε^c(μ), is non-empty, we define β_ε := min_{i ∈ G_ε^c(μ)} ((μ_1 − ε) − μ_i) as the closest distance from the threshold to the highest mean value of any arm that is not ε-best.
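To make these definitions concrete, the set G_ε(μ) and the margins α_ε, β_ε can be computed directly from a mean vector (a minimal sketch; the function names are ours):

```python
def eps_best_set(mu, eps):
    """Indices of all eps-best arms: {i : mu_i >= mu_1 - eps} (equation (2))."""
    threshold = max(mu) - eps
    return {i for i, m in enumerate(mu) if m >= threshold}

def margins(mu, eps):
    """Margins alpha_eps and beta_eps around the threshold mu_1 - eps.
    beta_eps is None when every arm is eps-best."""
    threshold = max(mu) - eps
    good = eps_best_set(mu, eps)
    alpha = min(mu[i] - threshold for i in good)
    rest = [mu[i] for i in range(len(mu)) if i not in good]
    beta = (threshold - max(rest)) if rest else None
    return alpha, beta

mu = [1.0, 0.9, 0.75, 0.2]
print(eps_best_set(mu, 0.3))        # {0, 1, 2}
alpha, beta = margins(mu, 0.3)      # alpha ~ 0.05, beta ~ 0.5
```

Small α_ε or β_ε means some arm sits close to the threshold, which is precisely what drives the sample complexity up.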
We study this problem under a linear structure, where the mean values depend on an unknown parameter vector θ ∈ ℝ^d. Each arm i is associated with a feature vector a_i ∈ ℝ^d. Let 𝒜 ⊂ ℝ^d be the set of feature vectors, and let 𝐗 := [a_1, a_2, …, a_K]^⊤ ∈ ℝ^{K×d} be the feature matrix. With parameter θ, the mean vector can be represented as μ(θ). For simplicity, we will consistently refer to the bandit instance as μ. When an arm A_t with corresponding feature vector a_{A_t} ∈ 𝒜 is selected at time t, we observe the bandit feedback X_t, given by

X_t := a_{A_t}^⊤ θ + η_t, (3)

where μ_{A_t} := a_{A_t}^⊤ θ is the true mean reward of the selected arm, and η_t is a noise variable. We also make two additional standard assumptions on the norm of the feature vectors and the noise distribution (Abbasi-Yadkori et al. 2011).

Assumption 1. Assume max_{i ∈ [K]} ‖a_i‖_2 ≤ L_1, where ‖·‖_2 denotes the ℓ₂-norm and L_1 is a constant.
Assumption 2. The noise η_t is conditionally 1-sub-Gaussian, i.e., for any λ ∈ ℝ,

𝔼[exp(λ η_t) | a_{A_1}, …, a_{A_{t−1}}, η_1, …, η_{t−1}] ≤ exp(λ²/2). (4)

2.1 Probably Approximately Correct Algorithm Framework
Our goal is to identify all ε-best arms with high confidence while minimizing the sampling budget. To achieve this, we employ three main components: a stopping rule, a sampling rule, and a decision rule.
At each time step t, the stopping rule τ_δ determines whether to continue or stop the process. If the process continues, an arm is selected according to the sampling rule, and the corresponding random reward is observed. When the process stops at t = τ_δ, a decision rule provides an estimate Î_{τ_δ} of the true solution set ℐ*(μ), which in our problem is the set of all ε-best arms, G_ε(μ).
We define the set of all viable mean vectors μ as

ℳ := {μ ∈ ℝ^K | ∃ θ ∈ ℝ^d : μ = 𝐗θ and ‖a_i‖_2 ≤ L_1 for each i ∈ [K]}. (5)

Here, the set ℳ consists of all possible mean vectors μ that can be expressed as a linear transformation of a parameter vector θ through the feature matrix 𝐗.
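The linear-structure requirement μ = 𝐗θ in (5) can be checked numerically with a least-squares residual (an illustrative sketch; the helper name and the toy feature matrix are ours):

```python
import numpy as np

def in_model_class(mu, X, tol=1e-9):
    """Check the linear-structure part of membership in M (equation (5)):
    does there exist theta with mu = X @ theta?"""
    theta, *_ = np.linalg.lstsq(X, mu, rcond=None)
    realizable = bool(np.linalg.norm(X @ theta - mu) <= tol)
    return realizable, theta

# K = 3 arms with d = 2 features: the third arm's mean is forced
# to equal the sum of the first two.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
ok, theta = in_model_class(np.array([0.3, 0.4, 0.7]), X)   # consistent
bad, _ = in_model_class(np.array([0.3, 0.4, 1.0]), X)      # inconsistent
```

The point of the restriction is that, unlike in unstructured bandits, not every mean vector is admissible: the linear structure couples the arms' means.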
We focus on algorithms that are probably approximately correct with high confidence, referred to as δ-PAC algorithms.
Definition 2.2 (δ-PAC Algorithm). An algorithm is δ-PAC for all ε-best arms identification if it identifies the correct solution set with probability at least 1 − δ for any problem instance with mean μ ∈ ℳ, i.e.,

ℙ_μ(τ_δ < ∞, Î_{τ_δ} = G_ε(μ)) ≥ 1 − δ, ∀ μ ∈ ℳ. (6)
Upon stopping, δ-PAC algorithms ensure the identification of all ε-best arms with high confidence. Therefore, our goal is to design a δ-PAC algorithm that minimizes the stopping time, formulated as the following optimization problem:

min 𝔼_μ[τ_δ] (7)

s.t. ℙ_μ(τ_δ < ∞, Î_{τ_δ} = G_ε(μ)) ≥ 1 − δ, ∀ μ ∈ ℳ. (8)

2.2 Optimal Design of Experiment
Linear bandit algorithms can be viewed as an online, adaptive counterpart to the classical optimal design problem. This section develops the key theoretical foundations, emphasizing the confidence bounds for parameter estimation that guide both the design of our algorithm and the subsequent analysis.
Ordinary Least Squares. Consider a sequence of pulled arms A_1, A_2, …, A_t and the corresponding observed rewards X_1, X_2, …, X_t. If the feature vectors of these arms, a_{A_1}, a_{A_2}, …, a_{A_t}, span ℝ^d, the ordinary least squares (OLS) estimator of the parameter θ is given by

θ̂_t := V_t^{−1} Σ_{s=1}^t a_{A_s} X_s, (9)

where V_t := Σ_{s=1}^t a_{A_s} a_{A_s}^⊤ ∈ ℝ^{d×d} represents the information matrix. Using the properties of sub-Gaussian random variables, we can derive a confidence bound for the OLS estimator. This bound, denoted B_{t,δ}, is detailed in Proposition 2.3. The confidence region for the parameter θ at time step t is given by

𝒞_{t,δ} := {θ′ : ‖θ̂_t − θ′‖_{V_t} ≤ B_{t,δ}}. (10)

Proposition 2.3 (Lattimore and Szepesvári (2020)). For any fixed sampling policy and any given vector a ∈ ℝ^d, with probability at least 1 − δ, the following holds:

|a^⊤(θ̂_t − θ)| ≤ ‖a‖_{V_t^{−1}} B_{t,δ}, (11)

where the anytime confidence bound B_{t,δ} is given by B_{t,δ} := 2√(2 (d log 6 + log(1/δ))).
In many practical scenarios, the observed data are not predetermined. To handle this, a martingale-based method can be employed, as described by Abbasi-Yadkori et al. (2011), to define an adaptive confidence bound for the OLS estimator. This accounts for the variability introduced by random rewards and adaptive sampling policies. The confidence interval in Proposition 2.3 highlights the connection between arm allocation policies in linear bandits and experimental design theory (Pukelsheim 2006). This connection serves as a fundamental component in constructing our algorithm.
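The OLS estimate in (9) and the fixed-design width in (11) can be sketched numerically as follows (illustrative; the simulated instance, seed, and function names are ours):

```python
import numpy as np

def ols_estimate(features, rewards):
    """OLS estimator of theta from pulled-arm features and rewards (equation (9))."""
    V_t = features.T @ features                    # information matrix V_t
    theta_hat = np.linalg.solve(V_t, features.T @ rewards)
    return theta_hat, V_t

def confidence_width(a, V_t, d, delta):
    """Fixed-design width in (11): ||a||_{V_t^{-1}} * B_{t,delta},
    with B_{t,delta} = 2 * sqrt(2 * (d * log 6 + log(1/delta)))."""
    B = 2.0 * np.sqrt(2.0 * (d * np.log(6.0) + np.log(1.0 / delta)))
    return float(np.sqrt(a @ np.linalg.solve(V_t, a))) * B

rng = np.random.default_rng(0)
theta = np.array([0.8, -0.3])
A = rng.normal(size=(200, 2))            # features of 200 pulled arms
X = A @ theta + rng.normal(size=200)     # rewards with N(0, 1) noise, as in (3)
theta_hat, V = ols_estimate(A, X)
width = confidence_width(np.array([1.0, 0.0]), V, d=2, delta=0.05)
```

The width shrinks as the information matrix V_t grows, which is the mechanism the phased algorithm later exploits.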
Feature Vector Projection. At any time step where an estimate must be formed after sampling, if the feature vectors of the sampled arms do not span ℝ^d, we substitute them with dimensionality-reduced feature vectors (Yang and Tan 2022). Specifically, we project all feature vectors onto the subspace spanned by the sampled feature vectors. Let 𝐏 ∈ ℝ^{d×d′} be an orthonormal basis of this subspace, where d′ < d is the dimension of the subspace. The new feature vector is then given by a′ := 𝐏^⊤ a. In this transformation, 𝐏𝐏^⊤ is a projection matrix, ensuring

⟨θ, a⟩ = ⟨θ, 𝐏𝐏^⊤ a⟩ = ⟨𝐏^⊤ θ, 𝐏^⊤ a⟩ = ⟨θ′, a′⟩. (12)

Equation (12) ensures that the mean values of all arms remain unchanged under the projection. The first equality holds because 𝐏𝐏^⊤ is a projection matrix and a lies in the subspace spanned by 𝐏 (i.e., 𝐏𝐏^⊤ a = a). The second equality follows by writing both inner products in the common matrix form θ^⊤ 𝐏𝐏^⊤ a. The third equality holds by definition of θ′ := 𝐏^⊤ θ and a′ := 𝐏^⊤ a.
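The identity (12) can be verified numerically (a sketch, assuming the sampled feature vectors are linearly independent so the SVD basis spans exactly their row space; the toy vectors are ours):

```python
import numpy as np

def project_features(features):
    """Orthonormal basis P for the span of the sampled feature vectors,
    plus the reduced vectors a' = P^T a (cf. equation (12))."""
    P = np.linalg.svd(features.T, full_matrices=False)[0]  # d x d' basis
    return P, features @ P

theta = np.array([1.0, -2.0, 0.5])
A = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0]])   # spans only a 2-dim subspace of R^3
P, A_reduced = project_features(A)
theta_reduced = P.T @ theta

# The mean of every sampled arm is preserved: <theta, a> = <theta', a'>.
print(A @ theta, A_reduced @ theta_reduced)
```

Note that θ itself need not lie in the subspace; the identity only requires 𝐏𝐏^⊤ a = a for the sampled features.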
Optimal Design Criteria. In contrast to stochastic bandits, where the mean values of the arms are estimated through repeated sampling of each arm, the linear bandit setting allows these values to be inferred from accurate estimation of the underlying parameter vector θ. As a result, pulling a single arm provides information about all arms.
A key sampling strategy in this context is the G-optimal design, which minimizes the maximum variance of the predicted responses across all arms by optimizing the fraction of times each arm is selected. Formally, the G-optimal design problem seeks a probability distribution π on 𝒜, where π : 𝒜 → [0, 1] and Σ_{a ∈ 𝒜} π(a) = 1, that minimizes

g(π) := max_{a ∈ 𝒜} ‖a‖²_{V(π)^{−1}}, (13)

where V(π) := Σ_{a ∈ 𝒜} π(a) a a^⊤ is the weighted information matrix, analogous to V_t in equation (9). The G-optimal design (13) ensures a tight confidence interval for mean value estimation. However, for identifying the best arms, comparing the relative differences of the mean values across arms is more critical than estimating each mean as accurately as possible.
Therefore, we consider an alternative design criterion, the 𝒳𝒴-optimal design, which directly targets the estimation of these gaps. Consider 𝒮 ⊆ 𝒜 as a subset of the arm space. We define

𝒴(𝒮) := {a − a′ : a, a′ ∈ 𝒮, a ≠ a′} (14)

as the set of vectors representing the differences between each pair of arms in 𝒮. The 𝒳𝒴-optimal design minimizes

g_{𝒳𝒴}(π) := max_{y ∈ 𝒴(𝒜)} ‖y‖²_{V(π)^{−1}}. (15)

As mentioned previously, the 𝒳𝒴-optimal design focuses on minimizing the maximum variance when estimating the differences (gaps) between pairs of arms. By doing so, it ensures differentiation between arms, rather than estimating each arm individually. This criterion is particularly useful when the goal is to identify relative performance rather than absolute quality.
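Both criteria can be evaluated for a candidate design π (a sketch; finding the optimal π is a further convex optimization, e.g., via Frank-Wolfe, which we do not attempt here):

```python
import itertools
import numpy as np

def design_matrix(arms, pi):
    """Weighted information matrix V(pi) = sum_a pi(a) a a^T."""
    return sum(p * np.outer(a, a) for a, p in zip(arms, pi))

def g_criterion(arms, pi):
    """G-optimal objective (13): max_a ||a||^2 in the V(pi)^{-1} norm."""
    V_inv = np.linalg.inv(design_matrix(arms, pi))
    return max(float(a @ V_inv @ a) for a in arms)

def g_xy_criterion(arms, pi):
    """XY-optimal objective (15): the same quantity over arm differences."""
    V_inv = np.linalg.inv(design_matrix(arms, pi))
    diffs = [a - b for a, b in itertools.combinations(arms, 2)]
    return max(float(y @ V_inv @ y) for y in diffs)

arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.2])]
uniform = [1.0 / 3] * 3
print(g_criterion(arms, uniform), g_xy_criterion(arms, uniform))
```

By the Kiefer-Wolfowitz theorem, the optimal G-value equals d, so any design's g(π) is at least d; the 𝒳𝒴 objective instead concentrates measurement effort on directions a − a′ that separate arms.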
3 Lower Bound and Problem Complexity
In this section, we present a novel information-theoretic lower bound for the problem of identifying all ε-best arms in linear bandits. Building on the approach of Soare et al. (2014), we extend the lower bound for best arm identification (BAI) to this more general setting. Figure 1 visualizes the structure of the stopping condition, with additional graphical insights provided in Section EC.4.1. These visualizations offer geometric intuition for the challenges involved in identifying all ε-best arms in linear bandits.
Figure 1: Illustration of the Stopping Condition: Best Arm Identification vs. All ε-Best Arms Identification
Note. (a) Stopping occurs when the confidence region 𝒞_{t,δ} for the estimated parameter θ̂_t contracts entirely within one of the three decision regions at a certain time step t. The boundaries between regions are defined by the hyperplanes x^⊤(a_i − a_j) = 0. Each dot represents an arm. (b) In the case of identifying all ε-best arms, the regions overlap. (c) These overlaps partition the space into seven distinct decision regions, increasing the difficulty of identification.
The sample complexity of an algorithm is quantified by the number of samples, denoted τ_δ, required to terminate the process. The goal of the algorithm design is to minimize the expected sample complexity 𝔼_μ[τ_δ] over the class of admissible algorithms. As introduced in Kaufmann et al. (2016), for δ ∈ (0, 1), the non-asymptotic problem complexity of an instance μ can be defined as

c*(μ) := inf_{δ-PAC algorithms} 𝔼_μ[τ_δ] / log(1/(2.4δ)), (16)

which is the smallest possible constant such that the expected sample complexity 𝔼_μ[τ_δ] grows asymptotically in line with log(1/(2.4δ)). The lower bound on the sample complexity 𝔼_μ[τ_δ] can be represented in a general form by the following proposition. Building on the analytical framework of Proposition 3.1, we formulate the all ε-best arms identification problem in the linear bandit setting. This enables us to derive both lower and upper bounds on the sample complexity, thereby establishing the near-optimality of our algorithm.
Proposition 3.1 (Qin and You (2025)). For any μ ∈ ℳ, there exists a set 𝒳 := 𝒳(μ) and functions {C_x}_{x ∈ 𝒳} with C_x : 𝒮_K × ℳ → ℝ_+ such that

c*(μ) ≥ (Γ*_μ)^{−1}, (17)

where

Γ*_μ := max_{w ∈ 𝒮_K} min_{x ∈ 𝒳} C_x(w; μ). (18)
In Proposition 3.1, 𝒮_K denotes the K-dimensional probability simplex, and 𝒳 := 𝒳(μ) is referred to as the culprit set. This set comprises critical subqueries (or comparisons) that must be correctly resolved: an error in any of these comparisons may prevent identification of the correct set.
For example, in the case of identifying the best arm, the culprit set is given by 𝒳 = {i : i ∈ [K] ∖ {i*}}, where i* denotes the unique best arm, and each subquery involves distinguishing an arm from the best arm. In the threshold bandit problem, the culprit set consists of all arms, 𝒳 = {i : i ∈ [K]}, where each subquery requires accurately determining whether an arm exceeds the threshold. For the task of identifying the best m arms, the culprit set is 𝒳 = {(i, j) : i ∈ ℐ, j ∈ ℐ^c}, where ℐ represents the set of the best m arms, and each subquery entails comparing the mean of an arm in ℐ with one in the complement set ℐ^c.
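The three culprit sets just described can be enumerated directly (a small sketch; function names are ours):

```python
def culprits_bai(K, i_star):
    """Best arm identification: distinguish every other arm from i*."""
    return {i for i in range(K) if i != i_star}

def culprits_threshold(K):
    """Threshold bandits: every arm must be placed relative to the threshold."""
    return set(range(K))

def culprits_top_m(K, top_set):
    """Top-m identification: compare each top arm with each non-top arm."""
    return {(i, j) for i in top_set for j in range(K) if j not in top_set}

print(culprits_bai(4, i_star=0))  # {1, 2, 3}
print(culprits_top_m(4, top_set={0, 1}))
```

The size and shape of the culprit set is what changes across tasks; the min over culprits in (18) is always taken over this enumeration.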
The function C_x(w; μ) represents the population version of the sequential generalized likelihood ratio statistic, which provides an information-theoretic measure of how easily the subquery corresponding to the culprit x ∈ 𝒳 can be answered.
In equation (18), the inner minimum reflects the intuition that the hardest instance corresponds to the hardest subquery, while the outer maximization seeks the allocation w over arms that best addresses this subquery. For a more detailed introduction to this general pure exploration model, please refer to Section EC.2. For our setting, we establish the lower bound below.
Theorem 3.2 (Lower Bound). Consider a set of arms where the reward of arm i follows a normal distribution 𝒩(μ_i, 1) with μ_i := a_i^⊤ θ. Any δ-PAC algorithm for identifying all ε-best arms in the linear bandit setting must satisfy

inf_{δ-PAC algorithms} 𝔼_μ[τ_δ] / log(1/(2.4δ)) ≥ (Γ*_μ)^{−1} = min_{w ∈ 𝒮_K} max_{(i,j,k) ∈ 𝒳} max{ 2‖a_i − a_j‖²_{V_w^{−1}} / (a_i^⊤θ − a_j^⊤θ + ε)², 2‖a_1 − a_k‖²_{V_w^{−1}} / (a_1^⊤θ − a_k^⊤θ − ε)² }, (19)

where 𝒳 := {(i, j, k) : i ∈ G_ε(μ), j ∈ [K], k ∉ G_ε(μ)}, 𝒮_K is the K-dimensional probability simplex, and V_w := Σ_{a ∈ 𝒜} w(a) a a^⊤ is the weighted information matrix under allocation w.
The detailed proof of the above theorem is presented in Section EC.4.3. For the lower bound derivation, we assume normally distributed rewards to obtain a closed-form expression. A similar bound can be derived under sub-Gaussian rewards, though the form is less explicit.
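To build intuition for (19), the inner maximization over culprits can be evaluated for a fixed allocation w, and the outer minimization approximated by a coarse grid search on the simplex. The following is an illustrative sketch for a toy instance (the instance, grid resolution, and names are ours, and the grid search is only a crude approximation of the true minimizer):

```python
import numpy as np

def lb_objective(w, arms, theta, eps):
    """Inner maximization of (19): the cost of the hardest culprit (i, j, k)
    under allocation w (illustrative implementation)."""
    mu = [float(a @ theta) for a in arms]
    best = max(mu)
    good = [i for i, m in enumerate(mu) if m >= best - eps]
    bad = [i for i, m in enumerate(mu) if m < best - eps]
    i_star = mu.index(best)
    V_w = sum(p * np.outer(a, a) for a, p in zip(arms, w))
    V_inv = np.linalg.inv(V_w)

    def cost(y, gap):
        return 2.0 * float(y @ V_inv @ y) / gap ** 2

    terms = [cost(arms[i] - arms[j], mu[i] - mu[j] + eps)
             for i in good for j in range(len(arms)) if j != i]
    terms += [cost(arms[i_star] - arms[k], best - mu[k] - eps) for k in bad]
    return max(terms)

# Toy instance: mu = (1.0, 0.5, 0.95); with eps = 0.2, arms 0 and 2 are eps-best.
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.9, 0.1])]
theta = np.array([1.0, 0.5])

# Coarse grid over the simplex approximates the outer min over w.
grid = [np.array([p, q, 1.0 - p - q])
        for p in np.linspace(0.05, 0.9, 18)
        for q in np.linspace(0.05, 0.9, 18) if p + q <= 0.95]
approx_inverse_gamma = min(lb_objective(w, arms, theta, 0.2) for w in grid)
```

Arms whose gaps to the threshold are small dominate the max, so the minimizing allocation concentrates samples in the directions that separate those arms.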
Remark 3.3 (Generality of the Lower Bound). We note that the stochastic multi-armed bandit problem is a special case of the linear bandit problem. By setting 𝒜 = {e_1, e_2, …, e_d}, where e_i denotes the i-th standard unit vector, the linear bandit model reduces to the stochastic setting. This relationship allows us to recover the lower bound result for identifying all ε-best arms in stochastic bandits (Al Marjani et al. 2022). Furthermore, the lower bound in Theorem 3.2 extends the lower bound for best arm identification in linear bandits (Fiez et al. 2019). That result is recovered by setting ε = 0 and redefining the culprit set as 𝒳(μ) = {i : i ∈ [K] ∖ {i*}}, where i* represents the best arm in the context of best arm identification.
4 Algorithm and Upper Bound
In this section, we propose the LinFACT algorithm (Linear Fast Arm Classification with Threshold estimation) to identify all ε-best arms in linear bandits efficiently. We then establish upper bounds on the expected sample complexity to demonstrate the optimality of the LinFACT algorithm. Specifically, the upper bound derived from the 𝒳𝒴-optimal sampling policy is shown to be instance optimal up to logarithmic factors.
4.1 Algorithm
LinFACT is a phase-based, semi-adaptive algorithm in which the sampling rule remains fixed within each round and is updated only at the end of the round based on the accumulated observations. As the algorithm proceeds through rounds r, LinFACT progressively refines two sets of arms:

- G_r: arms empirically classified as ε-best (good).
- B_r: arms empirically classified as not ε-best (bad).

This classification process continues until all arms have been assigned to either G_r or B_r. Once complete, the decision rule returns G_r as the final set of ε-best arms.
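For concreteness, one plausible form of this classification step can be sketched as follows (illustrative only: the thresholds below use generic per-arm confidence bounds, not the exact bounds LinFACT maintains):

```python
def classify(lcb, ucb, eps, good, bad):
    """One round of LinFACT-style classification (illustrative sketch).

    lcb, ucb: per-arm lower/upper confidence bounds on the means.
    An arm is declared eps-best when its lower bound clears the optimistic
    best value minus eps; it is declared not eps-best when its upper bound
    falls below the pessimistic best value minus eps.
    """
    hi = max(ucb)   # optimistic value of the best arm
    lo = max(lcb)   # pessimistic value of the best arm
    for i in range(len(lcb)):
        if i in good or i in bad:
            continue
        if lcb[i] >= hi - eps:      # surely within eps of the best
            good.add(i)
        elif ucb[i] < lo - eps:     # surely more than eps below the best
            bad.add(i)
    return good, bad

# Three arms; with eps = 0.5, arm 0 is provably good, arm 2 provably bad,
# and arm 1 remains undecided for another round.
good, bad = classify(
    lcb=[0.9, 0.55, 0.1], ucb=[1.1, 0.85, 0.3], eps=0.5, good=set(), bad=set())
```

Undecided arms stay active, and the shrinking confidence widths in later rounds eventually resolve them.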
Sampling Rule. To minimize the sampling budget, we select arms that provide the maximum information about the mean values or the gaps between them. Unlike stochastic multi-armed bandits, where mean values are obtained exclusively by sampling specific arms, linear bandits allow these mean values to be inferred from the estimated parameters. In each round, arms are selected based on the G-optimal design (13) or the 𝒳𝒴-optimal design (15).
For the G-optimal design, LinFACT-G refines an estimate of the true parameter θ and uses this estimate to maintain an anytime confidence interval, such that for each arm's empirical mean value μ̂_i, we have

ℙ(∀ i ∈ 𝒜_I(r−1), ∀ r ∈ ℕ: |μ̂_i(r) − μ_i| ≤ C_{δ/K}(r)) ≥ 1 − δ. (20)

The active set 𝒜(r) is defined as the set of uneliminated arms, as we continuously eliminate arms as the round r progresses; 𝒜_I(r) denotes the set of indices corresponding to 𝒜(r).

This confidence bound indicates that the algorithm maintains a probabilistic guarantee that the true mean value μ_i is within a certain range of the estimated mean value μ̂_i for each arm i, uniformly over all rounds. The bound shrinks as more data is collected (since the confidence radius C_{δ/K}(r) decreases with more samples), thereby reducing uncertainty. The anytime confidence width C_{δ/K}(r) is maintained by the design of the sample budget in each round. We set C_{δ/K}(r) := 2^{−r} =: ε_r, which is halved with each iteration of the rounds.
In LinFACT-G, the initial budget allocation policy is based on the G-optimal design and is defined as follows:

$$N_i(r) = \left\lceil \frac{2 d\, \lambda_i(r)}{\epsilon_r^2} \log\!\left(\frac{2 K r (r+1)}{\delta}\right) \right\rceil, \qquad N_r = \sum_{i \in \mathcal{X}(r-1)} N_i(r), \tag{21}$$

where $N_r$ denotes the total sampling budget allocated in round $r$, and $\lambda_r$ is the selection probability distribution over the remaining active arms $\mathcal{X}(r-1)$ from the previous round, obtained via the G-optimal design as defined in equation (13). The sampling procedure for each round $r$ is described in Algorithm 1.
Algorithm 1 Subroutine: G-Optimal Sampling
1: Input: projected active set $\mathcal{X}(r-1)$, round $r$, $\delta$.
2: Obtain $\lambda_r \in \mathcal{P}(\mathcal{X}(r-1))$ with support size $\mathrm{Supp}(\lambda_r) \le d(d+1)/2$ according to equation (13).
3: for all $i \in \mathcal{X}(r-1)$ do ▷ Sampling
4:   Sample arm $i$ for $N_i(r)$ times in round $r$, as specified in equation (21).
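The G-optimal design in step 2 can be approximated numerically. The following minimal sketch (our own illustration, not the paper's implementation; the Frank-Wolfe solver and all function names are assumptions) computes an approximate G-optimal design over the active arms and the per-arm budgets of equation (21):

```python
import numpy as np

def g_optimal_design(X, iters=2000):
    """Frank-Wolfe approximation of the G-optimal design over the rows of X.

    By the Kiefer-Wolfowitz theorem, the optimal design lam satisfies
    max_i x_i^T A(lam)^{-1} x_i = d, so the max leverage approaches d.
    """
    K, d = X.shape
    lam = np.full(K, 1.0 / K)
    for t in range(iters):
        A = X.T @ (lam[:, None] * X)               # information matrix A(lam)
        A_inv = np.linalg.pinv(A)
        g = np.einsum("ij,jk,ik->i", X, A_inv, X)  # leverage x_i^T A^{-1} x_i
        i = int(np.argmax(g))                      # most uncertain arm direction
        gamma = 1.0 / (t + 2)                      # standard FW step size
        lam = (1 - gamma) * lam
        lam[i] += gamma
    return lam

def round_budget(lam, d, eps_r, K, r, delta):
    """Per-arm sample counts N_i(r) of equation (21)."""
    log_term = np.log(2 * K * r * (r + 1) / delta)
    return np.ceil(2 * d * lam / eps_r**2 * log_term).astype(int)
```

With enough iterations the maximum leverage approaches the dimension $d$, which is how one can sanity-check the returned design.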
We also adopt the $\mathcal{XY}$-optimal design because the G-optimal design is less effective at distinguishing between arms. While the G-optimal design minimizes the maximum variance of the individual arm-mean estimates, it does not explicitly target the pairwise gaps that are critical for identifying good arms. In contrast, the $\mathcal{XY}$-optimal design is tailored to directly reduce the uncertainty in estimating these gaps.
We now introduce a sampling rule based on the $\mathcal{XY}$-optimal design, as defined in equation (15). Let $q(\zeta)$ account for the error introduced by the rounding procedure, where $\zeta$ is its approximation parameter. We have:

$$N_r = \max\left\{ \left\lceil \frac{2\, \rho^{\mathcal{XY}}\!\big(\lambda(\mathcal{X}(r-1))\big)\,(1+\zeta)}{\epsilon_r^2} \log\!\left(\frac{2 K (K-1) r (r+1)}{\delta}\right) \right\rceil,\ q(\zeta) \right\}, \qquad N_i(r) = \mathrm{Round}(\lambda_r, N_r). \tag{22}$$
In contrast to the G-optimal design, the $\mathcal{XY}$-optimal design focuses on bounding the confidence region of the pairwise differences between arms. The following inequality characterizes the corresponding high-probability event:

$$\mathbb{P}\left(\forall i \in \mathcal{X}_I(r-1),\ \forall j \in \mathcal{X}_I(r-1),\ i \ne j,\ \forall r,\ \left|\big(\hat{\mu}_i(r) - \hat{\mu}_j(r)\big) - \big(\mu_i - \mu_j\big)\right| \le 2\, C_{\delta/K}(r)\right) \ge 1 - \delta. \tag{23}$$
The rounding operation, denoted Round, uses the $(1+\zeta)$-approximation algorithm proposed by Allen-Zhu et al. (2017). The complete sampling procedure is outlined in Algorithm 2.
Algorithm 2 Subroutine: $\mathcal{XY}$-Optimal Sampling
1: Input: projected active set $\mathcal{X}(r-1)$, round $r$, $\delta$.
2: Obtain $\lambda_r \in \mathcal{P}(\mathcal{X}(r-1))$ according to equation (15).
3: for all $i \in \mathcal{X}(r-1)$ do ▷ Sampling
4:   Sample arm $i$ for $N_i(r)$ times in round $r$, as specified in equation (22).
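To make the pairwise-gap objective concrete, the sketch below (illustrative only; the solver is a simple Frank-Wolfe-style heuristic of our own choosing, not the paper's rounding-based procedure) evaluates $\rho^{\mathcal{XY}}(\lambda) = \max_{i \ne j} \|x_i - x_j\|^2_{A(\lambda)^{-1}}$ and iteratively adds mass to the arm that most reduces the variance of the current worst gap direction:

```python
import numpy as np
from itertools import combinations

def xy_value(X, lam):
    """rho^XY(lam): worst-case variance over all pairwise gap directions."""
    A_inv = np.linalg.pinv(X.T @ (lam[:, None] * X))
    worst, pair = -np.inf, None
    for i, j in combinations(range(len(X)), 2):
        y = X[i] - X[j]
        v = float(y @ A_inv @ y)
        if v > worst:
            worst, pair = v, (i, j)
    return worst, pair

def xy_optimal_design(X, iters=1000):
    """Frank-Wolfe-style minimization of rho^XY over the design simplex."""
    K = len(X)
    lam = np.full(K, 1.0 / K)
    for t in range(iters):
        A_inv = np.linalg.pinv(X.T @ (lam[:, None] * X))
        _, (i, j) = xy_value(X, lam)
        g = A_inv @ (X[i] - X[j])
        # d/d lam_k ||x_i - x_j||^2_{A^{-1}} = -(x_k^T g)^2: pick the steepest arm.
        k = int(np.argmax((X @ g) ** 2))
        gamma = 1.0 / (t + 2)
        lam = (1 - gamma) * lam
        lam[k] += gamma
    return lam
```

A useful sanity check is that the optimized design attains a $\rho^{\mathcal{XY}}$ value no worse than the uniform design.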
Estimation. At the end of each round, after drawing $N_i(r)$ samples from the active set, we compute the empirical estimate of the parameter using standard ordinary least squares (OLS):

$$\hat{\beta}_r = V_r^{-1} \sum_{t=1}^{N_r} x_{A_t}\, y_t, \tag{24}$$

where $V_r = \sum_{i \in \mathcal{X}(r-1)} N_i(r)\, x_i x_i^\top$ is the information matrix. The estimator for the mean value of each arm $i \in \mathcal{X}_I(r-1)$ is then

$$\hat{\mu}_i = \hat{\mu}_i(r) = x_i^\top \hat{\beta}_r. \tag{25}$$
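The round-level estimation in (24)-(25) can be sketched as follows (a minimal illustration with assumed data layout: one reward array per active arm; function names are ours):

```python
import numpy as np

def ols_round_estimate(X_active, counts, rewards_per_arm):
    """OLS estimate of beta and the arm means from one round of sampling.

    X_active        : (n, d) feature matrix of the active arms
    counts          : N_i(r), number of pulls of each active arm
    rewards_per_arm : list of 1-D arrays of observed rewards, one per arm
    """
    d = X_active.shape[1]
    V = np.zeros((d, d))                  # information matrix V_r
    s = np.zeros(d)                       # sum of x_{A_t} y_t over all pulls
    for x, n, ys in zip(X_active, counts, rewards_per_arm):
        V += n * np.outer(x, x)
        s += x * np.sum(ys)
    beta_hat = np.linalg.solve(V, s)      # beta_hat_r = V_r^{-1} sum x y  (24)
    mu_hat = X_active @ beta_hat          # mu_hat_i = x_i^T beta_hat_r    (25)
    return beta_hat, mu_hat
```

With noiseless rewards the estimate recovers the true parameter exactly, provided the pulled arms span the feature space.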
Stopping Rule and Decision Rule. As round $r$ progresses, LinFACT dynamically updates two sets of arms, $G_r$ and $B_r$, representing arms that are empirically considered $\epsilon$-best (good) and those that are not (bad), respectively. The algorithm filters arms by maintaining an upper confidence bound $U_r$ and a lower confidence bound $L_r$ on the unknown threshold $\mu_1 - \epsilon$, along with individual upper and lower confidence bounds for each arm.

The stopping rule and final decision procedure are described in Algorithm 3. For each arm $i$ in the active set, LinFACT eliminates the arm if its upper confidence bound falls below the threshold lower bound $L_r$ (line 6). Conversely, if the lower confidence bound of an arm exceeds $U_r$ (line 8), the arm is added to $G_r$. Additionally, any arm already in $G_r$ is removed from the active set if its upper confidence bound falls below the largest empirical lower bound among all active arms (line 10); this ensures that the best arm is always retained in the active set, which is necessary for estimating the threshold $\mu_1 - \epsilon$. The classification process continues until all arms are categorized, that is, when $G_r \cup B_r = [K]$. At termination, the set $G_r$ is returned as the output of LinFACT, representing the arms identified as $\epsilon$-best.
Algorithm 3 Subroutine: Stopping Rule and Decision Rule
1: Input: projected active set $\mathcal{X}_I(r-1)$, estimator $(\hat{\mu}_i(r))_{i \in \mathcal{X}_I(r-1)}$, round $r$, $\epsilon$, confidence radius $C_{\delta/K}(r)$.
2: Let $U_r = \max_{i \in \mathcal{X}_I(r-1)} \hat{\mu}_i + C_{\delta/K}(r) - \epsilon$.
3: Let $L_r = \max_{i \in \mathcal{X}_I(r-1)} \hat{\mu}_i - C_{\delta/K}(r) - \epsilon$.
4: for all $i \in \mathcal{X}_I(r-1)$ do ▷ Arm Classification and Elimination
5:   if $\hat{\mu}_i + C_{\delta/K}(r) < L_r$ then
6:     Add $i$ to $B_r$ and eliminate $i$ from $\mathcal{X}_I(r-1)$.
7:   if $\hat{\mu}_i - C_{\delta/K}(r) > U_r$ then
8:     Add $i$ to $G_r$.
9:   if $i \in G_r$ and $\hat{\mu}_i + C_{\delta/K}(r) \le \max_{j \in \mathcal{X}_I(r-1)} \hat{\mu}_j - C_{\delta/K}(r)$ then
10:    Eliminate $i$ from $\mathcal{X}_I(r-1)$.
11: if $G_r \cup B_r = [K]$ then ▷ Stopping Condition and Recommendation
12:   Output: the set $G_r$.
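One pass of this classification logic can be sketched as follows (an illustrative implementation under assumed bookkeeping: a dict of empirical means and Python sets for $G_r$/$B_r$; the function name is ours):

```python
def classify_round(mu_hat, C, eps, active, good, bad, K):
    """One pass of the stopping/decision subroutine (Algorithm 3).

    mu_hat : dict arm index -> empirical mean; C : radius C_{delta/K}(r);
    active : iterable of still-active arm indices. Mutates good/bad in place
    and returns (new_active, finished).
    """
    U = max(mu_hat[i] + C for i in active) - eps   # U_r, upper bound on mu_1 - eps
    L = max(mu_hat[i] - C for i in active) - eps   # L_r, lower bound on mu_1 - eps
    best_lcb = L + eps                             # largest lower bound among active arms
    new_active = []
    for i in active:
        if mu_hat[i] + C < L:                      # line 6: provably not eps-best
            bad.add(i)
            continue
        if mu_hat[i] - C > U:                      # line 8: provably eps-best
            good.add(i)
        if i in good and mu_hat[i] + C <= best_lcb:  # line 10: dominated good arm
            continue
        new_active.append(i)
    finished = (good | bad) == set(range(K))       # stopping condition G_r u B_r = [K]
    return new_active, finished
```

For example, with means $\{1.0, 0.95, 0.2\}$, radius $0.05$, and $\epsilon = 0.3$, the first two arms are certified good and the third bad, so the stopping condition fires.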
The Complete LinFACT Algorithm. The complete LinFACT algorithm is presented in Algorithm 4. The procedure proceeds as follows: based on the collected data, the decision-maker updates the parameter estimates and checks whether the stopping condition is satisfied. If so, the set $G_r$ is returned as the estimated set of all $\epsilon$-best arms; otherwise, the process continues with further sampling and updates.
Algorithm 4 LinFACT Algorithm
1: Input: $\epsilon$, $\delta$, bandit instance.
2: Initialize $G_0 = \emptyset$, the set of good arms, and $B_0 = \emptyset$, the set of bad arms.
3: Initialize the active set $\mathcal{X}(0) = \mathcal{X}$, and $\mathcal{X}_I(0) = [K]$.
4: for $r = 1, 2, \ldots$ do
5:   Set $C_{\delta/K}(r) = \epsilon_r = 2^{-r}$.
6:   Set $G_r = G_{r-1}$ and $B_r = B_{r-1}$.
7:   Project $\mathcal{X}(r-1)$ to the $d_r$-dimensional subspace that $\mathcal{X}(r-1)$ spans. ▷ Projection
8:   if using G-optimal sampling then ▷ Sampling
9:     Call Algorithm 1.
10:  else if using $\mathcal{XY}$-optimal sampling then
11:    Call Algorithm 2.
12:  Estimate $(\hat{\mu}_i(r))_{i \in \mathcal{X}_I(r-1)}$ using equations (24) and (25). ▷ Estimation
13:  Call Algorithm 3. ▷ Stopping Condition and Decision Rule

4.2 Upper Bounds of the LinFACT Algorithm
Theorems 4.1 and 4.2 establish upper bounds on the sample complexity of the proposed LinFACT algorithm. Let $T^G$ and $T^{\mathcal{XY}}$ denote the number of samples required under the G-optimal and $\mathcal{XY}$-optimal designs, respectively. The formal statements of these theorems are given below.
Theorem 4.1 (Upper Bound, G-Optimal Design)
For $\Delta = \min(\alpha_\epsilon, \beta_\epsilon)/16$, there exists an event $\mathcal{E}$ such that $\mathbb{P}(\mathcal{E}) \ge 1 - \delta$. On this event, the LinFACT algorithm with the G-optimal sampling policy achieves an expected sample complexity upper bound given by

$$\mathbb{E}\left[T^G \mid \mathcal{E}\right] = \mathcal{O}\!\left( d\, \Delta^{-2} \log\!\left(\frac{K}{\delta} \log_2(\Delta^{-2})\right) + d^2 \log(\Delta^{-1}) \right). \tag{26}$$
The detailed proof of Theorem 4.1 is presented in Section EC.5. However, the LinFACT algorithm based on the G-optimal design does not yield an upper bound that matches the lower bound; this limitation is discussed further in Section EC.6. In contrast, we will show that the algorithm using the $\mathcal{XY}$-optimal design achieves an upper bound that matches the lower bound up to a logarithmic factor.
Theorem 4.2 (Upper Bound, $\mathcal{XY}$-Optimal Design)
Assume that the instance of arms satisfies $\min_{i \in G_\epsilon \setminus \{1\}} \|x_1 - x_i\|_2^2 \ge \delta^2$ and $\max_{i \in [K]} |\mu_1 - \epsilon - \mu_i| \le 2$. There exists an event $\mathcal{E}$ such that $\mathbb{P}(\mathcal{E}) \ge 1 - \delta$. On this event, the LinFACT algorithm with the $\mathcal{XY}$-optimal sampling policy achieves an expected sample complexity upper bound given by

$$\mathbb{E}\left[T^{\mathcal{XY}} \mid \mathcal{E}\right] = \mathcal{O}\!\left( (\Gamma^*)^{-1}\, \epsilon\, \Delta^{-1} \log(\Delta^{-1}) \log\!\left(\frac{K}{\delta} \log(\Delta^{-2})\right) + d^2 \log(\Delta^{-1}) \right), \tag{27}$$

where $\Delta = \min(\alpha_\epsilon, \beta_\epsilon)/16$ is the minimum gap of the problem instance, $r_{\mathrm{upper}} = \max\{\lceil \log_2 \frac{4}{\alpha_\epsilon} \rceil, \lceil \log_2 \frac{4}{\beta_\epsilon} \rceil\}$, and $(\Gamma^*)^{-1}$ is the lower bound term defined in Theorem 3.2.
The proof of the near-optimal upper bound in Theorem 4.2 is presented in Section EC.7 of the online appendix, where we make explicit how the lower bound helps establish our upper bound.
5 Model Misspecification
In this section, we address the challenge of model misspecification, recognizing that real-world problems may deviate from perfect linearity. To account for such deviations, we propose an orthogonal parameterization-based algorithm, LinFACT-MIS, a refined version of LinFACT designed for the misspecified setting. We establish new upper bounds under misspecification and provide insights into how such deviations affect algorithm performance.
Under model misspecification, we refine the linear model in equation (3) as

$$y_t = x_{A_t}^\top \beta + \eta_t + \Delta^m(x_{A_t}), \tag{28}$$

where $\Delta^m : \mathbb{R}^d \to \mathbb{R}$ is a misspecification function quantifying the deviation from the true linear model.
Assumption 5. Assume $\|\mu\|_\infty \le L_\mu$ and $\|\boldsymbol{\Delta}^m\|_\infty \le L_m$, where $\|\cdot\|_\infty$ denotes the infinity norm and the bold $K$-dimensional vector $\boldsymbol{\Delta}^m$ collects the bias terms of the misspecified model.
Therefore, with this assumption, the set of realizable models is defined as

$$\mathbb{M} = \left\{ \mu \in \mathbb{R}^K \;\middle|\; \exists \beta \in \mathbb{R}^d,\ \exists \boldsymbol{\Delta}^m \in \mathbb{R}^K,\ \mu = X\beta + \boldsymbol{\Delta}^m \ \wedge\ \|\mu\|_\infty \le L_\mu \ \wedge\ \|\boldsymbol{\Delta}^m\|_\infty \le L_m \right\}. \tag{29}$$
The key distinction in the analysis under model misspecification lies in how the estimator $\hat{\mu}_t$ is maintained. Specifically, we construct this estimator by projecting the empirical mean vector $\tilde{\mu}_t$ at time $t$ onto the set of realizable models $\mathbb{M}$ via the following optimization:

$$\hat{\mu}_t \in \arg\min_{\mu \in \mathbb{M}} \left\| \mu - \tilde{\mu}_t \right\|^2_{\mathcal{D}_{N_t}}, \tag{30}$$

where $N_t = [N_{t,1}, N_{t,2}, \ldots, N_{t,K}]^\top \in \mathbb{R}^K$ is the vector of sample counts for each arm at time $t$, and $\mathcal{D}_{N_t} \in \mathbb{R}^{K \times K}$ is the diagonal matrix with $N_{t,1}, N_{t,2}, \ldots, N_{t,K}$ as its diagonal entries.
Figure 2:Difference Between Standard OLS and Misspecification-Adjusted Projection Estimates
Note. The left diagram shows the projection onto the span of pulled arms under a perfect linear model. The right diagram depicts the adjustment required under misspecification, where the projection must account for the deviation.
In the absence of misspecification, this projection reduces to the ordinary least squares (OLS) estimator. However, as shown in Figure 2, under model misspecification the estimator can no longer be computed as a simple projection onto a hyperplane and falls outside the scope of standard OLS. Instead, it must be obtained from the optimization problem in equation (30), which minimizes a weighted quadratic objective over $K + d$ variables, subject to the constraints $\|\mu\|_\infty \le L_\mu$ and $\|\boldsymbol{\Delta}^m\|_\infty \le L_m$.
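A simple way to approximate this projection is projected gradient descent, since the objective is convex and the box constraint on the bias vector is easy to project onto. The sketch below is our own illustration under a simplifying assumption: it enforces only the $\|\boldsymbol{\Delta}^m\|_\infty \le L_m$ constraint and omits the additional $\|\mu\|_\infty \le L_\mu$ bound of the realizable set.

```python
import numpy as np

def project_to_realizable(mu_tilde, X, counts, L_m, steps=3000):
    """Projected-gradient sketch of the estimator in (30).

    Minimizes sum_i N_i (x_i^T beta + Delta_i - mu_tilde_i)^2 over beta and
    Delta subject to ||Delta||_inf <= L_m (the ||mu||_inf constraint of M
    is omitted in this simplified sketch).
    """
    K, d = X.shape
    w = np.asarray(counts, dtype=float)                 # diagonal of D_{N_t}
    beta, delta = np.zeros(d), np.zeros(K)
    # Step size 1/L, with L an upper bound on the gradient's Lipschitz constant.
    lr = 1.0 / (w.max() * (np.linalg.norm(X, 2) ** 2 + 1.0))
    for _ in range(steps):
        resid = w * (X @ beta + delta - mu_tilde)       # weighted residual
        beta = beta - lr * (X.T @ resid)                # gradient step in beta
        delta = np.clip(delta - lr * resid, -L_m, L_m)  # step + box projection
    return X @ beta + delta, beta, delta
```

Because the step size is chosen from the Lipschitz constant, the weighted loss decreases monotonically and the returned bias vector always respects the box constraint.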
5.1 Upper Bound with Misspecification
In this subsection, we present the upper bound on the expected sample complexity of LinFACT-MIS, highlighting how misspecification affects the theoretical performance of the algorithm. Let $T^G_{\mathrm{mis}}$ denote the number of samples taken under model misspecification.
Theorem 5.1 (Upper Bound, Misspecification)
Fix $\epsilon > 0$ and suppose that the magnitude of misspecification satisfies $L_m < \min\{\frac{\alpha_\epsilon}{2\sqrt{d}}, \frac{\beta_\epsilon}{2\sqrt{d}}\}$. For $\Delta = \min(\alpha_\epsilon - 2 L_m \sqrt{d},\ \beta_\epsilon - 2 L_m \sqrt{d})/16$, there exists an event $\mathcal{E}$ such that $\mathbb{P}(\mathcal{E}) \ge 1 - \delta$. On this event, LinFACT-MIS terminates and returns the correct solution with an expected sample complexity upper bound given by

$$\mathbb{E}_\mu\left[T^G_{\mathrm{mis}} \mid \mathcal{E}\right] = \mathcal{O}\!\left( d\, \Delta^{-2} \log\!\left(\frac{K}{\delta} \log(\Delta^{-2})\right) + d^2 \log(\Delta^{-1}) \right). \tag{31}$$
The proof of this theorem is provided in Section EC.8. The upper bound in Theorem 4.1, which assumes no model misspecification, can be viewed as a special case of Theorem 5.1 with $L_m = 0$. However, the bound in Theorem 5.1 becomes invalid when the misspecification magnitude $L_m$ is too large, as the logarithmic terms may involve negative arguments, violating the assumptions required for the bound to hold. If the variation in confidence radius across arms due to misspecification is not accounted for, Theorem 5.1 indicates that the sample complexity increases: compared to Theorem 4.1, the bound is looser because the gap terms defining $\Delta$ shrink from $\alpha_\epsilon$ and $\beta_\epsilon$ to $\alpha_\epsilon - 2 L_m \sqrt{d}$ and $\beta_\epsilon - 2 L_m \sqrt{d}$, respectively.
5.2 Orthogonal Parameterization
In this section, we present an alternative version of LinFACT-MIS based on orthogonal parameterization, designed to improve computational efficiency.
Figure 3: Orthogonal Parameterization and Projection.
Note. The estimator $\hat{\mu}_t$ is obtained from the empirical mean $\tilde{\mu}_t$ by solving an optimization problem. While the true mean vector $\mu$ can be expressed as the sum of a linear component $X\beta$ and a non-linear model deviation $\boldsymbol{\Delta}^m$, it can also be decomposed at each time step $t$ via orthogonal projection into a linear part $X\beta_t$ on the hyperplane and a residual term $\boldsymbol{\Delta}^m(t)$ orthogonal to it.
Orthogonal Parameterization. Under model misspecification, traditional confidence bounds for the mean estimator based on $\|\hat{\beta}_t - \beta\|^2_{V_t}$, derived using either martingale-based methods (Abbasi-Yadkori et al. 2011) or covering arguments (Lattimore and Szepesvári 2020), are no longer directly applicable due to the presence of an additional misspecification term. To improve the concentration of the estimator in this setting, a key strategy is to adopt an orthogonal parameterization of the mean vectors within the realizable set $\mathbb{M}$ (Réda et al. 2021).

Rather than centering the confidence region around the true parameter $\beta$, we focus on the quantity $\|\hat{\beta}_t - \beta_t\|^2_{V_t}$, where $\beta_t$ is the orthogonal projection of the true mean vector onto the feature space spanned by the pulled arms at time $t$. This $\beta_t$-centered form corresponds to a self-normalized martingale and thus satisfies the same concentration bounds as in the classical linear bandit setting without misspecification. This approach offers an advantage over prior methods (Lattimore et al. 2020, Zanette et al. 2020), which require inflating the confidence region between $\hat{\beta}_t$ and $\beta$ by a term of order $L_m^2 t$, leading to overly conservative bounds in misspecified settings where $L_m \gg 0$.
Specifically, we show that any mean vector $\mu = X\beta + \boldsymbol{\Delta}^m$ can be equivalently expressed at any time $t$ as $\mu = X\beta_t + \boldsymbol{\Delta}^m(t)$, where

$$\beta_t = \left(X_{N_t}^\top X_{N_t}\right)^{-1} X_{N_t}^\top \mathcal{D}_{N_t}^{1/2}\, \mu = V_t^{-1} \sum_{s=1}^{t} x_{A_s}\, \mu_{A_s} \tag{32}$$

is the orthogonal projection of $\mu$ onto the feature space spanned by the columns of $X_{N_t}$, and $\boldsymbol{\Delta}^m(t) = \mu - X\beta_t$ is the residual. Here, $X_{N_t} = \mathcal{D}_{N_t}^{1/2} X$ is the matrix of feature vectors weighted by the number of times each arm has been sampled up to time $t$, and $\mathcal{D}_{N_t}^{1/2}$ is a diagonal matrix whose entries are the square roots of the sample counts of each arm.
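The decomposition in (32) can be illustrated numerically (a minimal sketch with names of our own choosing): the weighted residual is, by construction, orthogonal to the weighted feature columns, and it vanishes entirely when the mean vector is exactly linear.

```python
import numpy as np

def orthogonal_parameterization(mu, X, counts):
    """Decompose mu = X beta_t + Delta_t per equation (32).

    beta_t is the count-weighted least-squares projection of mu onto the span
    of the pulled features; Delta_t = mu - X beta_t is the residual term.
    """
    w = np.sqrt(np.asarray(counts, dtype=float))   # diagonal of D_{N_t}^{1/2}
    Xw = w[:, None] * X                            # X_{N_t} = D_{N_t}^{1/2} X
    beta_t = np.linalg.pinv(Xw.T @ Xw) @ (Xw.T @ (w * mu))
    delta_t = mu - X @ beta_t                      # residual Delta^m(t)
    return beta_t, delta_t
```

Checking that $X_{N_t}^\top \mathcal{D}_{N_t}^{1/2} \boldsymbol{\Delta}^m(t) = 0$ confirms the orthogonality that makes the $\beta_t$-centered concentration argument go through.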
Upper Bound. For LinFACT-MIS, the orthogonal parameterization involves replacing the sampling rule of Algorithm 1 with that of Algorithm 5. The estimation is no longer based on OLS; instead, the estimator $(\hat{\mu}_i(r))_{i \in \mathcal{X}_I(r-1)}$ is obtained from the observed data by directly solving the optimization problem in equation (30).

When model misspecification is accounted for and orthogonal parameterization is applied, the sampling policy is given by:

$$N_i(r) = \left\lceil \frac{8 d\, \lambda_i(r)}{\epsilon_r^2} \left( d \log 6 + \log\!\left(\frac{K r (r+1)}{\delta}\right) \right) \right\rceil, \qquad N_r = \sum_{i \in \mathcal{X}(r-1)} N_i(r). \tag{33}$$
Let $T^G_{\mathrm{op}}$ denote the total number of samples required when orthogonal parameterization is used. The corresponding upper bound is stated in the theorem below.
Algorithm 5 Subroutine: Sampling With Orthogonal Parameterization
1: Input: projected active set $\mathcal{X}_I(r-1)$, round $r$, $\delta$.
2: Find the G-optimal design $\lambda_r \in \mathcal{P}(\mathcal{X}(r-1))$ with $\mathrm{Supp}(\lambda_r) \le d(d+1)/2$ according to equation (13).
3: for all $i \in \mathcal{X}_I(r-1)$ do ▷ Sampling
4:   Sample arm $i$ for $N_i(r)$ times in round $r$, as specified in equation (33).

Theorem 5.2 (Upper Bound, Orthogonal Parameterization)
Fix $\epsilon > 0$ and suppose that the magnitude of misspecification satisfies $L_m < \min\{\frac{\alpha_\epsilon}{2(\sqrt{d}+2)}, \frac{\beta_\epsilon}{2(\sqrt{d}+2)}\}$. For $\Delta = \min(\alpha_\epsilon - 2 L_m (\sqrt{d}+2),\ \beta_\epsilon - 2 L_m (\sqrt{d}+2))/16$, there exists an event $\mathcal{E}$ such that $\mathbb{P}(\mathcal{E}) \ge 1 - \delta$. On this event, LinFACT-MIS terminates and returns the correct solution with an expected sample complexity upper bound given by

$$\mathbb{E}_\mu\left[T^G_{\mathrm{op}} \mid \mathcal{E}\right] = \mathcal{O}\!\left( d\, \Delta^{-2} \log\!\left(\frac{K \cdot 6^d}{\delta} \log(\Delta^{-1})\right) + d^2 \log(\Delta^{-1}) \right). \tag{34}$$
This theorem establishes that the upper bound remains of the same order as in Theorem 5.1; the detailed proof is provided in Section EC.9. As in Theorem 5.1, the use of orthogonal parameterization does not eliminate the inflation of the upper bound, which remains unavoidable. We provide further intuition in Section 5.3, arguing that without prior knowledge of the misspecification, full performance cannot be recovered through algorithmic refinement alone.
5.3 Insights for Model Misspecification

Lower Bounds in Linear and Stochastic Settings. The lower bound becomes equivalent to the unstructured lower bound as soon as the misspecification upper bound satisfies $L_m \ge L_0$, where $L_0$ is an instance-dependent finite constant. This observation is formalized in Proposition 5.3, whose proof follows the same logic as Lemma 2 in Réda et al. (2021).
Proposition 5.3
There exists $L_0 \in \mathbb{R}$ with $L_0 \le \max_i \mu_i - \min_i \mu_i$ such that if $L_m \ge L_0$, then for any pure exploration task, the lower bound in the linear setting is equal to the unstructured lower bound.
Improvement with Unknown Misspecification Is Not Possible. Knowing that a problem is misspecified without access to an upper bound $L_m$ on $\|\boldsymbol{\Delta}^m\|_\infty$ is effectively equivalent to having no structural knowledge of the problem. As a result, improving algorithmic performance under such unknown model misspecification is infeasible. In particular, as shown in cumulative regret settings, sublinear regret guarantees are no longer achievable (Ghosh et al. 2017, Lattimore et al. 2020); similarly, in pure exploration, the theoretical lower bound cannot be attained.
Prior Knowledge of the Misspecification. When the upper bound $L_m$ on the model misspecification is known in advance, LinFACT-MIS can be modified to account for this deviation. Specifically, we adjust the confidence radius $C_{\delta/K}(r)$ used in computing the lower and upper bounds $L_r$ and $U_r$ in Algorithm 3. With this modification, the number of rounds required to complete classification under misspecification, $r'_{\mathrm{upper}}$ and $r''_{\mathrm{upper}}$, coincides with $r_{\mathrm{upper}}$, the corresponding number of rounds under a perfectly linear model. This is achieved by replacing the confidence radius $\epsilon_r$ with the inflated version $\epsilon_r + L_m \sqrt{d}$ to compensate for the worst-case deviation due to misspecification. This adjustment preserves the validity of the original analysis and ensures that the same theoretical guarantees are retained. The following proposition formalizes this observation and is proved in Section EC.10.
Proposition 5.4
Suppose the misspecification magnitude $L_m \ge 0$ is known in advance. Adjusting the confidence radius $C_{\delta/K}(r)$ in Algorithm 3 at each round $r$ from its original value $\epsilon_r$ to

$$\epsilon_r' = \epsilon_r + L_m \sqrt{d} \tag{35}$$

ensures that the total number of rounds required under misspecification matches that under perfect linearity.
6 Generalized Linear Model

In this section, we extend linear bandits to the generalized linear model (GLM). In this setting, the reward no longer follows the standard linear form in equation (3), but instead satisfies

$$\mathbb{E}\left[y_t \mid A_t\right] = \mu_{\mathrm{link}}\!\left(x_{A_t}^\top \beta\right), \tag{36}$$

where $\mu_{\mathrm{link}} : \mathbb{R} \to \mathbb{R}$ is the inverse link function. GLMs encompass a class of models that includes, but is not limited to, linear models, allowing for various reward distributions beyond the Gaussian. For example, for binary-valued rewards, a suitable choice is the sigmoid function $\mu_{\mathrm{link}}(x) = \exp(x)/(1+\exp(x))$, leading to the logistic regression model; for integer-valued rewards, $\mu_{\mathrm{link}}(x) = \exp(x)$ leads to the Poisson regression model.
To keep this paper self-contained, we briefly review the main properties of GLMs (McCullagh 2019). A univariate probability distribution belongs to a canonical exponential family if its density with respect to a reference measure is given by

$$p_\theta(x) = \exp\!\left( x\theta - b(\theta) + c(x) \right), \tag{37}$$

where $\theta$ is a real parameter, $c(\cdot)$ is a real function, and the normalization function $b(\cdot)$ is assumed to be twice continuously differentiable. This family includes the Gaussian and Gamma distributions when the reference measure is the Lebesgue measure, and the Poisson and Bernoulli distributions when the reference measure is the counting measure on the integers. For a random variable $X$ with density defined in (37), $\mathbb{E}(X) = \dot{b}(\theta)$ and $\mathrm{Var}(X) = \ddot{b}(\theta)$, where $\dot{b}$ and $\ddot{b}$ denote the first and second derivatives of $b$, respectively. Since the variance is always positive and $\mu_{\mathrm{link}} = \dot{b}$ is the inverse link function, $b$ is strictly convex and $\mu_{\mathrm{link}}$ is increasing.
The canonical GLM assumes that $p_\beta(y \mid x_i) = p_{x_i^\top \beta}(y)$ for all arms $i$. The maximum likelihood estimator $\hat{\beta}_t$, based on the $\sigma$-algebra $\mathcal{F}_t = \sigma(A_1, y_1, A_2, y_2, \ldots, A_t, y_t)$, is defined as the maximizer of the function

$$\sum_{s=1}^{t} \log p_\beta(y_s \mid x_{A_s}) = \sum_{s=1}^{t} \left( y_s\, x_{A_s}^\top \beta - b(x_{A_s}^\top \beta) + c(y_s) \right). \tag{38}$$
This function is strictly concave in $\beta$. By differentiating, we obtain that $\hat{\beta}_t$ is the unique solution of the following estimating equation at time $t$:

$$\sum_{s=1}^{t} \left( y_s - \mu_{\mathrm{link}}(x_{A_s}^\top \beta) \right) x_{A_s} = 0. \tag{39}$$
In practice, while equation (39) has no closed-form solution, it can be solved efficiently using methods such as iteratively reweighted least squares (IRLS) (Wolke and Schwetlick 1988), which employs Newton's method. Here, $\tilde{\beta}_t$ is a convex combination of $\beta$ and its maximum likelihood estimate $\hat{\beta}_t$ at time $t$. The existence of $c_{\min}$ can be ensured by performing forced exploration at the beginning of the algorithm, incurring a sampling cost of $\mathcal{O}(d)$ (Kveton et al. 2023).
Assumption 6. The derivative of the inverse link function is bounded below, i.e., $c_{\min} \le \dot{\mu}_{\mathrm{link}}(x^\top \tilde{\beta}_t)$, for some $c_{\min} \in \mathbb{R}_+$ and all arms.

Assumption 6 is standard in the GLM literature (Li et al. 2017, Azizi et al. 2021b), ensuring that the reward function is sufficiently smooth, with $c_{\min} > 0$ typically determined by the choice of link function.
6.1 Algorithm with GLM

In this section, we present a refined algorithm for the generalized linear model, referred to as LinFACT-GLM. This refinement replaces the sampling rule of Algorithm 1 with Algorithm 6 and adjusts the estimation method. The designed sampling policy is described by

$$N_i(r) = \left\lceil \frac{2 d\, \lambda_i(r)}{\epsilon_r^2\, c_{\min}^2} \log\!\left(\frac{2 K r (r+1)}{\delta}\right) \right\rceil, \qquad N_r = \sum_{i \in \mathcal{X}(r-1)} N_i(r), \tag{40}$$

where $c_{\min}$ is the known constant bounding the first derivative of the inverse link function.
Algorithm 6 Subroutine: G-Optimal Sampling with GLM
1: Input: projected active set $\mathcal{X}_I(r-1)$, round $r$, $\delta$.
2: Find the G-optimal design $\lambda_r \in \mathcal{P}(\mathcal{X}(r-1))$ with support size $\mathrm{Supp}(\lambda_r) \le d(d+1)/2$ according to equation (13).
3: for all $i \in \mathcal{X}_I(r-1)$ do ▷ Sampling
4:   Sample arm $i$ for $N_i(r)$ times in round $r$, as specified in equation (40).
In the GLM setting, ordinary least squares (OLS) is no longer applicable. Instead, the estimator for each $i \in \mathcal{X}_I(r-1)$ is obtained by maximizing the log-likelihood in equation (38), i.e., by solving the estimating equation (39) on the observed data.
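For a logistic link, the estimating equation (39) can be solved by Newton iterations, one concrete instance of the IRLS scheme mentioned above. The sketch below is illustrative (our own function names; a small ridge term is added to the Fisher matrix purely for numerical stability):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glm_mle(X_pulls, y, iters=50):
    """Newton's method for equation (39) with a logistic inverse link:
    find beta such that sum_s (y_s - mu_link(x_s^T beta)) x_s = 0.
    """
    d = X_pulls.shape[1]
    beta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X_pulls @ beta)
        grad = X_pulls.T @ (y - p)              # score function, eq. (39)
        W = p * (1 - p)                         # mu_link' at each pull
        H = X_pulls.T @ (W[:, None] * X_pulls)  # Fisher information matrix
        beta = beta + np.linalg.solve(H + 1e-8 * np.eye(d), grad)
    return beta
```

At convergence the score vanishes, so checking the residual of (39) at the returned estimate is a direct correctness test.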
6.2 Upper Bound for the GLM-Based LinFACT

Let $T^{\mathrm{GLM}}$ denote the number of samples collected under the GLM setting. The following theorem provides an upper bound on the expected sample complexity of LinFACT-GLM.

Theorem 6.1 (Upper Bound, Generalized Linear Model)
For $\Delta = \min(\alpha_\epsilon, \beta_\epsilon)/16$, there exists an event $\mathcal{E}$ such that $\mathbb{P}(\mathcal{E}) \ge 1 - \delta$. On this event, the LinFACT-GLM algorithm achieves an expected sample complexity upper bound given by

$$\mathbb{E}\left[T^{\mathrm{GLM}} \mid \mathcal{E}\right] = \mathcal{O}\!\left( \frac{d}{c_{\min}^2}\, \Delta^{-2} \log\!\left(\frac{K}{\delta} \log_2(\Delta^{-2})\right) + d^2 \log(\Delta^{-1}) \right). \tag{41}$$
The upper bound presented in Theorem 6.1, which generalizes the model to the GLM setting, can be viewed as an extension of Theorem 4.1. The detailed proof is provided in Section EC.11 of the online appendix.
7 Numerical Experiments

In the numerical experiments, we compare our algorithm, LinFACT, with several baseline methods. These include the Bayesian optimization algorithm based on the knowledge-gradient acquisition function with correlated beliefs for best arm identification (KGCB) proposed by Negoescu et al. (2011); the gap-based algorithm for best arm identification (BayesGap) introduced by Hoffman et al. (2013); the track-and-stop algorithm for threshold bandits (Lazy TTS) developed by Rivera and Tewari (2024); and two gap-based algorithms for top-$m$ arm identification, LinGIFA and m-LinGapE, presented by Réda et al. (2021a), which represent the state of the art for returning multiple candidates.

Identifying all $\epsilon$-best arms is often more challenging than identifying the top $m$ arms or the arms above a given threshold. To address this, we adopt a randomized setting in which both the number of $\epsilon$-best arms and the $\epsilon$-threshold are randomly sampled. In this setting, top-$m$ algorithms and threshold bandit algorithms have access only to the expected reward values, ensuring a fair comparison. Since BayesGap and KGCB operate under a fixed-budget setting, we use the average sample complexity of LinFACT as their budget and evaluate their performance accordingly. For BAI algorithms, we select the arms whose empirical means are within $\epsilon$ of the empirical best arm once the budget is exhausted.
Figure 4: Illustration of the Synthetic Experiment Settings. (a) Synthetic I, adaptive setting; (b) Synthetic II, static setting.
Note. In the adaptive setting (a), we randomly sample the threshold $\tilde{\mu}$ from a distribution with mean 1, and independently sample the number of $\epsilon$-best arms $\tilde{m}$ from a distribution with mean $m$. The arms above and below the threshold are then uniformly drawn from the intervals $[\tilde{\mu}, \tilde{\mu} + \epsilon]$ and $[\tilde{\mu} - \epsilon, \tilde{\mu})$, respectively. In the static setting (b), we fix the threshold at $\Delta/2$ and randomly sample $\tilde{m}$ $\epsilon$-best arms with reward $\Delta$.
7.1 Synthetic Experiments
Following Soare et al. (2014), Xu et al. (2018), and Azizi et al. (2021b), we categorize synthetic data into two types: adaptive and static settings. Figure 4 illustrates the construction of these synthetic datasets, with detailed configurations provided in Section EC.12. A summary of all settings is presented in Table 1.
In the adaptive setting, arms are divided into three categories: (1) the arms to be selected (i.e., all the $\epsilon$-best arms), (2) disturbing arms that are slightly worse, and (3) base arms with zero reward. The primary challenge for algorithms is to distinguish between the arms in categories (1) and (2), while the base arms in category (3) can be ignored. Adaptive algorithms that effectively leverage shared information to explore similar arms perform well in this setting.

In the static setting, arms are divided into two categories: the $\epsilon$-best arms and the base arms with zero reward. In this case, algorithms must distinguish between all arms, and static algorithms that explore all arms uniformly are well suited to this setting.
Table 1: Synthetic Experiment Settings

| Setting Index | Setting Category | Setting Details |
| --- | --- | --- |
| 1 | Adaptive | $(d, \mathbb{E}[m]) = (8, 4)$, $\epsilon = 0.1$ |
| 2 | Adaptive | $(d, \mathbb{E}[m]) = (8, 4)$, $\epsilon = 0.2$ |
| 3 | Adaptive | $(d, \mathbb{E}[m]) = (8, 4)$, $\epsilon = 0.3$ |
| 4 | Adaptive | $(d, \mathbb{E}[m]) = (12, 4)$, $\epsilon = 0.1$ |
| 5 | Adaptive | $(d, \mathbb{E}[m]) = (12, 4)$, $\epsilon = 0.2$ |
| 6 | Adaptive | $(d, \mathbb{E}[m]) = (12, 4)$, $\epsilon = 0.3$ |
| 7 | Static | $(K, \Delta) = (8, 1)$ |
| 8 | Static | $(K, \Delta) = (12, 1)$ |
| 9 | Static | $(K, \Delta) = (16, 1)$ |

Remark 7.1
A common misconception is that adaptive algorithms universally outperform static ones. While adaptive algorithms are typically efficient at focusing exploration on promising arms, they can be less effective in static environments, where they may allocate samples inefficiently between candidate and baseline arms, leading to redundant exploration. In contrast, static algorithms can achieve the objective more efficiently by allocating samples uniformly across all arms, avoiding bias and over-exploration.
Experiment Setup. We benchmark our algorithms, LinFACT-G and LinFACT-$\mathcal{XY}$, against BayesGap, KGCB, LinGIFA, m-LinGapE, and Lazy TTS, focusing on both sample complexity and the $F_1$ score.

We conduct each experiment using different data types (adaptive or static), arm dimensions ($d$), and numbers of arms ($K$). For each configuration, we generate 10 values of $m$ from a normal distribution centered at the expected value $\mathbb{E}[m]$ with variance 3.0, where $m$ is the input to a top-$m$ algorithm. For each sampled pair $(\tilde{m}, \tilde{\mu})$, where $\tilde{m}$ denotes the number of $\epsilon$-best arms and $\tilde{\mu}$ the value of the best arm minus $\epsilon$, we repeat the experiment 100 times. We then compute the average $F_1$ score across the 10 $(\tilde{m}, \tilde{\mu})$ pairs, resulting in 1,000 total executions per algorithm.
In practice, we observe that KGCB, LinGIFA, m-LinGapE, and Lazy TTS are computationally intensive when the sampling budget is high. The time-consuming nature of KGCB has already been noted in the literature (Negoescu et al. 2011). Lazy TTS requires repeatedly evaluating an objective function within an optimization problem, where each evaluation costs $\mathcal{O}(K d^2 + d^3)$; since the optimization involves a non-negligible number of iterations $n$, the total time complexity becomes $\mathcal{O}(n K d^2)$, making the algorithm inefficient.

For the two top-$m$ algorithms, the computational burden arises from performing matrix inversions for all arms, leading to a total time complexity of $\mathcal{O}(K d^3)$. In contrast, LinFACT achieves a significantly lower total time complexity of $\mathcal{O}(K d^2)$, as the optimization problem within our algorithm can be solved efficiently by fixed-step gradient descent. Table 2 reports the running times on synthetic data, showing that our algorithm is at least five times faster than all other methods. Notably, this performance gap is even more pronounced on real data.
Table 2: Running Time (seconds) for Different Synthetic Experiments Among Algorithms

| Algorithm | Adaptive $(8,4)$, $\epsilon=0.1$ | $(8,4)$, $\epsilon=0.2$ | $(8,4)$, $\epsilon=0.3$ | $(12,4)$, $\epsilon=0.1$ | $(12,4)$, $\epsilon=0.2$ | $(12,4)$, $\epsilon=0.3$ | Static $K=8$ | $K=12$ | $K=16$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LinFACT-G | **0.028** | **0.030** | **0.028** | **0.078** | **0.078** | **0.076** | **0.005** | **0.007** | **0.009** |
| LinFACT-$\mathcal{XY}$ | 0.037 | 0.038 | 0.038 | 0.163 | 0.155 | 0.145 | 0.015 | 0.028 | 0.049 |
| BayesGap | 0.112 | 0.110 | 0.110 | 0.306 | 0.304 | 0.307 | 0.036 | 0.071 | 0.112 |
| LinGIFA | 0.313 | 0.265 | 0.236 | 1.943 | 1.484 | 1.424 | 0.160 | 0.529 | 1.297 |
| m-LinGapE | 0.195 | 0.166 | 0.157 | 1.362 | 1.053 | 0.984 | 0.116 | 0.352 | 0.932 |
| Lazy TTS | 0.991 | 3.904 | 12.512 | 24.583 | 26.941 | 25.622 | 0.198 | 0.871 | 2.164 |
| KGCB | 1.990 | 1.936 | 1.670 | 6.245 | 6.611 | 6.082 | 0.303 | 0.745 | 1.386 |

Note. Adaptive columns give $(d, \mathbb{E}[m])$ and $\epsilon$; static columns use $\Delta = 1$. The best result in each setting is in bold; the second best in each setting is LinFACT-$\mathcal{XY}$.
Experiment Results. Our experimental results are presented in Figures 5 and 6. In Figure 5, the vertical axis denotes the $F_1$ score, with higher values indicating better algorithm performance. The first row of six plots shows the results under adaptive settings. As $\omega$ increases, the non-optimal (disturbing) arms in the adaptive setting move progressively farther from the optimal arms, making them easier to distinguish. Consequently, the $F_1$ score increases from left to right. An exception is BayesGap, which performs best when $\omega = 0.2$. This occurs because best-arm identification algorithms, such as BayesGap, struggle to differentiate optimal arms from disturbing ones when they are close ($\omega = 0.1$) and fail to fully explore the optimal arms when they are not closely clustered with the best arm ($\omega = 0.3$).
Our LinFACT algorithms consistently outperform the top-$m$ algorithms, BayesGap, and KGCB. While our algorithms perform slightly worse than Lazy TTS in some cases, they have much lower sample complexity, as shown in Figure 6, meaning that Lazy TTS requires substantially more samples to achieve these results. When comparing LinFACT-G and LinFACT-$\mathcal{XY}$, we observe that in adaptive settings the $F_1$ scores are similar, but LinFACT-$\mathcal{XY}$ achieves lower sample complexity. In static settings, however, LinFACT-G attains a higher $F_1$ score with reduced sample complexity. This difference stems from the distinct focus of the two designs: the $\mathcal{XY}$-optimal design prioritizes pulling arms to obtain better estimates along the directions representing differences between arms, while the G-optimal design aims to improve estimates along the directions representing all arms.
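To make the design contrast concrete, the sketch below (our own simplification, not the paper's code; `xy_design` is a hypothetical helper) adapts a Frank-Wolfe-style iteration to the difference directions $y = a - a'$, which is the quantity an $\mathcal{XY}$-style design controls, instead of the arm directions themselves.

```python
import numpy as np

def xy_design(X, n_iter=500, reg=1e-9):
    """Frank-Wolfe-style sketch of an XY-optimal design: minimize the
    worst-case variance y^T A(w)^{-1} y over difference directions y."""
    K, d = X.shape
    # all pairwise difference directions between arms
    Y = np.array([X[i] - X[j] for i in range(K) for j in range(K) if i != j])
    w = np.full(K, 1.0 / K)
    for t in range(n_iter):
        A = X.T @ (w[:, None] * X) + reg * np.eye(d)
        A_inv = np.linalg.inv(A)
        # worst-estimated difference direction under the current design
        y = Y[int(np.argmax(np.einsum('ij,jk,ik->i', Y, A_inv, Y)))]
        # pull the arm whose observation most shrinks variance along y
        j = int(np.argmax((X @ (A_inv @ y)) ** 2))
        gamma = 1.0 / (t + 2)
        w *= (1.0 - gamma)
        w[j] += gamma
    return w
```

Compared with the G-optimal sketch, only the set over which the worst-case variance is measured changes, which mirrors the distinction drawn in the paragraph above.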
Figure 5: $F_1$ Scores for Different Synthetic Experiments Among Algorithms
Note. The y-axis reports the $F_1$ score, which reflects how accurately each algorithm identifies $\epsilon$-best arms. LinFACT-G and LinFACT-$\mathcal{XY}$ consistently achieve high $F_1$ scores. While Lazy TTS occasionally attains higher scores, we demonstrate in the next figure that it requires significantly more samples to do so. Detailed configurations of each experimental setting are provided in Table 1.
Figure 6: Sample Complexity for Different Synthetic Experiments Among Algorithms
Note. LinFACT-G and LinFACT-$\mathcal{XY}$ demonstrate sample complexities comparable to LinGIFA and m-LinGapE while achieving higher $F_1$ scores in the adaptive settings (Settings 1 to 6), and consistently outperform all other algorithms in the static settings (Settings 7 to 9). BayesGap and KGCB are excluded from this comparison as they are designed for the fixed-budget setting, and thus their sample complexity is not well-defined.
7.2 Experiments with Real Data: Drug Discovery
Experiment Setup. We adopt the Free-Wilson model (Free and Wilson 1964) and use real data from a drug discovery task (Katz et al. 1977). The Free-Wilson model is a linear framework in which the overall efficacy of a compound is expressed as the sum of the contributions from each substituent on the base molecule, along with the effect of the base molecule itself (Negoescu et al. 2011).
Figure 7: An Example of a Molecule and Substituent Locations with Sites $1, \ldots, 5$
Note. We begin with a base molecule containing multiple attachment sites for chemical substituents. By varying these substituents, we generate a diverse set of compounds and aim to identify those with desirable properties.
Each compound is modeled as an arm represented by a binary indicator vector. Suppose there are $M$ modification sites, with site $i \in [M]$ offering $k_i$ alternative substituents. Then each arm $\boldsymbol{a}$ lies in $\mathbb{R}^{1 + \sum_{i \in [M]} k_i}$, where the initial entry corresponds to the base molecule (i.e., the intercept term). For each site, the corresponding segment of the vector has exactly one entry set to 1 (indicating the chosen substituent), with the remaining $k_i - 1$ entries set to 0. This results in a total of $\prod_{i \in [M]} k_i$ unique compound configurations.
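As a concrete illustration of this encoding (a minimal sketch; `build_arms` is our own helper name, and the site sizes follow the Katz et al. (1977) setup described below), one can enumerate every compound's indicator vector:

```python
import numpy as np
from itertools import product

def build_arms(site_sizes):
    """Construct binary feature vectors for all compounds.

    site_sizes: list of k_i, the number of candidate substituents per
    site. Each arm lives in R^{1 + sum(k_i)}: a leading 1 for the base
    molecule (intercept), then one one-hot block per site.
    """
    dim = 1 + sum(site_sizes)
    arms = []
    for choice in product(*[range(k) for k in site_sizes]):
        v = np.zeros(dim)
        v[0] = 1.0                      # base-molecule intercept
        offset = 1
        for site, k in enumerate(site_sizes):
            v[offset + choice[site]] = 1.0  # chosen substituent at this site
            offset += k
        arms.append(v)
    return np.array(arms)

arms = build_arms([4, 5, 4, 3, 4])
print(arms.shape)  # (960, 21): 4*5*4*3*4 compounds, each in R^21
```

Each row sums to 6 (the intercept plus one active substituent at each of the five sites), matching the dimensions reported for the real data.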
We conducted our experiments using the data described in Katz et al. (1977), retaining only the non-zero entries. The base molecule, illustrated in Figure 7, contains five sites where substituents can be attached. The sites offer 4, 5, 4, 3, and 4 candidate substituents, respectively, resulting in $4 \times 5 \times 4 \times 3 \times 4 = 960$ possible compounds. Each arm $\boldsymbol{a}$ is represented as a vector in $\mathbb{R}^{21}$.
We benchmarked our algorithms, LinFACT-G and LinFACT-$\mathcal{XY}$, against LinGIFA and m-LinGapE, evaluating how precision, recall, $F_1$ score, and sample complexity vary with the failure probability $\delta$. In this study, the good set was defined to include 20 $\epsilon$-best arms, which corresponds to $\epsilon = 4.325$. All methods were tested over 10 trials, with $\delta$ ranging from 0.1 to 0.9 in increments of 0.1.
Experiment Results. The experimental results are shown in Figure 8. LinFACT-G and LinFACT-$\mathcal{XY}$ consistently deliver high precision, recall, and $F_1$ scores across various failure probabilities $\delta$, as shown in Figures 8a–8c. In particular, their precision remains close to 1.0, indicating that nearly all selected arms are truly $\epsilon$-best. Their recall is also robust across $\delta$, suggesting strong coverage of the good set with minimal omission. This balance between high precision and recall leads to $F_1$ scores that remain consistently near optimal, even as $\delta$ varies from 0.1 to 0.9. In contrast, LinGIFA and m-LinGapE exhibit larger fluctuations and overall lower values in all three metrics, with especially degraded recall and $F_1$ performance at moderate values of $\delta$.
In addition, as illustrated in Figure 8d, LinFACT-$\mathcal{XY}$ achieves the lowest sample complexity, followed closely by LinFACT-G, with both outperforming all baseline methods. These performance disparities highlight the reliability and robustness of the LinFACT methods. Finally, LinFACT-G and LinFACT-$\mathcal{XY}$ also demonstrate strong computational efficiency: both complete the task within one minute, whereas LinGIFA and m-LinGapE take about four times longer, and Lazy TTS requires over 30 minutes.
(a) Precision (b) Recall (c) $F_1$ Score (d) Sample Complexity. Figure 8: Precision, Recall, and $F_1$ Score for Various Failure Probabilities $\delta$
Note. LinFACT-G and LinFACT-$\mathcal{XY}$ consistently demonstrate the best performance across all metrics. LinFACT-$\mathcal{XY}$ achieves the lowest sample complexity while maintaining high accuracy. BayesGap and KGCB are excluded as they operate under a fixed-budget setting. Lazy TTS is also omitted due to excessive runtime.
8 Conclusion
In this paper, we address the challenge of identifying all $\epsilon$-best arms in linear bandits, motivated by applications such as drug discovery. We establish the first information-theoretic lower bound to characterize the problem's complexity and derive a matching upper bound. Our LinFACT algorithm achieves instance-optimal performance up to a logarithmic factor under the $\mathcal{XY}$-optimal design criterion.
We further extend our analysis to settings with model misspecification and generalized linear models (GLMs), deriving new upper bounds and providing insights into algorithmic behavior under these broader conditions. These results generalize and recover the guarantees from the perfectly linear case as special instances. Our numerical experiments confirm that LinFACT outperforms existing methods in both sample and computational efficiency, while maintaining high accuracy in identifying all $\epsilon$-best arms.
Future Research Directions. First, while our current work focuses on the fixed-confidence setting, many real-world applications operate under a fixed sampling budget, where the objective is to achieve the best outcome within a limited number of trials. Extending the algorithm to this fixed-budget setting and establishing corresponding theoretical guarantees remains an important avenue for exploration. Moreover, deriving fundamental lower bounds in the fixed-budget regime is still an open question. Second, although we have proposed extensions for both misspecified linear bandits and generalized linear models (GLMs), future work could benefit from developing separate algorithms tailored to each setting. Such targeted designs may offer improved performance.
References

- Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems 24.
- Abe N, Long PM (1999) Associative reinforcement learning using linear probabilistic concepts. ICML, 3–11 (Citeseer).
- Abernethy JD, Amin K, Zhu R (2016) Threshold bandits, with and without censored feedback. Advances in Neural Information Processing Systems 29.
- Al Marjani A, Kocak T, Garivier A (2022) On the complexity of all $\epsilon$-best arms identification. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 317–332 (Springer).
- Alaei S, Makhdoumi A, Malekian A, Pekeč S (2022) Revenue-sharing allocation strategies for two-sided media platforms: Pro-rata vs. user-centric. Management Science 68(12):8699–8721.
- Allen-Zhu Z, Li Y, Singh A, Wang Y (2017) Near-optimal discrete optimization for experimental design: A regret minimization approach.
- Allen-Zhu Z, Li Y, Singh A, Wang Y (2021) Near-optimal discrete optimization for experimental design: A regret minimization approach. Mathematical Programming 186:439–478.
- Azizi MJ, Kveton B, Ghavamzadeh M (2021) Fixed-budget best-arm identification in structured bandits. arXiv preprint arXiv:2106.04763.
- Bechhofer RE (1954) A single-sample multiple decision procedure for ranking means of normal populations with known variances. The Annals of Mathematical Statistics 16–39.
- Boyd EA, Bilegan IC (2003) Revenue management and e-commerce. Management Science 49(10):1363–1386.
- Bubeck S, Cesa-Bianchi N, et al. (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5(1):1–122.
- Bubeck S, Wang T, Viswanathan N (2013) Multiple identifications in multi-armed bandits. International Conference on Machine Learning, 258–265 (PMLR).
- Chen CH, Lin J, Yücesan E, Chick SE (2000) Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynamic Systems 10:251–270.
- Das P, Sercu T, Wadhawan K, Padhi I, Gehrmann S, Cipcigan F, Chenthamarakshan V, Strobelt H, Dos Santos C, Chen PY, et al. (2021) Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nature Biomedical Engineering 5(6):613–623.
- Elmaghraby W, Keskinocak P (2003) Dynamic pricing in the presence of inventory considerations: Research overview, current practices, and future directions. Management Science 49(10):1287–1309.
- Even-Dar E, Mannor S, Mansour Y, Mahadevan S (2006) Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research 7(6).
- Fan W, Hong LJ, Nelson BL (2016) Indifference-zone-free selection of the best. Operations Research 64(6):1499–1514.
- Feng Q, Ma T, Zhu R (2025) Satisficing regret minimization in bandits. Proceedings of the 13th International Conference on Learning Representations.
- Feng Y, Caldentey R, Ryan CT (2022) Robust learning of consumer preferences. Operations Research 70(2):918–962.
- Fiez T, Jain L, Jamieson KG, Ratliff L (2019) Sequential experimental design for transductive linear bandits. Advances in Neural Information Processing Systems 32.
- Frazier P, Powell W, Dayanik S (2009) The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing 21(4):599–613.
- Frazier PI, Powell WB, Dayanik S (2008) A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization 47(5):2410–2439.
- Free SM, Wilson JW (1964) A mathematical contribution to structure-activity studies. Journal of Medicinal Chemistry 7(4):395–399.
- Gabillon V, Ghavamzadeh M, Lazaric A (2012) Best arm identification: A unified approach to fixed budget and fixed confidence. Advances in Neural Information Processing Systems 25.
- Ghosh A, Chowdhury SR, Gopalan A (2017) Misspecified linear bandits. Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
- Godinho de Matos M, Ferreira P, Smith MD (2018) The effect of subscription video-on-demand on piracy: Evidence from a household-level randomized experiment. Management Science 64(12):5610–5630.
- Hoffman MW, Shahriari B, de Freitas N (2013) Exploiting correlation and budget constraints in Bayesian multi-armed bandit optimization. arXiv preprint arXiv:1303.6746.
- Hong LJ, Fan W, Luo J (2021) Review on ranking and selection: A new perspective. Frontiers of Engineering Management 8(3):321–343.
- Huang Z, Zeng DD, Chen H (2007) Analyzing consumer-product graphs: Empirical findings and applications in recommender systems. Management Science 53(7):1146–1164.
- Kalyanakrishnan S, Stone P (2010) Efficient selection of multiple bandit arms: Theory and practice. ICML, volume 10, 511–518.
- Kalyanakrishnan S, Tewari A, Auer P, Stone P (2012) PAC subset selection in stochastic multi-armed bandits. ICML, volume 12, 655–662.
- Katz R, Osborne SF, Ionescu F (1977) Application of the Free-Wilson technique to structurally related series of homologs. Quantitative structure-activity relationship studies of narcotic analgetics. Journal of Medicinal Chemistry 20(11):1413–1419.
- Kaufmann E, Cappé O, Garivier A (2016) On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research 17:1–42.
- Kaufmann E, Kalyanakrishnan S (2013) Information complexity in bandit subset selection. Conference on Learning Theory, 228–251 (PMLR).
- Kim SH, Nelson BL (2001) A fully sequential procedure for indifference-zone selection in simulation. ACM Transactions on Modeling and Computer Simulation (TOMACS) 11(3):251–273.
- Kim SH, Nelson BL (2006) Selecting the best system. Handbooks in Operations Research and Management Science 13:501–534.
- Koenig LW, Law AM (1985) A procedure for selecting a subset of size m containing the l best of k independent normal populations, with applications to simulation. Communications in Statistics-Simulation and Computation 14(3):719–734.
- Komiyama J, Ariu K, Kato M, Qin C (2023) Rate-optimal Bayesian simple regret in best arm identification. Mathematics of Operations Research.
- Kveton B, Zaheer M, Szepesvari C, Li L, Ghavamzadeh M, Boutilier C (2023) Randomized exploration in generalized linear bandits.
- Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press).
- Lattimore T, Szepesvari C, Weisz G (2020) Learning with good feature representations in bandits and in RL with a generative model.
- Li L, Lu Y, Zhou D (2017) Provably optimal algorithms for generalized linear contextual bandits. International Conference on Machine Learning, 2071–2080 (PMLR).
- Li Z, Fan W, Hong LJ (2024) The (surprising) sample optimality of greedy procedures for large-scale ranking and selection. Management Science.
- Locatelli A, Gutzeit M, Carpentier A (2016) An optimal algorithm for the thresholding bandit problem. International Conference on Machine Learning, 1690–1698 (PMLR).
- Mannor S, Tsitsiklis JN (2004) The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research 5(Jun):623–648.
- Mason B, Jain L, Tripathy A, Nowak R (2020) Finding all $\epsilon$-good arms in stochastic bandits. Advances in Neural Information Processing Systems 33:20707–20718.
- McCullagh P (2019) Generalized Linear Models (Routledge).
- Negoescu DM, Frazier PI, Powell WB (2011) The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing 23(3):346–363.
- Peukert C, Sen A, Claussen J (2023) The editor and the algorithm: Recommendation technology in online news. Management Science.
- Pukelsheim F (2006) Optimal Design of Experiments (SIAM).
- Qin C, You W (2025) Dual-directed algorithm design for efficient pure exploration. Operations Research.
- Réda C, Kaufmann E, Delahaye-Duriez A (2021a) Top-m identification for linear bandits. International Conference on Artificial Intelligence and Statistics, 1108–1116 (PMLR).
- Réda C, Tirinzoni A, Degenne R (2021b) Dealing with misspecification in fixed-confidence linear top-m identification. Advances in Neural Information Processing Systems 34:25489–25501.
- Rivera EO, Tewari A (2024) Optimal thresholding linear bandit. arXiv preprint arXiv:2402.09467.
- Russo D (2020) Simple Bayesian algorithms for best-arm identification. Operations Research 68(6):1625–1647.
- Ryzhov IO, Powell WB, Frazier PI (2012) The knowledge gradient algorithm for a general class of online learning problems. Operations Research 60(1):180–195.
- Shen H, Hong LJ, Zhang X (2021) Ranking and selection with covariates for personalized decision making. INFORMS Journal on Computing 33(4):1500–1519.
- Shin D, Broadie M, Zeevi A (2018) Tractable sampling strategies for ordinal optimization. Operations Research 66(6):1693–1712.
- Simchi-Levi D, Wang C, Xu J (2024) On experimentation with heterogeneous subgroups: An asymptotic optimal $\delta$-weighted-PAC design. SSRN Electronic Journal.
- Soare M, Lazaric A, Munos R (2014) Best-arm identification in linear bandits. Advances in Neural Information Processing Systems 27.
- Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3-4):285–294.
- Thornton C, Hutter F, Hoos HH, Leyton-Brown K (2013) Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 847–855.
- Wolke R, Schwetlick H (1988) Iteratively reweighted least squares: Algorithms, convergence analysis, and numerical comparisons. SIAM Journal on Scientific and Statistical Computing 9(5):907–921.
- Xu L, Honda J, Sugiyama M (2018) A fully adaptive algorithm for pure exploration in linear bandits. International Conference on Artificial Intelligence and Statistics, 843–851 (PMLR).
- Yang J, Tan V (2022) Minimax optimal fixed-budget best arm identification in linear bandits. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds., Advances in Neural Information Processing Systems, volume 35, 12253–12266 (Curran Associates, Inc.).
- Zanette A, Lazaric A, Kochenderfer M, Brunskill E (2020) Learning near optimal policies with low inherent Bellman error. International Conference on Machine Learning, 10978–10989 (PMLR).
E-Companion: Identifying All $\epsilon$-Best Arms in Linear Bandits with Misspecification
Appendix EC.1 Additional Literature
EC.1.1 Misspecified Linear Bandits.
The linear bandit (LB) problem, introduced by Abe and Long (1999), extends the multi-armed bandits (MABs) framework by incorporating structural relationships among different arms. In the context of best arm identification, Garivier and Kaufmann (2016) established a classical lower bound, which was later extended to linear bandits by Fiez et al. (2019) using transportation inequalities.
The foundational study of linear bandits in the pure exploration framework was conducted by Hoffman et al. (2014), who addressed the best arm identification (BAI) problem in a fixed-budget setting while considering correlations among arm distributions. They proposed BayesGap, a Bayesian variant of the gap-based exploration algorithm (Gabillon et al. 2012). Although BayesGap outperformed methods that ignore correlations and structural relationships, its limitation of ceasing to pull arms deemed sub-optimal hindered its effectiveness in linear bandit pure exploration.
A key distinction between stochastic MABs and linear bandits is that, in MABs, once an arm's sub-optimality is confirmed with high probability, it is no longer pulled. In linear bandits, however, even sub-optimal arms can offer valuable information about the parameter vector, improving confidence in estimates and aiding the discrimination of near-optimal arms. This insight has led to the adoption of optimal linear experiment design as a crucial framework for linear bandit pure exploration (Abbasi-Yadkori et al. 2011, Soare et al. 2014, Fiez et al. 2019, Réda et al. 2021, Yang and Tan 2021, Azizi et al. 2021a).
When applying linear models to real data, misspecification inevitably arises in situations where the data deviates from perfect linearity. The concept of misspecified bandit models was introduced in the context of cumulative regret by Ghosh et al. (2017), who demonstrated a significant limitation: any linear bandit algorithm (e.g., OFUL (Abbasi-Yadkori et al. 2011) or LinUCB (Li et al. 2010)), which achieves optimal regret bounds on perfectly linear instances, can suffer linear regret on certain misspecified models. To address this, they proposed a hypothesis-test-based algorithm that avoids linear regret and achieves UCB-type sublinear regret for models with non-sparse deviations from linearity. Lattimore et al. (2020) further analyzed misspecification, showing that elimination-based algorithms with G-optimal design perform well under misspecification but incur an additional linear regret term proportional to the misspecification magnitude over the horizon.
In the pure exploration setting, misspecified linear models were first studied in the context of identifying the top-$m$ arms by Réda et al. (2021), who introduced the MisLid algorithm, leveraging orthogonal parameterization to address misspecification. Subsequent research examined misspecification in ordinal optimization (Ahn et al. 2024), proposing prospective sampling methods that reduce the impact of misspecification as the sample size increases. Building on the definition of model misspecification and the optimization approach based on orthogonal parameterization from Réda et al. (2021), we develop new algorithms for identifying all $\epsilon$-best arms in misspecified linear bandits and establish new upper bounds.
EC.1.2 Generalized Linear Bandits.
The generalized linear bandit (GLB) model (Filippi et al. 2010, Ahn and Shin 2020, Kveton et al. 2023) extends the multi-armed bandit framework by incorporating generalized linear models (GLMs) (McCullagh 2019) to model expected rewards. Specifically, the expected reward of each arm is given by a known link function applied to the inner product of a feature vector and an unknown parameter vector. Most existing algorithms for generalized linear bandits employ the upper confidence bound (UCB) approach, with randomized GLM algorithms (Chapelle and Li 2011, Russo et al. 2018, Kveton et al. 2023) demonstrating superior performance.
In the context of pure exploration, Azizi et al. (2021b) introduced the first practical algorithm for best arm identification in generalized linear bandits, supported by theoretical analysis. Their work extends the best arm identification problem from linear models to more complex settings where the relationship between features and rewards follows generalized linear models (GLMs). Building on this foundation, we extend the pure exploration setting from best arm identification (BAI) to identifying all $\epsilon$-best arms, providing analogous analyses and theoretical results for GLMs.
Appendix EC.2 General Pure Exploration Model
In this section, we present a brief discussion about the general pure exploration problem. A more comprehensive explanation can be found in Qin and You (2025).
The decision-maker seeks to answer a query concerning the mean parameters $\boldsymbol{\mu}$ by adaptively allocating the sampling budget across the available arms. This query typically involves identifying a subset of arms that satisfy certain criteria, and the goal is to determine the correct answer with high probability. Let $\mathcal{I}^\star(\boldsymbol{\mu})$ denote the correct answer, which in our setting corresponds to the set of all $\epsilon$-best arms. Define $\tilde{\mathcal{M}}$ as the set of parameters that yield a unique answer. Let $\mathbb{I}$ represent the collection of all possible answers, and for each $\mathcal{I}' \in \mathbb{I}$, define $\mathcal{M}_{\mathcal{I}'} \triangleq \{\boldsymbol{\mu} \in \tilde{\mathcal{M}} : \mathcal{I}^\star(\boldsymbol{\mu}) = \mathcal{I}'\}$ as the set of parameters for which $\mathcal{I}'$ is the correct answer. The overall parameter space of interest is then given by $\mathcal{M} \triangleq \cup_{\mathcal{I}' \in \mathbb{I}} \mathcal{M}_{\mathcal{I}'}$.
Recall that an algorithm is defined as a triplet $\mathfrak{A} = (A_t, \tau_\delta, \hat{\mathcal{I}}_\tau)$, consisting of a sampling rule, a stopping time, and a recommendation rule. The algorithm's sample complexity is quantified by the number of samples, denoted $\tau_\delta$, at the point of termination. The objective is to formulate algorithms that minimize the expected sample complexity $\mathbb{E}_{\boldsymbol{\mu}}[\tau_\delta]$ across the set $\mathcal{M}$. As stated in Kaufmann et al. (2016), when $\delta \in (0, 1)$, the non-asymptotic problem complexity of an instance $\boldsymbol{\mu}$ can be defined as

$$\Gamma^\star(\boldsymbol{\mu}) \triangleq \inf_{\mathfrak{A} \text{ is } \delta\text{-PAC}} \frac{\mathbb{E}_{\boldsymbol{\mu}}[\tau_\delta]}{\log(1/2.4\delta)}. \tag{EC.2.1}$$

This instance-dependent complexity indicates the smallest possible constant such that the expected sample complexity $\mathbb{E}_{\boldsymbol{\mu}}[\tau_\delta]$ scales in alignment with $\log(1/2.4\delta)$. The problem complexity $\Gamma^\star(\boldsymbol{\mu})$ is subject to an information-theoretic lower bound. This lower bound can be expressed as the optimal solution of an allocation problem, which we present in Proposition 3.1. To build this framework, we next introduce three important concepts: culprits, alternative sets, and the $C_x$ function.
EC.2.1 Culprits and Alternative Sets
Let $\mathcal{X}(\boldsymbol{\mu})$ denote the set of culprits under the true mean vector $\boldsymbol{\mu}$. These culprits are responsible for deviations from the correct answer $\mathcal{I}^\star(\boldsymbol{\mu})$. The structure of $\mathcal{X}(\boldsymbol{\mu})$ varies depending on the specific exploration task, and identifying these culprits is essential for characterizing the problem's complexity and guiding the design of effective algorithms.
To identify the correct answer, an algorithm must distinguish among different instances within the parameter space $\mathcal{M}$. Accordingly, for any instance $\boldsymbol{\mu} \in \mathcal{M}$, the instance-dependent problem complexity $\Gamma^\star(\boldsymbol{\mu})$ is determined by the structure of the corresponding alternative set

$$\mathrm{Alt}(\boldsymbol{\mu}) \triangleq \{\boldsymbol{\lambda} \in \mathcal{M} : \mathcal{I}^\star(\boldsymbol{\lambda}) \neq \mathcal{I}^\star(\boldsymbol{\mu})\} = \bigcup_{x \in \mathcal{X}(\boldsymbol{\mu})} \mathrm{Alt}_x(\boldsymbol{\mu}), \tag{EC.2.2}$$

which represents the set of parameters that return a solution different from the correct solution $\mathcal{I}^\star(\boldsymbol{\mu})$.
As an example, consider the task of identifying the single best arm. In this case, the culprit set is $\mathcal{X}(\boldsymbol{\mu}) = [K] \setminus \{I^\star(\boldsymbol{\mu})\}$, consisting of all arms except the current best arm $I^\star(\boldsymbol{\mu})$. For each culprit $x \in \mathcal{X}(\boldsymbol{\mu})$, if there exists a parameter $\boldsymbol{\lambda}$ under which arm $x$ has a higher mean than $I^\star(\boldsymbol{\mu})$, then $\boldsymbol{\lambda}$ leads to an incorrect identification caused by $x$. Each such culprit is associated with an alternative set, namely the set of parameters that yield a wrong answer due to $x$, given by $\mathrm{Alt}_x(\boldsymbol{\mu}) = \{\boldsymbol{\lambda} \in \mathcal{M} : \lambda_x \geq \lambda_{I^\star(\boldsymbol{\mu})}\}$ for $x \in \mathcal{X}(\boldsymbol{\mu})$.
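For Gaussian rewards with known variance, projecting $\boldsymbol{\mu}$ onto such an alternative set, that is, minimizing the allocation-weighted KL divergence over $\mathrm{Alt}_x$ (the quantity formalized as $C_x$ in Section EC.2.2), admits a closed form: only the means of arm $x$ and the best arm move, pooling at their weighted average. The helper below is our own illustrative code, not the paper's:

```python
import numpy as np

def C_x(w, mu, x, sigma=1.0):
    """Compute inf over Alt_x of sum_i w_i * KL(N(mu_i, s^2), N(lam_i, s^2)).

    Only lam_x and lam_best differ from mu at the infimum; the constraint
    lam_x >= lam_best binds, so the two coordinates pool at their
    w-weighted mean.
    """
    best = int(np.argmax(mu))
    if x == best:
        return 0.0                      # mu itself already lies in Alt_x
    pooled = (w[best] * mu[best] + w[x] * mu[x]) / (w[best] + w[x])
    return (w[best] * (mu[best] - pooled) ** 2
            + w[x] * (mu[x] - pooled) ** 2) / (2.0 * sigma ** 2)
```

One can check that, for unit variance, this equals $\frac{w_{I^\star} w_x}{2(w_{I^\star} + w_x)}(\mu_{I^\star} - \mu_x)^2$, the familiar transportation cost for best-arm identification.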
EC.2.2 The $C_x$ Function
The task of identifying the correct answer can be formulated as a sequential hypothesis testing problem, which can be addressed using the Sequential Generalized Likelihood Ratio (SGLR) test (Kaufmann et al. 2016, Kaufmann and Koolen 2021). The SGLR statistic is defined to test a potentially composite null hypothesis $H_0: (\boldsymbol{\mu} \in \Omega_0)$ against a potentially composite alternative hypothesis $H_1: (\boldsymbol{\mu} \in \Omega_1)$, and is given by

$$\mathrm{SGLR}_t = \frac{\sup_{\boldsymbol{\lambda} \in \Omega_0 \cup \Omega_1} L(X_1, X_2, \ldots, X_t; \boldsymbol{\lambda})}{\sup_{\boldsymbol{\lambda} \in \Omega_0} L(X_1, X_2, \ldots, X_t; \boldsymbol{\lambda})}, \tag{EC.2.3}$$

where $X_1, X_2, \ldots, X_t$ are the observed values from arm pulls, and $L(\cdot)$ denotes the likelihood function based on these observations and an unknown parameter $\boldsymbol{\lambda}$. The set $\Omega_0$ corresponds to the restricted parameter space under the null hypothesis, while $\Omega_0 \cup \Omega_1$ defines the full parameter space under consideration, encompassing both the null and alternative hypotheses. These sets correspond to the alternative regions introduced in the previous section. A large value of $\mathrm{SGLR}_t$ indicates stronger evidence against the null hypothesis and supports rejecting it.
We consider distributions from a single-parameter exponential family parameterized by their means, following the formulation in Garivier and Kaufmann (2016). This family includes the Bernoulli, Poisson, and Gamma distributions with known shape parameters, as well as the Gaussian distribution with known variance. For each culprit $x \in \mathcal{X}(\hat{\boldsymbol{\mu}})$, where $\hat{\boldsymbol{\mu}}$ is the empirical mean based on observed data, we test the hypotheses $H_{0,x}: \boldsymbol{\mu} \in \mathrm{Alt}_x(\hat{\boldsymbol{\mu}})$ versus $H_{1,x}: \boldsymbol{\mu} \notin \mathrm{Alt}_x(\hat{\boldsymbol{\mu}})$. When $\hat{\boldsymbol{\mu}}(t) \in \Omega_0 \cup \Omega_1$, the generalized likelihood ratio statistic in equation (EC.2.3) can be expressed in terms of a self-normalized sum. This leads to a formal expression of the SGLR statistic in Proposition EC.1, derived through maximum likelihood estimation and a reformulation of the KL divergence.
Proposition EC.1 (Kaufmann and Koolen (2021))
The generalized likelihood ratio statistic for each culprit $x \in \mathcal{X}(\boldsymbol{\mu})$ at time step $t$ is given by

$$\hat{\Lambda}_{t,x} = \ln(\mathrm{SGLR}_t) = \inf_{\boldsymbol{\lambda} \in \mathrm{Alt}_x(\hat{\boldsymbol{\mu}}(t))} \sum_{i \in [K]} N_i(t)\, \mathrm{KL}\big(\hat{\mu}_i(t), \lambda_i\big), \tag{EC.2.4}$$

where $\mathrm{KL}(\cdot, \cdot)$ represents the KL divergence between the two distributions parameterized by their means, and $N_i(t) = t\, w_i$ is the expected number of observations allocated to arm $i \in [K]$ up to time $t$.
This proposition links the SGLR test to information-theoretic methods. To quantify the information and confidence required to assert that the true mean does not lie in $\mathrm{Alt}_x$ for all $x \in \mathcal{X}$, we define the $C_x$ function as the population version of the SGLR statistic, sharing the same form as equation (EC.2.4):

$$C_x(\boldsymbol{w}) = C_x(\boldsymbol{w}; \boldsymbol{\mu}) \triangleq \inf_{\boldsymbol{\lambda} \in \mathrm{Alt}_x} \sum_{i \in [K]} w_i\, \mathrm{KL}(\mu_i, \lambda_i). \tag{EC.2.5}$$
With the introduction of culprits and the ๐ถ ๐ฅ function, we arrive at the optimal allocation problem that defines the lower bound stated in Proposition 3.1. However, computing the lower bound can still be hard since it requires the solution of the minimax problem in equation (18). While the KL divergence in equation (EC.2.5) is convex for Gaussians, it can be non-convex to minimize the ๐ถ ๐ฅ function over the culprit set ๐ณ โ ( ๐ ) . To solve this problem, based on the following Proposition EC.2, we can write ๐ณ โ ( ๐ ) as a union of several convex sets. The following three equivalent expressions represent different ways of describing the lower bound, making the minimax problem in equation (18) tractable for every ๐ฅ โ ๐ณ .
$$\Gamma^*_{\mu} \coloneqq \max_{\omega \in \mathcal{S}_K} \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_{a \in [K]} \omega_a\,\mathrm{KL}(\mu_a, \lambda_a) = \max_{\omega \in \mathcal{S}_K} \min_{x \in \mathcal{X}} \inf_{\lambda \in \mathrm{Alt}_x} \sum_{a \in [K]} \omega_a\,\mathrm{KL}(\mu_a, \lambda_a) \tag{EC.2.6}$$
$$= \max_{\omega \in \mathcal{S}_K} \min_{x \in \mathcal{X}} \sum_{a \in [K]} \omega_a\,\mathrm{KL}(\mu_a, \lambda^x_a) \tag{EC.2.7}$$
$$= \max_{\omega \in \mathcal{S}_K} \min_{x \in \mathcal{X}} C_x(\omega), \tag{EC.2.8}$$
where we utilized the existence of a finite union decomposition and a unique minimizer $\lambda^x$ from Proposition EC.2.
Proposition EC.2
Assume that the distribution of each arm belongs to a canonical single-parameter exponential family, parameterized by its mean. Then, for each culprit $x \in \mathcal{X}(\mu)$:
1. (Wang et al. 2021) For each problem instance $\mu \in \mathcal{M}$, the alternative set $\mathrm{Alt}(\mu)$ is a finite union of convex sets. Namely, there exists a finite collection of convex sets $\{\mathrm{Alt}_x(\mu) : x \in \mathcal{X}(\mu)\}$ such that $\mathrm{Alt}(\mu) = \bigcup_{x \in \mathcal{X}(\mu)} \mathrm{Alt}_x(\mu)$.
2. Given a specific simplex distribution $\omega$, there exists a unique $\lambda^x \in \mathrm{Alt}_x(\mu)$ that achieves the infimum in equation (EC.2.5).
Proof EC.3
Proof. The proof proceeds in two parts. For the first part, the alternative set for any given culprit $x$ in our setting is given by
$$\mathrm{Alt}_x(\mu) = \mathrm{Alt}_{i,j}(\mu) \cup \mathrm{Alt}_j(\mu), \tag{EC.2.9}$$
where $\mathrm{Alt}_{i,j}(\mu)$ and $\mathrm{Alt}_j(\mu)$ are defined in (EC.4.10) and (EC.4.19), respectively. Each of these two component sets is convex, since each is described by linear constraints on the mean vector: any convex combination of two points in the set satisfies the same constraints and hence remains in the set. Consequently, $\mathrm{Alt}(\mu)$ is a finite union of convex sets.
For the second part, when the reward distribution belongs to a single-parameter exponential family, the KL divergence $\mathrm{KL}(\mu, \mu')$ is continuous and strictly convex. This ensures that the infimum in equation (EC.2.5) over each convex component is achieved by a unique $\lambda$. \Halmos
EC.2.3 Stopping Rule
In this section, we introduce the stopping rule, which determines when the algorithm stops and returns an answer containing all $\epsilon$-best arms with probability at least $1-\delta$. This stopping rule is based on deviation inequalities linked to the generalized likelihood ratio test (Kaufmann and Koolen 2021).
For each $x_m \in \mathcal{X}(\lambda)$ with $m \in [|\mathcal{X}|]$, let $\mathcal{O}_m(\lambda) \coloneqq \mathrm{Alt}_{x_m}(\lambda)$ denote an element of a partition of the realizable parameter space $\mathcal{O}$ introduced in Section EC.2.1, where each $\mathcal{O}_m(\lambda)$ is associated with a distinct culprit in $\mathcal{X}(\lambda)$. This implies that for any $\lambda \in \mathcal{O}$, the parameter space $\mathcal{O}$ can be uniquely partitioned to support a hypothesis testing framework. Let $\mathcal{O}_0(\lambda)$ denote the subset of parameters where $\lambda$ resides. If $\mu \in \mathcal{O}$, define $m^*(\mu)$ as the index of the unique element of the partition to which the true mean vector $\mu$ belongs; here $m^*(\mu) = 0$. In other words, we have $\mu \in \mathcal{O}_0$ and $\mathrm{Alt}(\mu) = \mathcal{O} \setminus \mathcal{O}_0$. Since the ordering among suboptimal arms is irrelevant, the sets $\mathcal{O}_m(\lambda)$ for $m \in \{0, 1, 2, \ldots, |\mathcal{X}|\}$ form a valid partition of $\mathcal{O}$ for each $\lambda$. Accordingly, the alternative set can be further written as
$$\mathrm{Alt}(\mu) = \bigcup_{m \,:\, \mu \notin \mathcal{O}_m(\mu)} \mathcal{O}_m(\mu) = \mathcal{O} \setminus \mathcal{O}_{m^*(\mu)} = \mathcal{O} \setminus \mathcal{O}_0. \tag{EC.2.10}$$
Given a bandit instance $\mu$, we consider a total of $|\mathcal{X}(\mu)| + 1$ hypotheses, defined as
$$H_0 : \big(\mu \in \mathcal{O}_0(\mu)\big),\quad H_1 : \big(\mu \in \mathcal{O}_1(\mu)\big),\ \ldots,\ H_{|\mathcal{X}|} : \big(\mu \in \mathcal{O}_{|\mathcal{X}|}(\mu)\big). \tag{EC.2.11}$$
By substituting the true mean vector $\mu$ with its empirical estimate $\hat{\mu}(t)$, the SGLR test becomes data-dependent, relying on the empirical means at each time step $t$. Consequently, the hypotheses tested at time $t$ are also data-dependent. If $\hat{\mu}(t) \in \mathcal{O}$, we define $\hat{m}(t) \coloneqq m^*(\hat{\mu}(t))$ as the index of the partition element to which $\hat{\mu}(t)$ belongs, that is, $\hat{\mu}(t) \in \mathcal{O}_{\hat{m}(t)}$. If instead $\hat{\mu}(t) \notin \mathcal{O}$, we set $\hat{\Lambda}_{t,x} = 0$ for all culprits $x$, meaning no hypothesis test is conducted at that time step, and the process continues. In practice, when $\hat{\mu}(t) \notin \mathcal{O}$, the algorithm can revert to uniform exploration. Since the true mean vector $\mu \in \mathcal{O}$ and $\mathcal{O}$ is assumed to be an open set, the law of large numbers guarantees that $\hat{\mu}(t)$ will eventually re-enter the parameter space, i.e., $\hat{\mu}(t) \in \mathcal{O}$ after sufficiently many samples.
We run | ๐ณ | time-varying SGLR tests in parallel, each testing ๐ป 0 against ๐ป ๐ for ๐ โ [ | ๐ณ โ ( ๐ ^ โ ( ๐ก ) ) | ] . The procedure stops when any of these tests rejects ๐ป 0 , indicating that the corresponding alternative set is empirically the easiest to reject. At this point, the accepted hypothesis for ๐ ^ โ ( ๐ก ) โ ๐ is identified as the most likely to be correct. Given a sequence of exploration rates ( ๐ฝ ^ ๐ก โ ( ๐ฟ ) ) ๐ก โ โ , the SGLR stopping rule in the pure exploration setting is defined as follows:
$$\tau_\delta \coloneqq \inf\Big\{t \in \mathbb{N} : \min_{x \in \mathcal{X}(\hat{\mu}(t))} \hat{\Lambda}_{t,x} > \hat{\beta}_t(\delta)\Big\} = \inf\Big\{t \in \mathbb{N} : t \cdot \min_{x \in \mathcal{X}(\hat{\mu}(t))} C_x\big(\omega_t; \hat{\mu}(t)\big) > \hat{\beta}_t(\delta)\Big\}, \tag{EC.2.12}$$
where the SGLR statistic ฮ ^ ๐ก , ๐ฅ is defined in equation (EC.2.4). The testing process closely resembles the classical approach, except that the hypotheses are data-dependent and evolve over time.
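The stopping check itself is a one-line comparison once the per-culprit statistics are available. A minimal sketch follows; the threshold function here is a stylized exploration rate of the form $\log((1+\log t)/\delta)$, and the paper's calibrated choice of $\hat{\beta}_t(\delta)$ may differ.

```python
import numpy as np

def threshold(t, delta):
    # A stylized exploration rate beta_t(delta) = log((1 + log t) / delta);
    # an assumption for illustration, not the paper's calibrated constant.
    return np.log((1.0 + np.log(t)) / delta)

def should_stop(glr_by_culprit, beta):
    # Stopping rule (EC.2.12): stop once every culprit's GLR statistic
    # clears the threshold, i.e. once the smallest one does.
    return min(glr_by_culprit) > beta

glr = [4.1, 5.7, 9.3]    # illustrative per-culprit statistics at time t = 100
stop = should_stop(glr, threshold(100, 0.05))
```

With these numbers the smallest statistic has not yet cleared the threshold, so sampling continues.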
We also provide insights from the perspective of the confidence region. Specifically, the event $\{\min_{x \in \mathcal{X}(\hat{\mu}(t))} \hat{\Lambda}_{t,x} > \hat{\beta}_t(\delta)\}$ coincides with $\{\mathcal{D}_{t,\delta} \subseteq \mathcal{O}_{\hat{m}(t)}\}$, where $\mathcal{D}_{t,\delta}$ denotes the confidence region of the mean vector, given by
$$\mathcal{D}_{t,\delta} \coloneqq \Big\{\lambda : \sum_{a=1}^{K} N_a(t)\,\mathrm{KL}\big(\hat{\mu}_a(t), \lambda_a\big) \le \hat{\beta}_t(\delta)\Big\}. \tag{EC.2.13}$$
Notably, although the confidence region ๐ ๐ก , ๐ฟ is defined for the mean vector ๐ , under the assumption of linear structure, it is equivalent to the confidence region for the parameter ๐ฝ , as discussed in Sections 2.2 and EC.4.1. This equivalence allows the stopping rule to be interpreted as follows: the algorithm halts once the confidence region for the mean vector fully lies within a single partition region, aligning with the graphical interpretation of optimal allocation given in Section EC.4.2.
Appendix EC.3 Difference between G-Optimal Design and $\mathcal{XY}$-Optimal Design
Figure EC.3.1: Key Distinction Between G-Optimal and $\mathcal{XY}$-Optimal Designs in Terms of Stopping Criteria
Note. The contraction behavior and rate of the confidence region for the parameter $\hat{\theta}_t$ differ: under G-optimal sampling (left), the region shrinks uniformly in all directions, whereas under $\mathcal{XY}$-optimal design (right), it contracts more strategically along directions critical for classification, allowing the confidence region to enter a decision region more rapidly and trigger earlier stopping.
Returning to the intuition and visual explanation in Section EC.4.1 and Figure EC.4.1, we see that G-optimal sampling inevitably leads to inefficient sampling. Figure EC.3.1 illustrates the advantage of adopting the $\mathcal{XY}$-optimal design from the perspective of the stopping rule: the algorithm terminates once the yellow confidence region for $\hat{\theta}$ enters one of the decision regions $\mathcal{S}_i$ ($i = 1, \ldots, 7$). Unlike the isotropic shrinkage of the confidence region under G-optimal design, the $\mathcal{XY}$-optimal design guides the region to contract more aggressively in directions critical for distinguishing arms. Rather than uniformly estimating $\theta$, it prioritizes reducing uncertainty along directions that matter most for classification, leading to more efficient exploration.
Appendix EC.4 Lower Bound for All $\epsilon$-Best Arms Identification in Linear Bandits
This section provides both geometric insights regarding the stopping condition and formal proofs establishing the lower bound for identifying all ๐ -best arms in linear bandit settings.
EC.4.1 Visual Illustration of the Stopping Condition
Figure EC.4.1: Visual Illustration of Identifying the Best Arm vs. Identifying All $\epsilon$-Best Arms.
Note. (a) Stopping occurs when the confidence region $\mathcal{D}_{t,\delta}$ for the estimated parameter $\hat{\theta}_t$ contracts entirely within one of the three decision regions $\mathcal{C}_i$ at a certain time step $t$. The boundaries between regions are defined by the hyperplanes $\theta^\top(a_i - a_j) = 0$. Each dot represents an arm. (b) In the case of identifying all $\epsilon$-best arms, the regions overlap. (c) Due to these overlaps, the space is partitioned into seven distinct decision regions, increasing the difficulty of identification.
The stopping condition is formulated as a hypothesis test conducted as data is collected, which can be interpreted as the process of the parameter confidence region contracting into one of the decision regions (i.e., the set of parameters that yield the same decision). A more detailed version of Figure 1 is shown in Figure EC.4.1, which illustrates the core idea of the stopping condition for identifying all ๐ -best arms in linear bandits. The key distinction between identifying all ๐ -best arms and identifying the single best arm lies in how the decision regions are partitioned.
Figure EC.4.1(a) illustrates the best arm identification process in linear bandits, where $i^* = i^*(\mu)$ represents the arm with the largest mean value for each bandit instance $\mu$. Let $\mathcal{C}_i \coloneqq \{\theta \in \mathbb{R}^d \mid a_i = a^*\}$ be the set of parameters $\theta$ for which $a_i$ ($i = 1, 2, 3$) is the optimal arm. Each $\mathcal{C}_i$ forms a cone defined by the intersection of half-spaces.
Figure EC.4.1(b) represents an intermediate step, demonstrating the transition from best arm identification to identifying all $\epsilon$-best arms. Let $\mathcal{C}_i^\epsilon \coloneqq \{\theta \in \mathbb{R}^d \mid i \in G_\epsilon(\theta)\}$ be the set of parameters $\theta$ that include arm $i$ in the set $G_\epsilon(\theta)$; $\mathcal{C}_i^\epsilon$ is similarly defined by an intersection of half-spaces. The overlap of these three regions forms the decision regions $\mathcal{S}_i$ ($i = 1, 2, \ldots, 7$) in Figure EC.4.1(c), which correspond to the seven distinct types of $\epsilon$-best arm sets. Moreover, the BAI process in (a) is a special case of the $\epsilon$-best arms identification in (c), occurring when the gap $\epsilon$ approaches 0. The following statement provides a detailed explanation of how the decision regions in Figure EC.4.1(c) are constructed.
In a $d$-dimensional Euclidean space $\mathbb{R}^d$, hyperplanes $P_{i,j}$ can be defined for any pair of arms, partitioning the space into the following half-spaces:
$$H_{i,j}^{+} \coloneqq \{\theta \in \mathbb{R}^d \mid (a_i - a_j)^\top \theta > \epsilon\}, \tag{EC.4.1}$$
$$H_{i,j}^{-} \coloneqq \{\theta \in \mathbb{R}^d \mid (a_i - a_j)^\top \theta \le \epsilon\}. \tag{EC.4.2}$$
The hyperplane $P_{i,j}$, which separates the half-spaces $H_{i,j}^{+}$ and $H_{i,j}^{-}$, is perpendicular to the direction vector $a_i - a_j$. The intersections of these half-spaces over $i, j \in [K]$, $i \ne j$, partition the space into distinct regions. Each region corresponds to a solution set, representing all $\epsilon$-best arms if the true parameter $\theta$ lies within that region. As the gap $\epsilon$ approaches 0, the hyperplanes $P_{i,j}$ on both sides of the decision boundaries move closer together, causing some decision regions in Figure EC.4.1(c) to shrink until they vanish. The relationship between the true parameter $\theta$ and the half-spaces determines which arms belong to the good set $G_\epsilon(\theta)$.
For the case of three arms, the space is divided into three overlapping regions, as shown in Figure EC.4.1(b). These regions further generate seven ($2^3 - 1 = 7$) decision regions, denoted $\mathcal{S}_1$ through $\mathcal{S}_7$, which are summarized in Table EC.4.1.
Table EC.4.1: Three Overlapping Regions and Seven Decision Regions in the Case of Three Arms

| Decision Region | Set Expression | $\epsilon$-Best Arms |
| --- | --- | --- |
| $\mathcal{S}_1$ | $\mathcal{C}_1^\epsilon \cap (\mathcal{C}_2^\epsilon)^c \cap (\mathcal{C}_3^\epsilon)^c$ | $\{1\}$ |
| $\mathcal{S}_2$ | $\mathcal{C}_1^\epsilon \cap (\mathcal{C}_2^\epsilon)^c \cap \mathcal{C}_3^\epsilon$ | $\{1, 3\}$ |
| $\mathcal{S}_3$ | $(\mathcal{C}_1^\epsilon)^c \cap (\mathcal{C}_2^\epsilon)^c \cap \mathcal{C}_3^\epsilon$ | $\{3\}$ |
| $\mathcal{S}_4$ | $(\mathcal{C}_1^\epsilon)^c \cap \mathcal{C}_2^\epsilon \cap \mathcal{C}_3^\epsilon$ | $\{2, 3\}$ |
| $\mathcal{S}_5$ | $(\mathcal{C}_1^\epsilon)^c \cap \mathcal{C}_2^\epsilon \cap (\mathcal{C}_3^\epsilon)^c$ | $\{2\}$ |
| $\mathcal{S}_6$ | $\mathcal{C}_1^\epsilon \cap \mathcal{C}_2^\epsilon \cap (\mathcal{C}_3^\epsilon)^c$ | $\{1, 2\}$ |
| $\mathcal{S}_7$ | $\mathcal{C}_1^\epsilon \cap \mathcal{C}_2^\epsilon \cap \mathcal{C}_3^\epsilon$ | $\{1, 2, 3\}$ |
The stopping condition verifies whether the confidence region ๐ ๐ก , ๐ฟ is entirely contained within a specific decision region ๐ ๐ . It is important to note that, due to the definition of the confidence region and the property that ๐ฝ ^ ๐ก โ ๐ฝ as ๐ก โ โ , any algorithm that continually samples all arms will eventually meet the stopping condition.
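Which decision region $\mathcal{S}_m$ of Table EC.4.1 the parameter falls in is determined entirely by which arms end up in $G_\epsilon(\theta)$. A small sketch (the arm vectors and $\theta$ are illustrative choices of ours):

```python
import numpy as np

def eps_best_set(theta, arms, eps):
    # G_eps(theta): indices (1-based, to match the table) of arms whose
    # mean a_i^T theta is within eps of the best mean.
    means = [a @ theta for a in arms]
    best = max(means)
    return {i + 1 for i, m in enumerate(means) if m >= best - eps}

# Three arms in R^2; the region of Table EC.4.1 follows from the set below.
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
theta = np.array([1.0, 0.95])
G = eps_best_set(theta, arms, eps=0.5)   # all three arms: region S_7
```

Shrinking $\epsilon$ moves the same $\theta$ into a smaller region, mirroring how the overlapping regions in Figure EC.4.1(c) collapse as $\epsilon \to 0$.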
EC.4.2 Optimal Allocation
The goal of the sampling policy is to construct an allocation sequence that drives the confidence set $\mathcal{D}_{t,\delta}$ into the optimal region $\mathcal{S}^*$ as efficiently as possible. Geometrically, this entails selecting arms that cause $\mathcal{D}_{t,\delta}$ to contract into the optimal cone $\mathcal{S}^*$ with minimal sampling effort. The condition $\mathcal{D}_{t,\delta} \subseteq \mathcal{S}^*$ can be expressed as
$$\text{For all } i \in G_\epsilon(\theta),\ j \ne i,\ j \notin G_\epsilon(\theta),\ \text{and all } \lambda \in \mathcal{D}_{t,\delta},\ \text{we have } \lambda \in H_{j,i}^{-}\ \text{and}\ \lambda \in H_{1,j}^{+}. \tag{EC.4.3}$$
In words, every parameter vector ๐ that remains plausible must preserve all required pairwise orderings: no ๐ -optimal arm ๐ can be overtaken by any rival ๐ , and the best arm 1 must stay ahead of every suboptimal arm ๐ . Equivalently, no ๐ โ ๐ ๐ก , ๐ฟ is allowed to flip these comparisons.
The relationships $\lambda \in H_{j,i}^{-}$ and $\lambda \in H_{1,j}^{+}$ are equivalent to the following inequalities, obtained by adding terms to both sides of the inequalities defining (EC.4.1) and (EC.4.2) and reorganizing:
$$\begin{cases} (a_i - a_j)^\top(\theta - \lambda) \le (a_i - a_j)^\top \theta + \epsilon \\ (a_1 - a_j)^\top(\theta - \lambda) < (a_1 - a_j)^\top \theta - \epsilon. \end{cases} \tag{EC.4.4}$$
Now, we focus on the confidence region, which can be constructed following the Cauchy–Schwarz inequality and the definition of the confidence ellipse for the parameter $\theta$ in equation (10):
$$\mathcal{D}_{t,\delta} = \Big\{\lambda \in \mathbb{R}^d \;\Big|\; \forall i \in G_\epsilon(\theta),\ j \ne i,\ j \notin G_\epsilon(\theta):\ \begin{cases}(a_i - a_j)^\top(\theta - \lambda) \le \|a_i - a_j\|_{J_t^{-1}} B_{t,\delta}\\ (a_1 - a_j)^\top(\theta - \lambda) \le \|a_1 - a_j\|_{J_t^{-1}} B_{t,\delta}\end{cases}\Big\}, \tag{EC.4.5}$$
where $J_t$ is the information matrix as defined in equation (9), and the confidence bound for the parameter $\theta$, i.e., $B_{t,\delta}$, can either be the fixed confidence bound shown in Proposition 2.3 or the looser adaptive confidence bound introduced in Abbasi-Yadkori et al. (2011). The stopping condition $\mathcal{D}_{t,\delta} \subseteq \mathcal{S}^*$ can thus be reformulated: for each $i \in G_\epsilon(\theta)$, $j \ne i$, $j \notin G_\epsilon(\theta)$, we have
$$\begin{cases} \|a_i - a_j\|_{J_t^{-1}} B_{t,\delta} \le (a_i - a_j)^\top \theta + \epsilon \\ \|a_1 - a_j\|_{J_t^{-1}} B_{t,\delta} \le (a_1 - a_j)^\top \theta - \epsilon. \end{cases} \tag{EC.4.6}$$
If equation (EC.4.6) holds, then for any $\lambda \in \mathcal{D}_{t,\delta}$, equation (EC.4.4) also holds, implying that $\mathcal{D}_{t,\delta} \subseteq \mathcal{S}^*$. Given that $(a_i - a_j)^\top \theta + \epsilon > 0$ and $(a_1 - a_j)^\top \theta - \epsilon > 0$, and rearranging (EC.4.6), the oracle allocation strategy is determined as follows:
$$\{X_{A_n}\}^* \in \arg\min_{\{X_{A_n}\}} \max_{i \in G_\epsilon(\theta),\, j \ne i,\, j \notin G_\epsilon(\theta)} \max\left\{ \frac{2\|a_i - a_j\|^2_{J_t^{-1}}}{\big(a_i^\top \theta - a_j^\top \theta + \epsilon\big)^2},\ \frac{2\|a_1 - a_j\|^2_{J_t^{-1}}}{\big(a_1^\top \theta - a_j^\top \theta - \epsilon\big)^2} \right\}, \tag{EC.4.7}$$
where $\{X_{A_n}\} = (X_{A_1}, X_{A_2}, \ldots, X_{A_n}) \in \mathcal{A}^n$ is a sequence of sampled arms and $\{X_{A_n}\}^*$ is the oracle allocation strategy. However, it is more convenient to express the sample complexity of the problem in terms of the continuous allocation proportion $\omega$ instead of the discrete allocation sequence $\{X_{A_n}\}$. We then have the following optimal allocation proportion:
$$\omega^* \in \arg\min_{\omega \in \mathcal{S}_K} \max_{i \in G_\epsilon(\theta),\, j \ne i,\, j \notin G_\epsilon(\theta)} \max\left\{ \frac{2\|a_i - a_j\|^2_{J_\omega^{-1}}}{\big(a_i^\top \theta - a_j^\top \theta + \epsilon\big)^2},\ \frac{2\|a_1 - a_j\|^2_{J_\omega^{-1}}}{\big(a_1^\top \theta - a_j^\top \theta - \epsilon\big)^2} \right\}, \tag{EC.4.8}$$
where $\mathcal{S}_K$ denotes the $K$-dimensional probability simplex, and $J_\omega = \sum_{a=1}^{K} \omega_a\, a a^\top$ is the weighted information matrix, analogous to the Fisher information matrix (Chaloner and Verdinelli 1995). The intuition behind this optimal allocation strategy is as follows: to satisfy the inequality in (EC.4.6) as quickly as possible, the ratio of the left-hand side to the right-hand side should be minimized, leading to the outer minimization over the allocation probability $\omega$. The middle maximization accounts for the fact that the inequalities in (EC.4.6) must hold for each possible triple $(i, j, k)$, which specifies the culprit set $\mathcal{X} = \mathcal{X}(\mu)$ (see Proposition 3.1 for a formal definition). Furthermore, since both inequalities in (EC.4.6) need to be satisfied, we take the inner maximization to enforce this requirement.
EC.4.3 Proof of the Lower Bound in Theorem 3.2
Proof EC.1
Proof. Deriving the lower bound requires constructing the largest possible alternative set and identifying the most challenging instance for distinguishing the arms. Specifically, to construct an alternative problem instance that possesses a distinct set of $\epsilon$-best arms from the original problem instance $\mu$, we systematically perturb the expected rewards of specific arms. This can be achieved through two constructions, as further detailed in Figure EC.4.2:
- Lowering an $\epsilon$-best arm: This approach decreases the expected reward of a currently $\epsilon$-best arm $i$, while simultaneously increasing the expected reward of another arm $j$.
- Elevating a non-$\epsilon$-best arm: This approach increases the expected reward of a non-$\epsilon$-best arm $k$, while concurrently reducing the expected rewards of the $\ell$ top-performing arms.
Figure EC.4.2: Illustration of Constructing the Alternative Set for the All $\epsilon$-Best Arms Identification Lower Bound
Note. (a) Transforming an arm within $G_\epsilon(\mu)$ into a non-$\epsilon$-best arm by decreasing the mean of an $\epsilon$-best arm $i$ while increasing the mean of another arm $j$. (b) Transforming an arm outside $G_\epsilon(\mu)$ into an $\epsilon$-best arm by increasing the mean of a non-$\epsilon$-best arm $k$ and simultaneously decreasing the means of higher-performing arms.
For all $\epsilon$-best arms identification in linear bandits, we establish the correct answer, the culprit set, and the alternative set as follows. The general idea of the proof is to build these alternatives explicitly and show which one is the hardest to distinguish. Recall that $\mu_i = \langle \theta, a_i \rangle$. For notational simplicity, we use $\mu$ to represent the mean vector induced by the true parameter $\theta$.
- The correct answer $G_\epsilon(\theta) \coloneqq \{i : \langle \theta, a_i \rangle \ge \max_j \langle \theta, a_j \rangle - \epsilon\}$ for all $i \in [K]$. These are the $\epsilon$-best arms under the true parameter.
- The culprit set $\mathcal{X}(\mu) \coloneqq \{(i, j, k, \ell) : i \in G_\epsilon(\mu),\ j \ne i,\ k \notin G_\epsilon(\mu),\ \ell \in \{1, 2, \ldots, K-1\}\}$. The culprit set identifies potential sources of error. Intuitively, each tuple corresponds to the two types of mistakes that can affect the output, as mentioned above.
- The alternative set $\mathrm{Alt}(\mu) \coloneqq \bigcup_{x \in \mathcal{X}(\mu)} \mathrm{Alt}_x(\mu)$, where the form of $\mathrm{Alt}_x(\mu)$ is given by
$$\mathrm{Alt}_x(\mu) = \mathrm{Alt}_{i,j}(\mu) \cup \mathrm{Alt}_{k,\ell}(\mu) \quad \text{for all } x \in \mathcal{X}(\mu). \tag{EC.4.9}$$
So for each culprit $x$, we build two kinds of alternative instances, each capable of flipping the answer. The two parts of the alternative set can be expressed as
$$\mathrm{Alt}_{i,j}(\mu) \coloneqq \{\lambda : \langle \lambda, a_i - a_j \rangle < -\epsilon\} \quad \text{for all } x \in \mathcal{X}(\mu) \tag{EC.4.10}$$
and
$$\mathrm{Alt}_{k,\ell}(\mu) \coloneqq \{\lambda : \lambda^\top a_1 = \cdots = \lambda^\top a_\ell = \lambda^\top a_k + \epsilon \ge \lambda^\top a_{\ell+1}\} \quad \text{for all } x \in \mathcal{X}(\mu), \tag{EC.4.11}$$
which are the same as shown in the figure.
The first part $\mathrm{Alt}_{i,j}(\mu)$ forces a good arm $i$ to become at least $\epsilon$ worse than arm $j$, causing $i$ to fall out of the $\epsilon$-best set. In contrast, the second part ensures that a suboptimal arm is no more than $\epsilon$ worse than the top $\ell$ arms with identical mean values, thereby including it in the $\epsilon$-best set.
For a given culprit $x = (i, j, k, \ell)$, the first component of the alternative set can be written as
$$\mathrm{Alt}_{i,j}(\mu) = \Big\{\lambda_{i,j}(\theta, \omega, \alpha) \;\Big|\; \lambda_{i,j}(\theta, \omega, \alpha) = \theta - \frac{y_{i,j}^\top \theta + \epsilon + \alpha}{\|y_{i,j}\|^2_{J_\omega^{-1}}}\, J_\omega^{-1} y_{i,j} \Big\}, \tag{EC.4.12}$$
where $J_\omega = \sum_{a=1}^{K} \omega_a\, a a^\top$ is related to the Fisher information matrix (Chaloner and Verdinelli 1995), $y_{i,j} = a_i - a_j$, and $\alpha > 0$ is a perturbation parameter. The form of this alternative set is derived from the solution to the following optimization problem:
$$\arg\min_{\lambda \in \mathbb{R}^d} \ \|\lambda - \theta\|^2_{J_\omega} \tag{EC.4.13}$$
$$\text{s.t.}\quad y_{i,j}^\top \lambda = -\epsilon - \alpha. \tag{EC.4.14}$$
Here, $\alpha$ is introduced to construct the specific alternative set, providing an explicit expression for the alternative parameter $\lambda$. By letting $\alpha \to 0$, we realize the infimum in equations (EC.2.5) and (EC.2.6). Then, we have
$$y_{i,j}^\top \lambda_{i,j}(\theta, \omega, \alpha) = -\epsilon - \alpha < -\epsilon, \tag{EC.4.15}$$
which satisfies the condition for being a parameter in the alternative set defined in equation (EC.4.10). Under the Gaussian distribution assumption, i.e., $r_a \sim \mathcal{N}(a^\top \theta, 1)$, the KL divergence between the mean value associated with the true parameter $\theta$ and that associated with the alternative parameter $\lambda_{i,j}$ is
$$\mathrm{KL}\big(a^\top \theta,\ a^\top \lambda_{i,j}\big) = \frac{\big(a^\top(\theta - \lambda_{i,j}(\theta, \omega, \alpha))\big)^2}{2} = \frac{\big(y_{i,j}^\top \theta + \epsilon + \alpha\big)^2}{2\,\big(\|y_{i,j}\|^2_{J_\omega^{-1}}\big)^2}\; y_{i,j}^\top J_\omega^{-1}\, a a^\top\, J_\omega^{-1} y_{i,j}. \tag{EC.4.16}$$
The last equation follows from substituting the expression for ๐ ๐ , ๐ given in (EC.4.12). Then, by Proposition 3.1 and the definition of the ๐ถ ๐ฅ function in equation (EC.2.5), the lower bound can be expressed as
$$\frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/2.4\delta)} \ge \min_{\omega \in \mathcal{S}_K} \max_{\lambda \in \mathrm{Alt}(\mu)} \frac{1}{\sum_{a=1}^{K} \omega_a\,\mathrm{KL}\big(a^\top \theta,\ a^\top \lambda\big)} \ge \min_{\omega \in \mathcal{S}_K} \max_{x \in \mathcal{X}} \sup_{\alpha > 0} \frac{1}{\sum_{a=1}^{K} \omega_a\,\mathrm{KL}\big(a^\top \theta,\ a^\top \lambda_{i,j}(\theta, \omega, \alpha)\big)} = \min_{\omega \in \mathcal{S}_K} \max_{x \in \mathcal{X}} \frac{2\|y_{i,j}\|^2_{J_\omega^{-1}}}{\big(y_{i,j}^\top \theta + \epsilon\big)^2}. \tag{EC.4.17}$$
Note that letting $\alpha \to 0$ establishes the result by realizing the supremum over $\alpha > 0$. From this lower bound, we can also define the $C_x$ function for all $\epsilon$-best arms identification in linear bandits, which is
$$C_{i,j}(\omega) = \frac{\big(y_{i,j}^\top \theta + \epsilon\big)^2}{2\|y_{i,j}\|^2_{J_\omega^{-1}}}. \tag{EC.4.18}$$
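The closed form (EC.4.18) is straightforward to evaluate numerically. A small sketch with illustrative two-dimensional arms follows; the tiny ridge added so $J_\omega$ is invertible is our own device.

```python
import numpy as np

def c_ij(omega, arms, theta, i, j, eps):
    # Transportation cost (EC.4.18): the information needed to push the
    # good arm i at least eps below arm j under allocation omega.
    d = arms.shape[1]
    J = sum(w * np.outer(a, a) for w, a in zip(omega, arms)) + 1e-10 * np.eye(d)
    y = arms[i] - arms[j]
    norm_sq = y @ np.linalg.solve(J, y)   # ||y||^2 in the J^{-1} norm
    return (y @ theta + eps) ** 2 / (2 * norm_sq)

arms = np.array([[1.0, 0.0], [0.0, 1.0]])
omega = np.array([0.5, 0.5])
theta = np.array([1.0, 0.2])
cost = c_ij(omega, arms, theta, i=0, j=1, eps=0.1)
```

Larger values of $C_{i,j}(\omega)$ mean the culprit is easier to rule out under that allocation.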
For the second part, i.e., Alt ๐ , โ โ ( ๐ ) , we assume that the mean value of arm 1 (i.e., the best arm) remains fixed. This assumption partially sacrifices the completeness of the alternative set but allows the alternative set to be expressed in an explicit and symmetric form. This form remains tight.
Consequently, the culprit set is redefined as $\mathcal{X}(\mu) \coloneqq \{(i, j, k) : i \in G_\epsilon(\mu),\ j \ne i,\ k \notin G_\epsilon(\mu)\}$, and the alternative set can be decomposed as $\mathrm{Alt}_x(\mu) = \mathrm{Alt}_{i,j}(\mu) \cup \mathrm{Alt}_k(\mu)$ for a given culprit $x = (i, j, k)$. Different from equation (EC.4.11), the second part of the alternative set becomes
$$\mathrm{Alt}_k(\mu) \coloneqq \{\lambda : \langle \lambda, a_1 - a_k \rangle < \epsilon\} \quad \text{for all } x \in \mathcal{X}(\mu). \tag{EC.4.19}$$
Similarly, $\mathrm{Alt}_k(\mu)$ can be constructed as the set in equation (EC.4.12), given by
$$\mathrm{Alt}_k(\mu) = \Big\{\lambda_k(\theta, \omega, \alpha) \;\Big|\; \lambda_k(\theta, \omega, \alpha) = \theta - \frac{y_k^\top \theta - \epsilon + \alpha}{\|y_k\|^2_{J_\omega^{-1}}}\, J_\omega^{-1} y_k \Big\}, \tag{EC.4.20}$$
where $y_k = a_1 - a_k$ and $\alpha > 0$. Then we have
$$y_k^\top \lambda_k(\theta, \omega, \alpha) = \epsilon - \alpha < \epsilon. \tag{EC.4.21}$$
Hence, by following a derivation analogous to that of the first part, we obtain
$$C_k(\omega) = \frac{\big(y_k^\top \theta - \epsilon\big)^2}{2\|y_k\|^2_{J_\omega^{-1}}}. \tag{EC.4.22}$$
Deriving the tightest lower bound focuses on constructing alternative bandit instances that thoroughly challenge each $\epsilon$-best arm configuration. The minimum of the information function $C_x$ over all culprits $x \in \mathcal{X}(\mu)$ is then given by
$$\min\Big\{\min_{x \in \mathcal{X}} C_{i,j}(\omega),\ \min_{x \in \mathcal{X}} C_k(\omega)\Big\}. \tag{EC.4.23}$$
Then, by Proposition 3.1 and the definition of the $C_x$ function, combining equations (EC.4.18) and (EC.4.22), the final lower bound can be expressed as
$$\frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/2.4\delta)} \ge \min_{\omega \in \mathcal{S}_K} \max_{(i,j,k) \in \mathcal{X}} \max\Big\{ \frac{2\|y_{i,j}\|^2_{J_\omega^{-1}}}{\big(y_{i,j}^\top \theta + \epsilon\big)^2},\ \frac{2\|y_k\|^2_{J_\omega^{-1}}}{\big(y_k^\top \theta - \epsilon\big)^2} \Big\} \tag{EC.4.24}$$
$$= \min_{\omega \in \mathcal{S}_K} \max_{(i,j,k) \in \mathcal{X}} \max\Big\{ \frac{2\|a_i - a_j\|^2_{J_\omega^{-1}}}{\big(a_i^\top \theta - a_j^\top \theta + \epsilon\big)^2},\ \frac{2\|a_1 - a_k\|^2_{J_\omega^{-1}}}{\big(a_1^\top \theta - a_k^\top \theta - \epsilon\big)^2} \Big\}. \tag{EC.4.25}$$
\Halmos

Appendix EC.5 Proof of Theorem 4.1
The proof of Theorem 4.1 consists of two parts. First, we establish the upper bound given in equation (EC.5.4). Then, we derive the final refined bound in equation (26), removing the summation in our first result.
To establish the result involving the summation in equation (EC.5.4), we define $r_{\max} \coloneqq \min\{r : G_r = G_\epsilon\}$ as the round in which the last $\epsilon$-best arm is added to $G_r$. We divide the total number of samples into two phases: samples collected up to round $r_{\max}$ and those collected from round $r_{\max} + 1$ until termination (if the algorithm does not stop at $r_{\max}$). The proof proceeds in eight steps, as outlined below.
EC.5.1 Preliminary: Clean Events $\mathcal{E}_1$ and $\mathcal{E}_2$
We begin by defining two high-probability events, denoted $\mathcal{E}_1$ and $\mathcal{E}_2$, which we refer to as clean events:
$$\mathcal{E}_1 \coloneqq \Big\{\forall r \in \mathbb{N},\ \forall i \in \mathcal{A}_I(r-1):\ |\hat{\mu}_i(r) - \mu_i| \le C_{\delta/K}(r)\Big\}. \tag{EC.5.1}$$
This event captures the condition that, for each round ๐ , the estimated means ๐ ^ ๐ โ ( ๐ ) of all arms ๐ in the active set ๐ ๐ผ โ ( ๐ โ 1 ) lie within a confidence bound ๐ถ ๐ฟ / ๐พ โ ( ๐ ) of their true means ๐ ๐ . It ensures uniform control over estimation errors across all arms and rounds with a prescribed confidence level.
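The confidence schedule behind $\mathcal{E}_1$ spends failure probability $\delta/(Kr(r+1))$ per arm and round, which telescopes to at most $\delta$ in total. The check below verifies this with exact rational arithmetic; the function name is ours.

```python
from fractions import Fraction

def failure_budget(K, delta, R):
    # Total probability mass spent by the schedule delta/(K r (r+1)) over
    # K arms and rounds 1..R; sum_{r} 1/(r(r+1)) telescopes to R/(R+1).
    total = Fraction(0)
    d = Fraction(delta)
    for r in range(1, R + 1):
        total += K * d / (K * r * (r + 1))
    return total

# The budget never exceeds delta, and approaches it as R grows.
spent = failure_budget(K=10, delta=Fraction(1, 20), R=1000)
```

This is exactly the telescoping step used in the union bound (EC.5.6) below.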
Then, we introduce the following lemma to provide the confidence region for the estimated parameter $\hat{\theta}$.
Lemma EC.1 (Lattimore and Szepesvári (2020))
Let the confidence level $\delta \in (0, 1)$. For each arm $a \in \mathcal{A}$, we have
$$\mathbb{P}\Big\{|\langle \hat{\theta} - \theta, a \rangle| \ge \sqrt{2\|a\|^2_{J_t^{-1}} \log(2/\delta)}\Big\} \le \delta, \tag{EC.5.2}$$
where $J_t$ represents the information matrix defined in Section 2.2.
Let $\omega_r$ denote the optimal allocation proportion computed for round $r$. The corresponding sampling budget is then determined according to equation (21). Thus, we obtain the information matrix in each round, denoted $J_r$:
$$J_r = \sum_{a \in \mathrm{Supp}(\omega_r)} N_a(r)\, a a^\top \succeq \frac{2}{\epsilon_r^2} \log\Big(\frac{2Kr(r+1)}{\delta}\Big)\, J(\omega_r). \tag{EC.5.3}$$
Then, by applying Lemma EC.1 with $\delta$ replaced by $\delta/(Kr(r+1))$, we obtain that for any arm $a \in \mathcal{A}(r-1)$, with probability at least $1 - \delta/(Kr(r+1))$,
$$|\langle \hat{\theta}_r - \theta, a \rangle| \le \sqrt{2\|a\|^2_{J_r^{-1}} \log\Big(\frac{2Kr(r+1)}{\delta}\Big)} = \sqrt{2\, a^\top J_r^{-1} a\, \log\Big(\frac{2Kr(r+1)}{\delta}\Big)} \le \sqrt{2\, a^\top \Big(\frac{\epsilon_r^2}{2\log(2Kr(r+1)/\delta)}\, J(\omega_r)^{-1}\Big) a\, \log\Big(\frac{2Kr(r+1)}{\delta}\Big)} \le \epsilon_r, \tag{EC.5.4}$$
where the third inequality follows from the matrix inequality in (EC.5.3), using the auxiliary result in Lemma EC.1, and the final inequality is derived from Lemma EC.5.
Thus, by applying the standard result of the G-optimal design in equation (EC.5.1), we define the confidence radius associated with the clean event $\mathcal{E}_1$ as
$$C_{\delta/K}(r) \coloneqq \epsilon_r. \tag{EC.5.5}$$
Then, we have
$$\mathbb{P}(\mathcal{E}_1^c) = \mathbb{P}\Big\{\exists r \in \mathbb{N},\ \exists i \in \mathcal{A}_I(r-1):\ |\hat{\mu}_i(r) - \mu_i| > C_{\delta/K}(r)\Big\} \le \sum_{r=1}^{\infty} \mathbb{P}\Big\{\exists i \in \mathcal{A}_I(r-1):\ |\hat{\mu}_i(r) - \mu_i| > \epsilon_r\Big\} \le \sum_{r=1}^{\infty} \sum_{i=1}^{K} \frac{\delta}{Kr(r+1)} = \delta, \tag{EC.5.6}$$
where the second inequality follows from the union bound, and the third combines the union bound with equation (EC.5.1). Therefore, we obtain
$$\mathbb{P}(\mathcal{E}_1) \ge 1 - \delta. \tag{EC.5.7}$$
Now consider another event, $\mathcal{E}_2$, which characterizes the gaps between different arms, defined as
$$\mathcal{E}_2 \coloneqq \Big\{\forall r \in \mathbb{N},\ \forall i \in G_\epsilon,\ \forall j \in \mathcal{A}_I(r-1):\ |(\hat{\mu}_i(r) - \hat{\mu}_j(r)) - (\mu_i - \mu_j)| \le 2\epsilon_r\Big\}. \tag{EC.5.8}$$
This event ensures the gap between the estimated mean rewards of arms ๐ and ๐ is uniformly close to their true gap for all rounds ๐ , arms ๐ โ ๐บ ๐ , and arms ๐ in the active set ๐ ๐ผ โ ( ๐ โ 1 ) .
By (EC.5.1), for $i, j \in \mathcal{A}_I(r-1)$, we have
$$\mathbb{P}\big\{|(\hat{\mu}_i - \hat{\mu}_j) - (\mu_i - \mu_j)| > 2\epsilon_r \mid \mathcal{E}_1\big\} \le \mathbb{P}\big\{|\hat{\mu}_i - \mu_i| + |\hat{\mu}_j - \mu_j| > 2\epsilon_r \mid \mathcal{E}_1\big\} \le \mathbb{P}\big\{|\hat{\mu}_i - \mu_i| > \epsilon_r \mid \mathcal{E}_1\big\} + \mathbb{P}\big\{|\hat{\mu}_j - \mu_j| > \epsilon_r \mid \mathcal{E}_1\big\} = 0, \tag{EC.5.9}$$
which means
$$\mathbb{P}(\mathcal{E}_2 \mid \mathcal{E}_1) = 1. \tag{EC.5.10}$$

EC.5.2 Step 1: Correctness
Recall that $G_\epsilon$ denotes the true set of $\epsilon$-best arms, and $G_r$ is the empirical good set identified by LinFACT-G in round $r$. Under event $\mathcal{E}_1$, we first show that if there exists a round $r$ such that $G_r \cup B_r = [K]$, then it must be that $G_r = G_\epsilon$. This implies that under the clean event $\mathcal{E}_1$, the stopping condition of LinFACT-G ensures correct identification of $G_\epsilon$.
Lemma EC.2
Under event $\mathcal{E}_1$, we have $G_r \subseteq G_\epsilon$ for all rounds $r \in \mathbb{N}$.
Proof EC.3
Proof. We first show that $1 \in \mathcal{A}_I(r)$ for all $r \in \mathbb{N}$; that is, the best arm is never eliminated from the active set $\mathcal{A}(r-1)$ in any round $r$ on the event $\mathcal{E}_1$. For any arm $i$, we have
$$\hat{\mu}_1(r) + \epsilon_r \ge \mu_1 \ge \mu_i \ge \hat{\mu}_i(r) - \epsilon_r > \hat{\mu}_i(r) - \epsilon_r - \epsilon, \tag{EC.5.11}$$
which implies that $\hat{\mu}_1(r) + \epsilon_r > \max_{j \in \mathcal{A}_I(r-1)} \hat{\mu}_j - \epsilon_r - \epsilon = L_r$ and $\hat{\mu}_1(r) + \epsilon_r \ge \max_{j \in \mathcal{A}_I(r-1)} \hat{\mu}_j(r) - \epsilon_r$. These inequalities confirm that arm 1 will not be removed from the active set $\mathcal{A}_I(r-1)$.
Secondly, we show that at all rounds $r$, $\mu_1 - \epsilon \in [L_r, U_r]$. Since arm 1 never exits $\mathcal{A}_I(r-1)$,
$$U_r = \max_{j \in \mathcal{A}_I(r-1)} \hat{\mu}_j + \epsilon_r - \epsilon \ge \hat{\mu}_1(r) + \epsilon_r - \epsilon \ge \mu_1 - \epsilon, \tag{EC.5.12}$$
and for any arm $j$,
$$\mu_1 - \epsilon \ge \mu_j - \epsilon \ge \hat{\mu}_j - \epsilon_r - \epsilon. \tag{EC.5.13}$$
Hence, taking the maximum over $j \in \mathcal{A}_I(r-1)$, we obtain
$$\mu_1 - \epsilon \ge \max_{j \in \mathcal{A}_I(r-1)} \hat{\mu}_j - \epsilon_r - \epsilon = L_r. \tag{EC.5.14}$$
Next, we show that $G_r \subseteq G_\epsilon$ for all $r \ge 1$. By contradiction, if $G_r \not\subseteq G_\epsilon$, then there exist $r \in \mathbb{N}$ and $i \in G_\epsilon^c \cap G_r$ such that
$$\mu_i \ge \hat{\mu}_i - \epsilon_r \ge U_r \ge \mu_1 - \epsilon > \mu_i, \tag{EC.5.15}$$
which forms a contradiction. \Halmos
Lemma EC.4
Under event $\mathcal{E}_1$, we have $B_r \subseteq G_\epsilon^c$ for all rounds $r \in \mathbb{N}$.
Proof EC.5
Proof. Similarly, we proceed by contradiction. Consider the case that a good arm from $G_\epsilon$ is added to $B_r$ at some round $r$. By definition, $B_0 = \emptyset$ and $B_{r-1} \subseteq B_r$ for all $r$. Then there must exist some $r \in \mathbb{N}$ and an $i \in G_\epsilon$ such that $i \in B_r$ and $i \notin B_{r-1}$. Following line 6 of Algorithm 4.1, this occurs if and only if
$$\max_{j \in \mathcal{A}_I(r-1)} \hat{\mu}_j - \hat{\mu}_i > 2\epsilon_r + \epsilon. \tag{EC.5.16}$$
On the clean event $\mathcal{E}_1$, the above implies that there exists $j \in \mathcal{A}_I(r-1)$ such that
$$\mu_j - \mu_i + 2\epsilon_r \ge \hat{\mu}_j - \hat{\mu}_i > 2\epsilon_r + \epsilon, \tag{EC.5.17}$$
which yields $\mu_j - \mu_i > \epsilon$, contradicting that $i \in G_\epsilon$. \Halmos
Lemma EC.2 and Lemma EC.4 above show that under $\mathcal{E}_1$, $G_r \cup B_r = [K]$ leads to $G_r = G_\epsilon$ and $B_r = G_\epsilon^c$. Since $\mathbb{P}\{\mathcal{E}_1\} \ge 1 - \delta$, if LinFACT terminates, it returns the correct answer with probability at least $1 - \delta$. Up to now, we have demonstrated the correctness of the algorithm's stopping rule. We now focus on bounding the sample complexity in the following parts.
EC.5.3 Step 2: Total Sample Count
To bound the expected sampling budget, we decompose the total number of samples into two parts: the budget used before the round when the last arm is added to the good set $G_r$, and the budget used after this round until termination. The total number of samples drawn by the algorithm can thus be represented as
$$N \le \sum_{r=1}^{\infty} \mathbb{1}\big[G_{r-1} \cup B_{r-1} \ne [K]\big] \sum_{a \in \mathcal{A}(r-1)} N_a(r) = \sum_{r=1}^{\infty} \mathbb{1}\big[G_{r-1} \ne G_\epsilon\big]\, \mathbb{1}\big[G_{r-1} \cup B_{r-1} \ne [K]\big] \sum_{a \in \mathcal{A}(r-1)} N_a(r) \tag{EC.5.18}$$
$$+ \sum_{r=1}^{\infty} \mathbb{1}\big[G_{r-1} = G_\epsilon\big]\, \mathbb{1}\big[G_{r-1} \cup B_{r-1} \ne [K]\big] \sum_{a \in \mathcal{A}(r-1)} N_a(r). \tag{EC.5.19}$$
In the following parts, we will bound these two terms separately. We begin by analyzing a single arm ๐ โ ๐บ ๐ , tracking which sets it belongs to as the rounds progress.
For $i \in G_\epsilon$, let $R_i$ denote the number of rounds in which arm $i$ is sampled before being added to $G_r$ in line 8 of Algorithm 4.1. For $i \in G_\epsilon^c$, let $R_i$ denote the number of rounds in which arm $i$ is sampled before being removed from $\mathcal{A}_I(r-1)$ and added to $B_r$ in line 6 of Algorithm 4.1. Then, by definition, $R_i$ can be expressed as
$$R_i = \min\Big\{r : \begin{cases} i \in G_r & \text{if } i \in G_\epsilon \\ i \notin \mathcal{A}_I(r) & \text{if } i \in G_\epsilon^c \end{cases}\Big\} = \min\Big\{r : \begin{cases} \hat{\mu}_i - \epsilon_r \ge U_r & \text{if } i \in G_\epsilon \\ \hat{\mu}_i + \epsilon_r \le L_r & \text{if } i \in G_\epsilon^c \end{cases}\Big\}. \tag{EC.5.20}$$

Bound $R_i$.
We define a helper function $h(x) \coloneqq \log_2(1/|x|)$ to facilitate the proof. Observe that in round $r$, if $r \ge h(x)$, then $\epsilon_r = 2^{-r} \le |x|$.
Lemma EC.6
For any $i \in G_\epsilon$, we have $R_i \le \lceil h(0.25(\epsilon - \Delta_i)) \rceil$, where $\Delta_i \coloneqq \mu_1 - \mu_i$.
Proof EC.7
Proof. Note that for $i \in G_\epsilon$, the inequality $4\epsilon_r < \epsilon - (\mu_1 - \mu_i)$ holds when $r = \lceil h(0.25(\epsilon - \Delta_i)) \rceil$. This implies that for all $j \in \mathcal{A}_I(r-1)$,
$$\hat{\mu}_i - \epsilon_r \ge \mu_i - 2\epsilon_r > \mu_1 + 2\epsilon_r - \epsilon \ge \mu_j + 2\epsilon_r - \epsilon \ge \hat{\mu}_j + \epsilon_r - \epsilon. \tag{EC.5.21}$$
Thus, in particular, $\hat{\mu}_i - \epsilon_r > \max_{j \in \mathcal{A}_I(r-1)} \hat{\mu}_j + \epsilon_r - \epsilon = U_r$. Therefore, we conclude that $R_i \le \lceil h(0.25(\epsilon - \Delta_i)) \rceil$. \Halmos
After determining the latest round in which any arm $a \in G_\epsilon$ is added to $G_r$, we define $R_{\max}$ as the round by which all good arms have been added to $G_r$, i.e., $R_{\max} := \min\{r : G_r \supseteq G_\epsilon\} = \max_{a \in G_\epsilon} R_a$.
Lemma EC.8
$R_{\max} \le \lceil h(0.25\,\alpha_\epsilon) \rceil$.
Proof EC.9
Proof. Recall that $\alpha_\epsilon = \min_{a \in G_\epsilon} \mu_a - \mu_1 + \epsilon = \min_{a \in G_\epsilon} \epsilon - \Delta_a$. By Lemma EC.6, $R_a \le \lceil h(0.25(\epsilon - \Delta_a)) \rceil$ for $a \in G_\epsilon$. Furthermore, $h(\cdot)$ is monotonically decreasing in $|x|$. Then
$$R_{\max} = \max_{a \in G_\epsilon} R_a \le \max_{a \in G_\epsilon} \lceil h(0.25(\epsilon - \Delta_a)) \rceil = \left\lceil h\left(0.25 \min_{a \in G_\epsilon} (\epsilon - \Delta_a)\right) \right\rceil = \lceil h(0.25\,\alpha_\epsilon) \rceil. \Halmos$$
Bound the Total Samples up to $R_{\max}$.
The total number of samples up to round $R_{\max}$ is $\sum_{r=1}^{R_{\max}} \sum_{a \in \mathcal{A}(r-1)} N_r(a)$. By line 4 of Algorithm 4.1, we have
$$N_r \le \frac{2d}{\epsilon_r^2} \log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}.$$
(EC.5.22)
Hence,
$$\begin{aligned}
\sum_{r=1}^{R_{\max}} N_r = \sum_{r=1}^{R_{\max}} \sum_{a \in \mathcal{A}(r-1)} N_r(a)
&\le \sum_{r=1}^{R_{\max}} \left( \frac{2d}{\epsilon_r^2} \log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2} \right) \\
&\le \sum_{r=1}^{R_{\max}} 2^{2r+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max} \\
&\le c\,2^{2R_{\max}+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max},
\end{aligned}$$
(EC.5.23)
where $c$ is a universal constant, and recall that $R_{\max} \le \lceil h(0.25\,\alpha_\epsilon) \rceil$. The second line follows from equation (EC.5.22), using $2d/\epsilon_r^2 = 2^{2r+1}d$ and enlarging $r(r+1)$ to $R_{\max}(R_{\max}+1)$. The last line replaces each term $2^{2r+1}$ with the largest term $2^{2R_{\max}+1}$ and absorbs the geometric sum into the finite constant $c$.
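The last step of (EC.5.23) absorbs the geometric sum $\sum_{r=1}^{R_{\max}} 2^{2r+1}$ into a constant multiple of its largest term. A small numerical check (illustrative, using $c = 2$; the true constant $4/3$ is even smaller):

```python
# sum_{r=1}^{R} 2^(2r+1) = (2/3)(4^(R+1) - 4) <= (4/3) * 2^(2R+1) < 2 * 2^(2R+1)
for R in range(1, 20):
    total = sum(2 ** (2 * r + 1) for r in range(1, R + 1))
    assert total <= 2 * 2 ** (2 * R + 1)
```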
Next, we bound the two terms in equation (EC.5.18) and equation (EC.5.19) separately. The first term, which represents the samples taken before round $R_{\max}$, can be bounded as follows.
Bound (EC.5.18).
Recall that $R_{\max} \le \lceil h(0.25\,\alpha_\epsilon) \rceil$ is the round by which $G_{R_{\max}} \supseteq G_\epsilon$. We have
$$\begin{aligned}
\sum_{r=1}^{\infty} \mathbb{1}\left[G_{r-1} \not\supseteq G_\epsilon\right]\mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right]\sum_{a \in \mathcal{A}(r-1)} N_r(a)
&= \sum_{r=1}^{R_{\max}} \mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right]\sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&\le c\,2^{2R_{\max}+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max},
\end{aligned}$$
(EC.5.24)
where the first equality follows from the definition of $R_{\max}$, since the indicator $\mathbb{1}[G_{r-1} \not\supseteq G_\epsilon]$ vanishes for all $r > R_{\max}$ and no additional samples are counted in this term after round $R_{\max}$. The second inequality follows directly from equation (EC.5.23).
Bound (EC.5.19).
Next, we have
$$\begin{aligned}
\sum_{r=1}^{\infty} \mathbb{1}\left[G_{r-1} \supseteq G_\epsilon\right]\mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right]\sum_{a \in \mathcal{A}(r-1)} N_r(a)
&= \sum_{r=R_{\max}+1}^{\infty} \mathbb{1}\left[B_{r-1} \not\supseteq G_\epsilon^c\right]\sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&\le \sum_{r=R_{\max}+1}^{\infty} \left|G_\epsilon^c \setminus B_{r-1}\right| \sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&= \sum_{r=R_{\max}+1}^{\infty} \sum_{a \in G_\epsilon^c} \mathbb{1}\left[a \notin B_{r-1}\right] \sum_{b \in \mathcal{A}(r-1)} N_r(b) \\
&= \sum_{a \in G_\epsilon^c} \sum_{r=R_{\max}+1}^{\infty} \mathbb{1}\left[a \notin B_{r-1}\right] \sum_{b \in \mathcal{A}(r-1)} N_r(b) \\
&\le \sum_{a \in G_\epsilon^c} \sum_{r=1}^{\infty} \mathbb{1}\left[a \notin B_{r-1}\right] \sum_{b \in \mathcal{A}(r-1)} N_r(b) \\
&\le \sum_{a \in G_\epsilon^c} \sum_{r=1}^{\infty} \mathbb{1}\left[a \notin B_{r-1}\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right),
\end{aligned}$$
(EC.5.25)
where the first equality follows from the definition of $R_{\max}$, since $G_{r-1} \supseteq G_\epsilon$ holds for all rounds beyond $R_{\max}$. The second line uses the fact that as long as $G_\epsilon^c \setminus B_{r-1}$ is nonempty, the corresponding indicator is 1 and is thus bounded by the cardinality. The third line writes this cardinality as a sum of indicators over the arms in $G_\epsilon^c$. The remaining lines exchange the order of the double summation, enlarge the summation range over the rounds $r$, and apply equation (EC.5.22).
EC.5.4 Step 3: Bound the Expected Total Samples of LinFACT
We now take expectations over the total number of samples drawn for the given bandit instance $\nu$. These expectations are conditioned on the high-probability event $\mathcal{E}_1$:
$$\begin{aligned}
\mathbb{E}_\nu\left[N_G \mid \mathcal{E}_1\right]
&\le \sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right] \mid \mathcal{E}_1\right] \sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&= \sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[G_{r-1} \not\supseteq G_\epsilon\right]\mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right] \mid \mathcal{E}_1\right] \sum_{a \in \mathcal{A}(r-1)} N_r(a)
+ \sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[G_{r-1} \supseteq G_\epsilon\right]\mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right] \mid \mathcal{E}_1\right] \sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&\overset{\text{(EC.5.24)}}{\le} c\,2^{2R_{\max}+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max}
+ \sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[G_{r-1} \supseteq G_\epsilon\right]\mathbb{1}\left[G_{r-1} \cup B_{r-1} \neq [K]\right] \mid \mathcal{E}_1\right] \sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&\overset{\text{(EC.5.25)}}{\le} c\,2^{2R_{\max}+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max}
+ \sum_{a \in G_\epsilon^c} \sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right).
\end{aligned}$$
(EC.5.26)
The first line follows from the stopping condition of LinFACT-G and the additivity of expectation. The second line applies the decomposition established in Step 2. The subsequent lines use the results from equations (EC.5.24) and (EC.5.25).
Next, we bound the last term. For a given $a \in G_\epsilon^c$ and round $r$, we first bound the probability that $a \notin B_r$. By the Borel–Cantelli lemma, this implies that the probability of $a$ never being added to any $B_r$ is zero.
Lemma EC.10
For $a \in G_\epsilon^c$ and $r \ge \left\lceil \log_2\frac{4}{\Delta_a - \epsilon} \right\rceil$, we have $\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_r\right] \mid \mathcal{E}_1\right] = 0$.
Proof EC.11
Proof.
First, we have for any $a \in G_\epsilon^c$,
$$\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_r\right] \mid \mathcal{E}_1\right]
= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right] \,\Big|\, \mathcal{E}_1,\ a \notin B_j\ (j \in \{1, 2, \dots, r-1\})\right]
\le \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right] \,\Big|\, \mathcal{E}_1\right],$$
(EC.5.27)
where the first line follows from the fact that arm $a \in G_\epsilon^c$ has not been added into $B_j$ in any round from 1 to $r-1$ on the event $\{a \notin B_r\}$, and the second line accounts for the conditional expectation.
If $a \in B_{r-1}$, then $a \in B_r$ by definition. Otherwise, if $a \notin B_{r-1}$, then under event $\mathcal{E}_1$, for $a \in G_\epsilon^c$ and $r \ge \left\lceil \log_2\frac{4}{\Delta_a - \epsilon} \right\rceil$, we have
$$\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge \Delta_a - 2^{-r+1} \ge \epsilon + 2\epsilon_r,$$
(EC.5.28)
which implies that $a \in B_r$ by line 6 of Algorithm 4.1. In other words, we have established the correctness of the algorithm when line 6 is triggered in Step 1. We now specify the exact condition under which line 6 occurs. In particular, under event $\mathcal{E}_1$, if $a \notin B_{r-1}$, then for all $a \in G_\epsilon^c$ and $r \ge \left\lceil \log_2\frac{4}{\Delta_a - \epsilon} \right\rceil$, we have
$$\mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a > 2\epsilon_r + \epsilon\right] \,\Big|\, a \notin B_{r-1}, \mathcal{E}_1\right] = 1.$$
(EC.5.29)
Therefore, for all $a \in G_\epsilon^c$ and $r \ge \left\lceil \log_2\frac{4}{\Delta_a - \epsilon} \right\rceil$, we have
$$\begin{aligned}
&\mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right] \,\Big|\, \mathcal{E}_1\right] \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \in B_{r-1}\right] \,\Big|\, \mathcal{E}_1\right]
+ \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \notin B_{r-1}\right] \,\Big|\, \mathcal{E}_1\right] \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \notin B_{r-1}\right] \,\Big|\, \mathcal{E}_1\right] \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \notin B_{r-1}\right] \,\Big|\, a \notin B_{r-1}, \mathcal{E}_1\right] \mathbb{P}\left(a \notin B_{r-1} \mid \mathcal{E}_1\right) \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le 2\epsilon_r + \epsilon\right] \,\Big|\, a \notin B_{r-1}, \mathcal{E}_1\right] \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \\
&= 0,
\end{aligned}$$
(EC.5.30)
where the second line follows from the additivity of expectation. The third line uses the deterministic result that $\mathbb{1}[a \notin B_r]\,\mathbb{1}[a \in B_{r-1}] = 0$ (since $B_{r-1} \subseteq B_r$), so the term on the event $\{a \in B_{r-1}\}$ contributes nothing to the quantity of interest. The fourth line is the decomposition based on the conditional expectation, in which the term conditioned on $\{a \in B_{r-1}\}$ vanishes because it contains the factor $\mathbb{1}[a \notin B_{r-1}]$. The fifth line uses the fact that the expectation of an indicator function is simply the corresponding probability. The last line follows from the result in equation (EC.5.29). The lemma then follows by combining this result with equation (EC.5.27). \Halmos
Lemma EC.12
For $a \in G_\epsilon^c$, we have
$$\sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right)
\le \frac{d(d+1)}{2}\log_2\left(\frac{8}{\Delta_a - \epsilon}\right) + \frac{c \cdot 256\,d}{(\Delta_a - \epsilon)^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\Delta_a - \epsilon}\right).$$
(EC.5.31)
Proof EC.13
Proof.
$$\begin{aligned}
&\sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right) \\
&= \sum_{r=1}^{\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right)
+ \sum_{r=\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil + 1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right) \\
&= \sum_{r=1}^{\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right] \left( \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right) + 0 \\
&\le \sum_{r=1}^{\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil} \left( 2^{2r+1} d \log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right) \\
&\le \frac{d(d+1)}{2}\log_2\left(\frac{8}{\Delta_a - \epsilon}\right) + 2d\log\left(\frac{2K}{\delta}\right)\sum_{r=1}^{\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil} 2^{2r} + 4d\sum_{r=1}^{\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil} 2^{2r}\log(r+1) \\
&\le \frac{d(d+1)}{2}\log_2\left(\frac{8}{\Delta_a - \epsilon}\right) + \frac{c \cdot 256\,d}{(\Delta_a - \epsilon)^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\Delta_a - \epsilon}\right),
\end{aligned}$$
(EC.5.32)
where $c$ is a sufficiently large universal constant. The second line decomposes the summation across the rounds. The third line uses Lemma EC.10 to eliminate the tail sum. The fourth line bounds the expectation of the indicator by its maximum value of 1 and applies equation (EC.5.22). The remaining lines replace the round $r$ with its maximum value $\lceil \log_2(4/(\Delta_a - \epsilon)) \rceil$ and bound the truncated geometric sums. \Halmos
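The constant 256 in (EC.5.32) comes from bounding the truncated geometric sum: with $R = \lceil \log_2(4/g) \rceil$ for $g = \Delta_a - \epsilon$, one has $4^R \le 64/g^2$, hence $\sum_{r=1}^{R} 2^{2r} \le \tfrac{4}{3} 4^R \le 256/(3g^2)$. A quick numerical check (illustrative values):

```python
import math

# For g = Delta_a - eps in (0, 1], R = ceil(log2(4/g)) satisfies
# 4^R <= 64 / g^2, hence sum_{r=1}^{R} 4^r <= (4/3) * 4^R <= 256 / (3 g^2).
for g in [1.0, 0.5, 0.2, 0.07, 0.01, 0.0003]:
    R = math.ceil(math.log2(4.0 / g))
    assert 4 ** R <= 64.0 / g ** 2 + 1e-6
    assert sum(4 ** r for r in range(1, R + 1)) <= 256.0 / (3 * g ** 2) + 1e-6
```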
Summarizing the aforementioned results, we have
$$\mathbb{E}_\nu\left[N_G \mid \mathcal{E}_1\right] \le c\,2^{2R_{\max}+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max}
+ \sum_{a \in G_\epsilon^c}\sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right]\left(\frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right),$$
(EC.5.33)
where
$$R_{\max} \le \lceil h(0.25\,\alpha_\epsilon) \rceil \le \log_2\frac{8}{\alpha_\epsilon}.$$
(EC.5.34)
Also, we have
$$\sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right]\left(\frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right)
\le \frac{d(d+1)}{2}\log_2\left(\frac{8}{\Delta_a - \epsilon}\right) + \frac{c \cdot 256\,d}{(\Delta_a - \epsilon)^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\Delta_a - \epsilon}\right).$$
(EC.5.35)
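The second inequality in (EC.5.34) is just the elementary fact that a ceiling costs at most one extra power of two: $\lceil \log_2(4/x) \rceil \le \log_2(4/x) + 1 = \log_2(8/x)$. A one-line numerical check (illustrative values):

```python
import math

# ceil(log2(4/x)) <= log2(8/x), since ceil(y) <= y + 1 and log2(8/x) = log2(4/x) + 1
for x in [1.5, 1.0, 0.4, 0.11, 0.006]:
    assert math.ceil(math.log2(4.0 / x)) <= math.log2(8.0 / x) + 1e-9
```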
Then, we arrive at the final result as follows:
$$\begin{aligned}
\mathbb{E}_\nu\left[N_G \mid \mathcal{E}_1\right]
&\le c\,2^{2R_{\max}+1} d \log\left(\frac{2K R_{\max}(R_{\max}+1)}{\delta}\right) + \frac{d(d+1)}{2} R_{\max}
+ \sum_{a \in G_\epsilon^c}\sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_{r-1}\right] \mid \mathcal{E}_1\right]\left(\frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right) \\
&\le \frac{c \cdot 256\,d}{\alpha_\epsilon^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\alpha_\epsilon}\right) + \frac{d(d+1)}{2}\log_2\frac{8}{\alpha_\epsilon}
+ \sum_{a \in G_\epsilon^c}\left(\frac{d(d+1)}{2}\log_2\left(\frac{8}{\Delta_a - \epsilon}\right) + \frac{c \cdot 256\,d}{(\Delta_a - \epsilon)^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\Delta_a - \epsilon}\right)\right),
\end{aligned}$$
(EC.5.36)
which can be further expressed as
$$\mathbb{E}\left[N_G \mid \mathcal{E}_1\right] = \mathcal{O}\left(d\,\alpha_\epsilon^{-2}\log\left(\frac{K}{\delta}\log\left(\alpha_\epsilon^{-2}\right)\right) + d^2\log\left(\alpha_\epsilon^{-1}\right)\right)
+ \sum_{a \in G_\epsilon^c}\mathcal{O}\left(d^2\log\left((\Delta_a - \epsilon)^{-1}\right) + d\,(\Delta_a - \epsilon)^{-2}\log\left(\frac{K}{\delta}\log\left((\Delta_a - \epsilon)^{-2}\right)\right)\right).$$
(EC.5.37)

EC.5.5 A Refined Bound
The result obtained in the previous steps involves a summation over the set $G_\epsilon^c$, which can be further improved by eliminating this summation. Rather than focusing solely on the round $R_{\max}$, by which all arms in $G_\epsilon$ have been classified into $G_r$, we now consider the round at which all classifications are complete, i.e., $G_r \cup B_r = [K]$.
Lemma EC.14
For $a \in G_\epsilon$ and $r \ge \left\lceil \log_2\frac{4}{\epsilon - \Delta_a} \right\rceil$, we have $\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin G_r\right] \mid \mathcal{E}_1\right] = 0$.
Proof EC.15
Proof. First, for any $a \in G_\epsilon$,
$$\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin G_r\right] \mid \mathcal{E}_1\right]
= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right] \,\Big|\, \mathcal{E}_1,\ a \notin G_j\ (j \in \{1, 2, \dots, r-1\})\right]
\le \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right] \,\Big|\, \mathcal{E}_1\right].$$
(EC.5.38)
If $a \in G_{r-1}$, then $a \in G_r$ by definition. Otherwise, if $a \notin G_{r-1}$, then under event $\mathcal{E}_1$, for $a \in G_\epsilon$ and $r \ge \left\lceil \log_2\frac{4}{\epsilon - \Delta_a} \right\rceil$, we have
$$\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le \mu_{\arg\max_{b \in \mathcal{A}(r-1)} \hat\mu_b} - \mu_a + 2^{-r+1} \le \Delta_a + 2^{-r+1} \le \epsilon - 2\epsilon_r,$$
(EC.5.39)
which implies that $a \in G_r$ by line 8 of Algorithm 4.1. In other words, we now specify the exact condition under which line 8 occurs. In particular, under event $\mathcal{E}_1$, if $a \notin G_{r-1}$, then for all $a \in G_\epsilon$ and $r \ge \left\lceil \log_2\frac{4}{\epsilon - \Delta_a} \right\rceil$, we have
$$\mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \le -2\epsilon_r + \epsilon\right] \,\Big|\, a \notin G_{r-1}, \mathcal{E}_1\right] = 1.$$
(EC.5.40)
Deterministically, $\mathbb{1}\left[a \notin G_r\right] \cdot \mathbb{1}\left[a \in G_{r-1}\right] = 0$, since $G_{r-1} \subseteq G_r$. Therefore,
$$\begin{aligned}
&\mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right] \,\Big|\, \mathcal{E}_1\right] \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \in G_{r-1}\right] \,\Big|\, \mathcal{E}_1\right]
+ \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \notin G_{r-1}\right] \,\Big|\, \mathcal{E}_1\right] \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \notin G_{r-1}\right] \,\Big|\, \mathcal{E}_1\right] \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right]\mathbb{1}\left[a \notin G_{r-1}\right] \,\Big|\, a \notin G_{r-1}, \mathcal{E}_1\right]\mathbb{P}\left(a \notin G_{r-1} \mid \mathcal{E}_1\right) \\
&= \mathbb{E}_\nu\left[\mathbb{1}\left[\max_{b \in \mathcal{A}(r-1)} \hat\mu_b - \hat\mu_a \ge -2\epsilon_r + \epsilon\right] \,\Big|\, a \notin G_{r-1}, \mathcal{E}_1\right]\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin G_{r-1}\right] \mid \mathcal{E}_1\right] \\
&= 0,
\end{aligned}$$
(EC.5.41)
where the second line comes from the additivity of expectation. The third line follows from the deterministic result that $\mathbb{1}[a \notin G_r]\,\mathbb{1}[a \in G_{r-1}] = 0$, so the term on the event $\{a \in G_{r-1}\}$ contributes nothing. The fourth line applies a decomposition based on the conditional expectation, in which the term conditioned on $\{a \in G_{r-1}\}$ vanishes because it contains the factor $\mathbb{1}[a \notin G_{r-1}]$. The fifth line uses the fact that the expectation of an indicator function equals the corresponding probability. The final line follows from the result in equation (EC.5.40). The lemma then follows by combining this result with equation (EC.5.38). \Halmos
Lemma EC.16
The round at which all classifications are complete and the final answer is returned satisfies $R_{\text{upper}} = \max\left\{\left\lceil \log_2\frac{4}{\alpha_\epsilon} \right\rceil, \left\lceil \log_2\frac{4}{\beta_\epsilon} \right\rceil\right\}$.
Proof EC.17
Proof. Combining the result of Lemma EC.14 with that of Lemma EC.10, we have the following: for $a \in G_\epsilon^c$ and $r \ge \left\lceil \log_2\frac{4}{\Delta_a - \epsilon} \right\rceil$, we have $\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin B_r\right] \mid \mathcal{E}_1\right] = 0$; for $a \in G_\epsilon$ and $r \ge \left\lceil \log_2\frac{4}{\epsilon - \Delta_a} \right\rceil$, we have $\mathbb{E}_\nu\left[\mathbb{1}\left[a \notin G_r\right] \mid \mathcal{E}_1\right] = 0$. Since $\alpha_\epsilon = \min_{a \in G_\epsilon}(\epsilon - \Delta_a)$ and $\beta_\epsilon = \min_{a \in G_\epsilon^c}(\Delta_a - \epsilon)$, for any round $r \ge R_{\text{upper}}$, all arms have been classified into either $G_r$ or $B_r$, marking the termination of the algorithm. \Halmos
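The termination round simply combines the two classification deadlines. The following sketch (illustrative values) checks that at $r = R_{\text{upper}}$ the radius satisfies $4\epsilon_r \le \min(\alpha_\epsilon, \beta_\epsilon)$, so the conditions of both Lemma EC.10 and Lemma EC.14 hold for every arm:

```python
import math

def r_upper(alpha: float, beta: float) -> int:
    # R_upper = max{ceil(log2(4/alpha)), ceil(log2(4/beta))}
    return max(math.ceil(math.log2(4.0 / alpha)),
               math.ceil(math.log2(4.0 / beta)))

for alpha, beta in [(0.3, 0.1), (0.05, 0.4), (0.02, 0.02)]:
    R = r_upper(alpha, beta)
    # At round R, 4 * eps_R <= min(alpha, beta): every arm is classified.
    assert 4.0 * 2.0 ** (-R) <= min(alpha, beta) + 1e-12
```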
Lemma EC.18
For the expected sample complexity conditioned on the high-probability event $\mathcal{E}_1$, we have
$$\mathbb{E}_\nu\left[N_{\mathcal{X}\mathcal{Y}} \mid \mathcal{E}_1\right] \le c\,\max\left\{\frac{256\,d}{\alpha_\epsilon^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\alpha_\epsilon}\right),\ \frac{256\,d}{\beta_\epsilon^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\beta_\epsilon}\right)\right\} + \frac{d(d+1)}{2}R_{\text{upper}},$$
(EC.5.42)
where $c$ is a universal constant and $R_{\text{upper}} = \max\left\{\left\lceil \log_2\frac{4}{\alpha_\epsilon} \right\rceil, \left\lceil \log_2\frac{4}{\beta_\epsilon} \right\rceil\right\}$.
Proof EC.19
Proof. We have
$$\begin{aligned}
\mathbb{E}_\nu\left[N_{\mathcal{X}\mathcal{Y}} \mid \mathcal{E}_1\right]
&\le \sum_{r=1}^{\infty} \mathbb{E}_\nu\left[\mathbb{1}\left[G_r \cup B_r \neq [K]\right] \mid \mathcal{E}_1\right] \sum_{a \in \mathcal{A}(r-1)} N_r(a) \\
&\le \sum_{r=1}^{R_{\text{upper}}}\left(c\,2^{2r+1} d \log\left(\frac{2Kr(r+1)}{\delta}\right) + \frac{d(d+1)}{2}\right) \\
&\le \frac{d(d+1)}{2}R_{\text{upper}} + 2d\log\left(\frac{2K}{\delta}\right)\sum_{r=1}^{R_{\text{upper}}} 2^{2r} + 4d\sum_{r=1}^{R_{\text{upper}}} 2^{2r}\log(r+1) \\
&\le 4\log\left[\frac{2K}{\delta}\left(R_{\text{upper}} + 1\right)\right]\sum_{r=1}^{R_{\text{upper}}} d\,2^{2r} + \frac{d(d+1)}{2}R_{\text{upper}}
\end{aligned}$$
(EC.5.43)
$$\le c\,\max\left\{\frac{256\,d}{\alpha_\epsilon^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\alpha_\epsilon}\right),\ \frac{256\,d}{\beta_\epsilon^2}\log\left(\frac{2K}{\delta}\log_2\frac{16}{\beta_\epsilon}\right)\right\} + \frac{d(d+1)}{2}R_{\text{upper}}.$$
(EC.5.44)
The second line follows from Lemma EC.16. Then, we have
$$\mathbb{E}\left[N_G \mid \mathcal{E}\right] = \mathcal{O}\left(d\,\Delta^{-2}\log\left(\frac{K}{\delta}\log_2\left(\Delta^{-2}\right)\right) + d^2\log\left(\Delta^{-1}\right)\right),$$
(EC.5.45)
where $\Delta = \min(\alpha_\epsilon, \beta_\epsilon)/16$ is the minimum of the two gap parameters (suitably scaled), indicating the difficulty of the problem instance.
\Halmos

Appendix EC.6 Additional Insights into the Algorithm Optimality
From another perspective, we now consider the relationship between the lower bound and the upper bound and give some additional insights into the optimality of the algorithm. This relationship serves as the basis for the derivation of a near-optimal upper bound in Theorem 4.2.
For all $a \in (\mathcal{A}(r-1) \cap G_\epsilon)$ and all $b \in \mathcal{A}(r-1)$, $b \neq a$, in round $r$, we have
$$y_{b,a}^\top\left(\hat\beta_r - \beta\right) \le 2\epsilon_r$$
(EC.6.1)
and
$$y_{b,a}^\top\hat\beta_r - \epsilon \le y_{b,a}^\top\beta + 2\epsilon_r - \epsilon.$$
(EC.6.2)
Lemma EC.1
Define $G_r' := \left\{a : \exists\, b \in \mathcal{A}(r-1),\ b \neq a,\ y_{b,a}^\top\beta > \epsilon - 4\epsilon_r\right\}$. We always have $(\mathcal{A}(r-1) \cap G_\epsilon \cap G_r^c) \subseteq G_r'$.
Proof EC.2
Proof. For $r = 1$, the lemma follows directly from the assumption in Theorem 4.2 that $\max_{a \in [K]} |\mu_1 - \mu_a| \le 2$. For $r \ge 2$, we proceed by contradiction. Suppose $a \in (\mathcal{A}(r-1) \cap G_\epsilon \cap G_r^c) \cap (G_r')^c$; then for every $b \in \mathcal{A}(r-1)$ with $b \neq a$, we have
$$y_{b,a}^\top\beta \le -4\epsilon_r + \epsilon.$$
(EC.6.3)
Hence, using equation (EC.6.2), for every $b \in \mathcal{A}(r-1)$ with $b \neq a$, we have
$$y_{b,a}^\top\hat\beta_r - \epsilon \le -2\epsilon_r,$$
(EC.6.4)
which is exactly the condition for the algorithm to add arm $a$ into $G_r$ at line 8 of the algorithm; this yields a contradiction and completes the argument. Moreover, note that when $r \ge \lceil \log_2\frac{4}{\alpha_\epsilon} \rceil$, we have $G_r' \cap G_\epsilon = \emptyset$. Furthermore, considering that $a \in G_\epsilon$, we have $y_{b,a}^\top\beta - \epsilon \le 0$. \Halmos
There is, however, one exceptional case that invalidates the above derivation. Specifically, when $b$ is the index of the arm with the largest mean value among the active arms, i.e., $b = \arg\max_{b' \in \mathcal{A}(r-1)} \mu_{b'}$, the proof no longer holds. For this situation to occur, it must be the case that $\arg\max_{b' \in \mathcal{A}(r-1)} \mu_{b'} \in G_\epsilon^c$, which requires $\epsilon \le 2\epsilon_r$. Since this condition can only hold for a limited number of rounds, its impact is negligible and can therefore be ignored.
On the other hand, for $a \in (\mathcal{A}(r-1) \cap G_\epsilon^c)$, in any round $r$, we have
$$y_{1,a}^\top\left(\beta - \hat\beta_r\right) \le 2\epsilon_r$$
(EC.6.5)
and
$$y_{1,a}^\top\hat\beta_r - \epsilon \ge y_{1,a}^\top\beta - 2\epsilon_r - \epsilon.$$
(EC.6.6)
Lemma EC.3
Define $B_r' := \left\{a : y_{1,a}^\top\beta - \epsilon < 4\epsilon_r\right\}$. We always have $(\mathcal{A}(r-1) \cap G_\epsilon^c) \subseteq B_r'$.
Proof EC.4
Proof. To establish this, first note that when $r = 1$, the lemma directly follows from the assumption in Theorem 4.2 that $\max_{a \in [K]} |\mu_1 - \mu_a| \le 2$. For $r \ge 2$, using the same contradiction argument, assume that $a \in (\mathcal{A}(r-1) \cap G_\epsilon^c) \cap (B_r')^c$. Then we must have
$$y_{1,a}^\top\beta \ge 4\epsilon_r + \epsilon.$$
(EC.6.7)
Consequently, by equation (EC.6.6), it follows that
$$y_{1,a}^\top\hat\beta_r - \epsilon \ge 2\epsilon_r.$$
(EC.6.8)
This is precisely the condition for the algorithm to add arm $a$ into $B_r$ and eliminate it from $\mathcal{A}(r-1)$, as specified in line 6 of the algorithm. This contradiction leads to the desired result. Moreover, when $r \ge \lceil \log_2\frac{4}{\beta_\epsilon} \rceil$, we have $B_r' \cap G_\epsilon^c = \emptyset$. Finally, for $a \in G_\epsilon^c$, it follows that $y_{1,a}^\top\beta - \epsilon > 0$. \Halmos
We now present a critical lemma that establishes the connection between the lower bound in Theorem 3.2 and the upper bound. Define $u_\mathcal{Y}$ as the gauge of $\mathcal{Y}(\mathcal{X})$, where $\mathcal{X}$ denotes the initial set of all arm vectors. The details are provided in Lemma EC.3.
Lemma EC.5
Considering the lower bound $(\Gamma^*)^{-1}$ in Theorem 3.2, we have
$$(\Gamma^*)^{-1} \ge \frac{u_\mathcal{Y}^2\,L_2}{4\,R_{\text{upper}}\,d\,L_1}\sum_{r=1}^{R_{\text{upper}}} 2^{2r-3}\min_{\lambda \in \mathcal{P}_K}\max_{\substack{a, b \in \mathcal{A}(r-1) \\ b \neq a}}\left\|y_{b,a}\right\|^2_{A_\lambda^{-1}},$$
(EC.6.9)
where $R_{\text{upper}} = \max\left\{\left\lceil \log_2\frac{4}{\alpha_\epsilon} \right\rceil, \left\lceil \log_2\frac{4}{\beta_\epsilon} \right\rceil\right\}$, and $L_1$ and $L_2$ are constants.
Proof EC.6
Proof. When round $r$ exceeds $R_{\text{upper}}$, we have $G_r' \cap G_\epsilon = \emptyset$ and $B_r' \cap G_\epsilon^c = \emptyset$, implying that $G_r \cup B_r = [K]$ and the algorithm terminates. From Theorem 3.2, we obtain
$$\begin{aligned}
(\Gamma^*)^{-1} &= \min_{\lambda \in \mathcal{P}_K}\max_{(a,b,c) \in \mathcal{V}}\max\left\{\frac{2\|y_{b,a}\|^2_{A_\lambda^{-1}}}{\left(y_{b,a}^\top\beta + \epsilon\right)^2},\ \frac{2\|y_{1,c}\|^2_{A_\lambda^{-1}}}{\left(y_{1,c}^\top\beta - \epsilon\right)^2}\right\} \\
&= \min_{\lambda \in \mathcal{P}_K}\max_{r \le R_{\text{upper}}}\max_{a \in G_r' \cap G_\epsilon}\max_{\substack{b \in \mathcal{A}(r-1) \\ b \neq a}}\max_{c \in B_r' \cap G_\epsilon^c}\max\left\{\frac{2\|y_{b,a}\|^2_{A_\lambda^{-1}}}{\left(y_{b,a}^\top\beta + \epsilon\right)^2},\ \frac{2\|y_{1,c}\|^2_{A_\lambda^{-1}}}{\left(y_{1,c}^\top\beta - \epsilon\right)^2}\right\} \\
&\ge \min_{\lambda \in \mathcal{P}_K}\max_{r \le R_{\text{upper}}}\max_{a \in G_r' \cap G_\epsilon}\max_{\substack{b \in \mathcal{A}(r-1) \\ b \neq a}}\max_{c \in B_r' \cap G_\epsilon^c}\max\left\{\frac{2\|y_{b,a}\|^2_{A_\lambda^{-1}}}{(4\epsilon_r)^2},\ \frac{2\|y_{1,c}\|^2_{A_\lambda^{-1}}}{(4\epsilon_r)^2}\right\} \\
&\overset{\text{(i)}}{\ge} \frac{1}{R_{\text{upper}}}\min_{\lambda \in \mathcal{P}_K}\sum_{r=1}^{R_{\text{upper}}}\max_{a \in G_r' \cap G_\epsilon}\max_{\substack{b \in \mathcal{A}(r-1) \\ b \neq a}}\max_{c \in B_r' \cap G_\epsilon^c}\max\left\{\frac{2\|y_{b,a}\|^2_{A_\lambda^{-1}}}{(4\epsilon_r)^2},\ \frac{2\|y_{1,c}\|^2_{A_\lambda^{-1}}}{(4\epsilon_r)^2}\right\} \\
&\overset{\text{(ii)}}{\ge} \frac{1}{R_{\text{upper}}}\sum_{r=1}^{R_{\text{upper}}} 2^{2r-3}\min_{\lambda \in \mathcal{P}_K}\max_{a \in G_r' \cap G_\epsilon}\max_{\substack{b \in \mathcal{A}(r-1) \\ b \neq a}}\max_{c \in B_r' \cap G_\epsilon^c}\max\left\{\|y_{b,a}\|^2_{A_\lambda^{-1}},\ \|y_{1,c}\|^2_{A_\lambda^{-1}}\right\} \\
&\overset{\text{(iii)}}{\ge} \frac{1}{R_{\text{upper}}}\sum_{r=1}^{R_{\text{upper}}} 2^{2r-3}\min_{\lambda \in \mathcal{P}_K}\max_{a \in \mathcal{A}(r-1) \cap G_\epsilon \cap G_r^c}\max_{\substack{b \in \mathcal{A}(r-1) \\ b \neq a}}\max_{c \in \mathcal{A}(r-1) \cap G_\epsilon^c}\max\left\{\|y_{b,a}\|^2_{A_\lambda^{-1}},\ \|y_{1,c}\|^2_{A_\lambda^{-1}}\right\} \\
&\overset{\text{(iv)}}{\ge} \frac{u_\mathcal{Y}^2\,L_2}{d\,L_1}\cdot\frac{1}{R_{\text{upper}}}\sum_{r=1}^{R_{\text{upper}}} 2^{2r-3}\min_{\lambda \in \mathcal{P}_K}\max_{a \in \mathcal{A}(r-1) \cap G_\epsilon \cup \{1\}}\max_{c \in \mathcal{A}(r-1) \cap G_\epsilon^c}\max\left\{\|y_{1,a}\|^2_{A_\lambda^{-1}},\ \|y_{1,c}\|^2_{A_\lambda^{-1}}\right\} \\
&\overset{\text{(v)}}{\ge} \frac{u_\mathcal{Y}^2\,L_2}{4\,R_{\text{upper}}\,d\,L_1}\sum_{r=1}^{R_{\text{upper}}} 2^{2r-3}\min_{\lambda \in \mathcal{P}_K}\max_{\substack{a, b \in \mathcal{A}(r-1) \\ b \neq a}}\|y_{b,a}\|^2_{A_\lambda^{-1}} \\
&= \frac{u_\mathcal{Y}^2\,L_2}{32\,R_{\text{upper}}\,d\,L_1}\sum_{r=1}^{R_{\text{upper}}} 2^{2r}\,\rho_{\mathcal{X}\mathcal{Y}}\left(\mathcal{Y}(\mathcal{X}(r-1))\right),
\end{aligned}$$
(EC.6.10)
where (i) follows from the fact that the maximum of positive numbers is always greater than or equal to their average, and (ii) uses the fact that the minimum of a sum is greater than or equal to the sum of the minimums. (iii) arises from the set inclusion relationships established in Lemma EC.1 and Lemma EC.3. (iv) is a direct consequence of Lemma EC.3 with $b = 1$. Finally, for (v), note that for any $a, b \in \mathcal{A}(r-1)$, we have
$$\max_{\substack{a, b \in \mathcal{A}(r-1) \\ b \neq a}}\|y_{b,a}\|^2_{A_\lambda^{-1}} \le 4\max_{b \in \mathcal{A}(r-1) \cup \{1\}}\|y_{1,b}\|^2_{A_\lambda^{-1}}. \Halmos$$
Moreover, to provide additional insights into the G-optimal design, from equation (EC.5.43), we obtain
$$\mathbb{E}_\nu\left[N \mid \mathcal{E}_1\right] \le 4\log\left[\frac{2K}{\delta}\left(R_{\text{upper}} + 1\right)\right]\sum_{r=1}^{R_{\text{upper}}} d\,2^{2r} + \frac{d(d+1)}{2}R_{\text{upper}}.$$
(EC.6.11)
From inequalities (EC.6.10) and (EC.6.11), it follows that to align the upper and lower bounds, one must establish $d \le c\,\min_{\lambda \in \mathcal{P}_K}\max_{a, b \in \mathcal{A}(r-1),\, b \neq a}\|y_{b,a}\|^2_{A_\lambda^{-1}}$ for some universal constant $c$. However, applying the Kiefer–Wolfowitz theorem from Lemma EC.5 yields only the reverse inequality
$$\min_{\lambda \in \mathcal{P}_K}\max_{\substack{a, b \in \mathcal{A}(r-1) \\ b \neq a}}\|y_{b,a}\|^2_{A_\lambda^{-1}} \le 4\min_{\lambda \in \mathcal{P}_K}\max_{a \in \mathcal{A}(r-1)}\|x_a\|^2_{A_\lambda^{-1}} = 4d.$$
(EC.6.12)
This result indicates that LinFACT-G cannot be shown to achieve the lower bound established in Theorem 3.2 via this argument.
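For intuition on the Kiefer–Wolfowitz identity invoked above ($\min_\lambda \max_a \|x_a\|^2_{A_\lambda^{-1}} = d$ whenever the arms span $\mathbb{R}^d$), here is a minimal numerical sketch using the classical Fedorov–Wynn/Frank–Wolfe update for the D-optimal design. The step-size formula, iteration count, and function name are standard illustrative choices, not taken from the paper:

```python
import numpy as np

def g_optimal_value(X: np.ndarray, iters: int = 2000) -> float:
    """Approximate min over designs lambda of max_a ||x_a||^2 in A(lambda)^-1."""
    K, d = X.shape
    lam = np.ones(K) / K
    for _ in range(iters):
        A = X.T @ (lam[:, None] * X)                  # A(lambda) = sum_a lam_a x_a x_a^T
        g = np.einsum("ki,ij,kj->k", X, np.linalg.inv(A), X)
        j = int(np.argmax(g))                          # arm with the largest variance
        step = (g[j] / d - 1.0) / (g[j] - 1.0)         # Fedorov-Wynn step size
        lam *= (1.0 - step)
        lam[j] += step
    A = X.T @ (lam[:, None] * X)
    return float(np.max(np.einsum("ki,ij,kj->k", X, np.linalg.inv(A), X)))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))  # 20 arms spanning R^4
val = g_optimal_value(X)
# Kiefer-Wolfowitz: the optimal value equals d = 4 (max >= d always holds).
assert 4.0 - 1e-6 <= val <= 4.3
```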
Appendix EC.7 Proof of Theorem 4.2
The central idea of this proof is to establish a direct relationship between the lower bound and the key minimax summation terms in the upper bound, thereby enabling the sample complexity to be bounded explicitly in terms of the lower-bound term $(\Gamma^*)^{-1}$. We first show that the good event $\mathcal{E}_3$ defined below holds with probability at least $1 - \delta_r$ in each round $r$, where $\delta_r$ denotes the probability that $\mathcal{E}_3$ fails in round $r$. Taking a union bound across rounds then completes the proof of Lemma EC.1, which shows that $\mathcal{E}_3$ holds in every round with probability at least $1 - \delta$. Consequently, we can sum the per-round sample complexity bounds (conditioned on the good event) to obtain the overall bound on the sample complexity.
First, we define the good event $\mathcal{E}_3$:
$$\mathcal{E}_3 = \bigcap_{a \in \mathcal{A}(r-1)}\ \bigcap_{\substack{b \in \mathcal{A}(r-1) \\ b \neq a}}\ \bigcap_{r \in \mathbb{N}}\left\{\left|\left(\hat\mu_b(r) - \hat\mu_a(r)\right) - \left(\mu_b - \mu_a\right)\right| \le 2\epsilon_r\right\}.$$
(EC.7.1)
Since arms are sampled according to a preset allocation (i.e., a fixed design), we introduce the following lemma to provide a confidence region for the estimated parameter $\beta$.
Lemma EC.1
Let $\delta \in (0, 1)$. Then, it holds that $\mathbb{P}(\mathcal{E}_3) \ge 1 - \delta$.
Proof EC.2
Proof. Since $\hat\beta_r$ is an ordinary least squares estimator of $\beta$ and the noise is i.i.d., it follows that $y^\top(\beta - \hat\beta_r)$ is $\|y\|_{A_r^{-1}}$-sub-Gaussian for all $y \in \mathcal{Y}(\mathcal{X}(r-1))$. Moreover, the guarantees of the rounding procedure ensure that
$$\|y\|^2_{A_r^{-1}} \le (1 + \varepsilon)\,\rho_{\mathcal{X}\mathcal{Y}}\left(\mathcal{Y}(\mathcal{X}(r-1))\right)/N_r \le \frac{2^{-2r-1}}{\log\frac{2K(K-1)r(r+1)}{\delta}}$$
(EC.7.2)
for all $y \in \mathcal{Y}(\mathcal{X}(r-1))$, as ensured by our choice of $N_r$ in equation (22). Since the right-hand side is deterministic and does not depend on the randomness of the arm rewards, we have that for any $\pi > 0$ and $y \in \mathcal{Y}(\mathcal{X}(r-1))$,
$$\mathbb{P}\left\{\left|y^\top\left(\beta - \hat\beta_r\right)\right| > 2^{-r}\sqrt{\frac{\log(2/\pi)}{\log\frac{2K(K-1)r(r+1)}{\delta}}}\right\} \le \pi.$$
(EC.7.3)
Letting $\pi = \frac{\delta}{K(K-1)r(r+1)}$ and applying a union bound over all possible $y \in \mathcal{Y}(\mathcal{X}(r-1))$, where $|\mathcal{Y}(\mathcal{X}(r-1))| \le |\mathcal{Y}(\mathcal{X}(0))| \le K(K-1)$, we obtain the desired probability guarantee:
$$\begin{aligned}
\mathbb{P}\left(\mathcal{E}_3^c\right)
&= \mathbb{P}\left\{\bigcup_{r \in \mathbb{N}}\ \bigcup_{\substack{a, b \in \mathcal{A}(r-1) \\ b \neq a}}\left\{\left|\left(\hat\mu_b(r) - \hat\mu_a(r)\right) - \left(\mu_b - \mu_a\right)\right| > 2\epsilon_r\right\}\right\} \\
&\le \sum_{r=1}^{\infty}\mathbb{P}\left\{\bigcup_{\substack{a, b \in \mathcal{A}(r-1) \\ b \neq a}}\left\{\left|\left(\hat\mu_b(r) - \hat\mu_a(r)\right) - \left(\mu_b - \mu_a\right)\right| > 2\epsilon_r\right\}\right\} \\
&\le \sum_{r=1}^{\infty} K(K-1)\cdot\frac{\delta}{K(K-1)r(r+1)} = \delta\sum_{r=1}^{\infty}\frac{1}{r(r+1)} = \delta.
\end{aligned}$$
(EC.7.4)
Taking the union bound over all rounds $r \in \mathbb{N}$ completes the proof. \Halmos
Thus, by the standard result of the $\mathcal{X}\mathcal{Y}$-optimal design, we have
$$C_{\delta/K}(r) \le \epsilon_r,$$
(EC.7.5)
which matches the expression in equation (EC.5.5) for the G-optimal design.
Lemma EC.3
On the event $\mathcal{E}_3$, the best arm $1 \in \mathcal{A}(r)$ for all $r \in \mathbb{N}$.
Proof EC.4
Proof. If the event $\mathcal{E}_3$ holds, then for any arm $a \in \mathcal{A}(r-1)$, we have
$$\hat\mu_a(r) - \hat\mu_1(r) \le \mu_a - \mu_1 + 2\epsilon_r \le 2\epsilon_r < 2\epsilon_r + \epsilon,$$
(EC.7.6)
which implies that $\hat\mu_1(r) + \epsilon_r > \max_{a \in \mathcal{A}(r-1)}\hat\mu_a - \epsilon_r - \epsilon = L_r$ and $\hat\mu_1(r) + \epsilon_r \ge \max_{a \in \mathcal{A}(r-1)}\hat\mu_a(r) - C_{\delta/K}(r)$.
These inequalities ensure that arm 1 is not eliminated from $\mathcal{A}(r)$ in LinFACT. \Halmos
Lemma EC.5
With probability at least $1 - \delta$, and employing an $\varepsilon$-efficient rounding procedure, LinFACT-$\mathcal{X}\mathcal{Y}$ successfully identifies all $\epsilon$-best arms and achieves instance-optimal sample complexity up to logarithmic factors, as given by
$$N \le c\left[d\,R_{\text{upper}}\log\left(\frac{2K\left(R_{\text{upper}} + 1\right)}{\delta}\right)\right](\Gamma^*)^{-1} + p(d)\,R_{\text{upper}},$$
(EC.7.7)
where $c$ is a universal constant and $R_{\text{upper}} = \max\left\{\left\lceil \log_2\frac{4}{\alpha_\epsilon} \right\rceil, \left\lceil \log_2\frac{4}{\beta_\epsilon} \right\rceil\right\}$.
Proof EC.6
Proof. Combining the result of Lemma EC.1, we conclude that with probability at least $1 - \delta$,
$$\begin{aligned}
N &\le \sum_{r=1}^{R_{\text{upper}}}\max\left\{\left\lceil 2\,\rho_{\mathcal{X}\mathcal{Y}}\left(\mathcal{Y}(\mathcal{X}(r-1))\right)\frac{(1 + \varepsilon)}{\epsilon_r^2}\log\left(\frac{2K(K-1)r(r+1)}{\delta}\right)\right\rceil,\ p(d)\right\} \\
&\le \sum_{r=1}^{R_{\text{upper}}} 2\cdot 2^{2r}\,\rho_{\mathcal{X}\mathcal{Y}}\left(\mathcal{Y}(\mathcal{X}(r-1))\right)(1 + \varepsilon)\log\left(\frac{2K(K-1)r(r+1)}{\delta}\right) + \left(1 + p(d)\right)R_{\text{upper}} \\
&\le \left[64(1 + \varepsilon)\log\left(\frac{2K(K-1)R_{\text{upper}}\left(R_{\text{upper}} + 1\right)}{\delta}\right)\frac{R_{\text{upper}}\,d\,L_1}{u_\mathcal{Y}^2\,L_2}\right](\Gamma^*)^{-1} + \left(1 + p(d)\right)R_{\text{upper}} \\
&\le \left[128(1 + \varepsilon)\log\left(\frac{2K\left(R_{\text{upper}} + 1\right)}{\delta}\right)\frac{R_{\text{upper}}\,d\,L_1}{u_\mathcal{Y}^2\,L_2}\right](\Gamma^*)^{-1} + \left(1 + p(d)\right)R_{\text{upper}} \\
&\le c\left[d\,R_{\text{upper}}\log\left(\frac{2K\left(R_{\text{upper}} + 1\right)}{\delta}\right)\right](\Gamma^*)^{-1} + p(d)\,R_{\text{upper}},
\end{aligned}$$
(EC.7.8)
where $c$ is a universal constant, $R_{\text{upper}} = \max\left\{\left\lceil \log_2\frac{4}{\alpha_\epsilon} \right\rceil, \left\lceil \log_2\frac{4}{\beta_\epsilon} \right\rceil\right\}$, and the third inequality follows from equation (EC.6.10).
Then, let $\Delta = \min(\alpha_\epsilon, \beta_\epsilon)/16$ be the minimum gap of the problem instance. Considering that the approximation term $p(d)$ is of the form $\mathcal{O}(d/\varepsilon^2)$ (Allen-Zhu et al. 2021, Fiez et al. 2019), we have
$$\mathbb{E}\left[N_{\mathcal{X}\mathcal{Y}} \mid \mathcal{E}\right] = \mathcal{O}\left(d\,(\Gamma^*)^{-1}\log\left(\Delta^{-1}\right)\log\left(\frac{K}{\delta}\log\left(\Delta^{-2}\right)\right) + \frac{d}{\varepsilon^2}\log\left(\Delta^{-1}\right)\right).$$
(EC.7.9)
\Halmos

Appendix EC.8 Proof of Theorem 5.1
EC.8.1 Step 1: Define the Clean Events
The core of the proof lies in similarly defining the round at which all classifications are completed under the misspecified model. To this end, we reconstruct the anytime confidence radius for each arm in round $r$ and redefine the high-probability event over the entire execution of the algorithm. We denote this clean event as $\mathcal{E}_{1\text{-}m}$:
$$\mathcal{E}_{1\text{-}m} = \left\{\forall a \in \mathcal{A}(r-1),\ \forall r \in \mathbb{N},\ \left|\hat\mu_a(r) - \mu_a\right| \le \epsilon_r + \delta_m\sqrt{d}\right\}.$$
(EC.8.1)
Similarly, since arms are sampled according to a preset allocation (i.e., the optimal design criterion), we invoke Lemma EC.1 to derive an adjusted confidence region for the estimated parameter $\hat\beta_r$. When following the G-optimal sampling rule, as specified in lines 2 and 4 of the pseudocode, we obtain the following result for each round $r$:
$$A_r = \sum_{x \in \text{Supp}(\lambda_r)} N_r(x)\,x x^\top \succeq \frac{2d}{\epsilon_r^2}\log\left(\frac{2Kr(r+1)}{\delta}\right)A(\lambda_r).$$
(EC.8.2)
To give the confidence radius, we have the following decomposition:
$$\left|\left\langle\hat\beta_r - \beta, x\right\rangle\right| = \left|x^\top A_r^{-1}\sum_{t=1}^{N_r}\Delta_m\left(x_{A_t}\right)x_{A_t} + x^\top A_r^{-1}\sum_{t=1}^{N_r}\eta_t\,x_{A_t}\right|
\le \left|x^\top A_r^{-1}\sum_{t=1}^{N_r}\Delta_m\left(x_{A_t}\right)x_{A_t}\right| + \left|x^\top A_r^{-1}\sum_{t=1}^{N_r}\eta_t\,x_{A_t}\right|,$$
(EC.8.3)
where the first term is bounded by
$$\begin{aligned}
\left|x^\top A_r^{-1}\sum_{t=1}^{N_r}\Delta_m\left(x_{A_t}\right)x_{A_t}\right|
&= \left|x^\top A_r^{-1}\sum_{a \in \mathcal{A}(r-1)} N_r(a)\,\Delta_m(x_a)\,x_a\right| \\
&\le \delta_m\sum_{a \in \mathcal{A}(r-1)} N_r(a)\left|x^\top A_r^{-1}x_a\right| \\
&\le \delta_m\sqrt{\left(\sum_{a \in \mathcal{A}(r-1)} N_r(a)\right)x^\top\left(\sum_{a \in \mathcal{A}(r-1)} N_r(a)\,A_r^{-1}x_a x_a^\top A_r^{-1}\right)x} \\
&= \delta_m\sqrt{\sum_{a \in \mathcal{A}(r-1)} N_r(a)}\cdot\|x\|_{A_r^{-1}} \le \delta_m\sqrt{d},
\end{aligned}$$
(EC.8.4)
where the first inequality follows from Hölder's inequality, the second from Jensen's inequality, and the last from the guarantee of the G-optimal exploration policy, which ensures that $\|x\|^2_{A_r^{-1}} \le d/N_r$.
The second term is also bounded using Lemma EC.1 and the result in Lemma EC.1. For any arm $x \in \mathcal{X}(r-1)$, with probability at least $1 - \frac{\delta}{Kr(r+1)}$, we have
$$\begin{aligned}
\left|x^\top A_r^{-1}\sum_{t=1}^{N_r}\eta_t\,x_{A_t}\right|
&\le \sqrt{2\,\|x\|^2_{A_r^{-1}}\log\left(\frac{2Kr(r+1)}{\delta}\right)}
= \sqrt{2\,x^\top A_r^{-1}x\,\log\left(\frac{2Kr(r+1)}{\delta}\right)} \\
&\le \sqrt{2\,x^\top\left(\frac{\epsilon_r^2}{2d}\frac{1}{\log\frac{2Kr(r+1)}{\delta}}A(\lambda_r)^{-1}\right)x\,\log\left(\frac{2Kr(r+1)}{\delta}\right)} \le \epsilon_r.
\end{aligned}$$
(EC.8.5)
Thus, with the standard result of the G-optimal design, we also have
$$C_{\delta/K}(r) \le \epsilon_r.$$
(EC.8.6)
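The cancellation in (EC.8.5) is exact: with the per-round budget $N_r = \frac{2d}{\epsilon_r^2}L$, where $L = \log\frac{2Kr(r+1)}{\delta}$, and the G-optimal guarantee $\|x\|^2_{A_r^{-1}} \le d/N_r$, the deviation radius $\sqrt{2\,\|x\|^2_{A_r^{-1}}L}$ collapses to exactly $\epsilon_r$. A quick numerical check (illustrative values of $d$, $K$, $\delta$):

```python
import math

# With N_r = (2 d / eps_r^2) * L samples and ||x||^2_{A_r^{-1}} <= d / N_r,
# the radius sqrt(2 * (d / N_r) * L) equals eps_r exactly.
d, K, delta = 5, 30, 0.05
for r in range(1, 12):
    eps_r = 2.0 ** (-r)
    L = math.log(2 * K * r * (r + 1) / delta)
    N_r = 2 * d / eps_r ** 2 * L
    radius = math.sqrt(2 * (d / N_r) * L)
    assert radius <= eps_r + 1e-12
```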
To establish the probability guarantee for event $\mathcal{E}_{1r}$, we combine the results from equations (EC.8.4) and (EC.8.5), yielding
$$\begin{aligned}
\mathbb{P}\big( \mathcal{E}_{1r}^{c} \big)
&= \mathbb{P}\Big\{ \exists\, a \in \mathcal{X}_I(r-1),\ \exists\, r \in \mathbb{N}:\ |\hat\mu_r(a) - \mu_a| > \varepsilon_r + \delta_m\sqrt{d} \Big\} \\
&\le \sum_{r=1}^{\infty} \mathbb{P}\Big\{ \exists\, a \in \mathcal{X}_I(r-1):\ |\hat\mu_r(a) - \mu_a| > \varepsilon_r + \delta_m\sqrt{d} \Big\} \\
&\le \sum_{r=1}^{\infty} \sum_{a=1}^{K} \frac{\delta}{K\, r(r+1)} = \delta.
\end{aligned} \tag{EC.8.7}$$
Therefore, taking the union bound over rounds $r \in \mathbb{N}$, we have
$$\mathbb{P}(\mathcal{E}_{1r}) \ge 1 - \delta. \tag{EC.8.8}$$
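The union-bound accounting behind (EC.8.7) and (EC.8.8) can be checked directly: the per-round, per-arm failure mass $\delta/(Kr(r+1))$ telescopes to exactly $\delta$. A minimal sketch (ours):

```python
# Each round r assigns failure probability delta / (K r (r+1)) to each of the
# K arms; summing over arms and all rounds telescopes, since
# 1/(r(r+1)) = 1/r - 1/(r+1), so the total mass never exceeds delta.
delta, K, R = 0.05, 10, 10**6
total = sum(K * delta / (K * r * (r + 1)) for r in range(1, R + 1))
print(total)  # delta * (1 - 1/(R+1)), just below 0.05
```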
Considering an additional event that characterizes the gaps between different arms, defined as
$$\mathcal{E}_{2r} \triangleq \Big\{ \forall\, i \in G_{\varepsilon},\ \forall\, j \in \mathcal{X}_I(r-1),\ \forall\, r \in \mathbb{N}:\ \big| \big( \hat\mu_r(i) - \hat\mu_r(j) \big) - ( \mu_i - \mu_j ) \big| \le 2\varepsilon_r + 2\delta_m\sqrt{d} \Big\}. \tag{EC.8.9}$$
By equation (EC.8.1), for $i, j \in \mathcal{X}_I(r-1)$, we have
$$\begin{aligned}
&\mathbb{P}\Big\{ \big| ( \hat\mu_i - \hat\mu_j ) - ( \mu_i - \mu_j ) \big| > 2\varepsilon_r + 2\delta_m\sqrt{d} \;\Big|\; \mathcal{E}_{1r} \Big\} \\
&\quad\le \mathbb{P}\Big\{ |\hat\mu_i - \mu_i| + |\hat\mu_j - \mu_j| > 2\varepsilon_r + 2\delta_m\sqrt{d} \;\Big|\; \mathcal{E}_{1r} \Big\} \\
&\quad\le \mathbb{P}\Big\{ |\hat\mu_i - \mu_i| > \varepsilon_r + \delta_m\sqrt{d} \;\Big|\; \mathcal{E}_{1r} \Big\} + \mathbb{P}\Big\{ |\hat\mu_j - \mu_j| > \varepsilon_r + \delta_m\sqrt{d} \;\Big|\; \mathcal{E}_{1r} \Big\} \\
&\quad= 0,
\end{aligned} \tag{EC.8.10}$$
which implies
$$\mathbb{P}( \mathcal{E}_{2r} \mid \mathcal{E}_{1r} ) = 1. \tag{EC.8.11}$$

EC.8.2 Step 2: Bound the Expected Sample Complexity
To bound the expected sample complexity under model misspecification, we aim to identify the round in which all arms ๐ โ ๐บ ๐ have been included in ๐บ ๐ , and the round in which all arms ๐ โ ๐บ ๐ ๐ have been added to ๐ต ๐ .
Lemma EC.1
For $i \in G_{\varepsilon}$ and $\delta_m < \frac{\alpha_{\varepsilon}}{2\sqrt{d}}$, if $r \ge \big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i - 2\delta_m\sqrt{d}} \big\rceil$, then we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin G_r] \mid \mathcal{E}_{1r} \big] = 0$.
Proof EC.2
Proof. First, for any $i \in G_{\varepsilon}$,
$$\begin{aligned}
\mathbb{E}_M\big[ \mathbb{1}[i \notin G_r] \mid \mathcal{E}_{1r} \big]
&= \mathbb{E}_M\Big[ \mathbb{1}\Big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \ge -2\varepsilon_r + \varepsilon \Big] \;\Big|\; \mathcal{E}_{1r},\ i \notin G_s\ \big( s \in \{1, 2, \ldots, r-1\} \big) \Big] \\
&\le \mathbb{E}_M\Big[ \mathbb{1}\Big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \ge -2\varepsilon_r + \varepsilon \Big] \;\Big|\; \mathcal{E}_{1r} \Big].
\end{aligned} \tag{EC.8.12}$$
If $i \in G_{r-1}$, then $i \in G_r$ by definition. Otherwise, if $i \notin G_{r-1}$, then under event $\mathcal{E}_{1r}$, for $i \in G_{\varepsilon}$ and $r \ge \big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i - 2\delta_m\sqrt{d}} \big\rceil$, we have
$$\max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \;\le\; \mu_{\arg\max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j} - \mu_i + 2^{-r+1} + 2\delta_m\sqrt{d} \;\le\; \Delta_i + 2^{-r+1} + 2\delta_m\sqrt{d} \;\le\; \varepsilon - 2\varepsilon_r, \tag{EC.8.13}$$
which implies that $i \in G_r$ by line 8 of the algorithm. In particular, under event $\mathcal{E}_{1r}$, if $i \notin G_{r-1}$, for all $i \in G_{\varepsilon}$ and $r \ge \big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i - 2\delta_m\sqrt{d}} \big\rceil$, we have
$$\mathbb{E}_M\Big[ \mathbb{1}\Big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \le -2\varepsilon_r + \varepsilon \Big] \;\Big|\; i \notin G_{r-1},\ \mathcal{E}_{1r} \Big] = 1. \tag{EC.8.14}$$
Consequently, $\mathbb{1}[i \in G_r] - \mathbb{1}[i \in G_{r-1}] \ge 0$. Therefore, writing $\mathbb{1}[\,\cdot\,]$ for $\mathbb{1}\big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \ge -2\varepsilon_r + \varepsilon \big]$,
$$\begin{aligned}
\mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,] \mid \mathcal{E}_{1r} \big]
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \in G_{r-1}] \mid \mathcal{E}_{1r} \big] + \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin G_{r-1}] \mid \mathcal{E}_{1r} \big] \\
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin G_{r-1}] \mid \mathcal{E}_{1r} \big] \\
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin G_{r-1}] \mid i \in G_{r-1},\ \mathcal{E}_{1r} \big]\, \mathbb{P}( i \in G_{r-1} \mid \mathcal{E}_{1r} ) \\
&\qquad + \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin G_{r-1}] \mid i \notin G_{r-1},\ \mathcal{E}_{1r} \big]\, \mathbb{P}( i \notin G_{r-1} \mid \mathcal{E}_{1r} ) \\
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,] \mid i \notin G_{r-1},\ \mathcal{E}_{1r} \big] \cdot \mathbb{E}_M\big[ \mathbb{1}[i \notin G_{r-1}] \mid \mathcal{E}_{1r} \big] \\
&= 0,
\end{aligned} \tag{EC.8.15}$$
where the splitting step uses the additivity of expectation, the first term is then dropped by the deterministic fact that $\mathbb{1}[\,\cdot\,]\,\mathbb{1}[i \in G_{r-1}] = 0$, the conditional decomposition follows from the law of total expectation, the expectation of an indicator function equals the corresponding probability, and the last equality follows from the result in equation (EC.8.14). The lemma can thus be concluded together with equation (EC.8.12).
In the perfectly linear model, $\varepsilon - \Delta_i > 0$ always holds for any $i \in G_{\varepsilon}$. However, under model misspecification, the sign of this term within the logarithm must be verified. To ensure positivity for all $i \in G_{\varepsilon}$, it is necessary that $\alpha_{\varepsilon} > 2\delta_m\sqrt{d}$. As the misspecification magnitude $\delta_m$ approaches $\alpha_{\varepsilon}/(2\sqrt{d})$, the upper bound on the expected sample complexity increases sharply, since the misspecification significantly impairs the identification of $\varepsilon$-best arms. Moreover, when $\delta_m \ge \alpha_{\varepsilon}/(2\sqrt{d})$, the sample complexity can no longer be bounded in this form, which is intuitive and consistent with the general insights discussed earlier in Section 5.3. \Halmos
Lemma EC.3
For $i \in G_{\varepsilon}^{c}$ and $\delta_m < \frac{\beta_{\varepsilon}}{2\sqrt{d}}$, if $r \ge \big\lceil \log_2 \frac{4}{\Delta_i - \varepsilon - 2\delta_m\sqrt{d}} \big\rceil$, then we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin B_r] \mid \mathcal{E}_{1r} \big] = 0$.
Proof EC.4
Proof. First, for any $i \in G_{\varepsilon}^{c}$,
$$\begin{aligned}
\mathbb{E}_M\big[ \mathbb{1}[i \notin B_r] \mid \mathcal{E}_{1r} \big]
&= \mathbb{E}_M\Big[ \mathbb{1}\Big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \le 2\varepsilon_r + \varepsilon \Big] \;\Big|\; \mathcal{E}_{1r},\ i \notin B_s\ \big( s \in \{1, 2, \ldots, r-1\} \big) \Big] \\
&\le \mathbb{E}_M\Big[ \mathbb{1}\Big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \le 2\varepsilon_r + \varepsilon \Big] \;\Big|\; \mathcal{E}_{1r} \Big].
\end{aligned} \tag{EC.8.16}$$
If $i \in B_{r-1}$, then $i \in B_r$ by definition. Otherwise, if $i \notin B_{r-1}$, then under event $\mathcal{E}_{1r}$, for $i \in G_{\varepsilon}^{c}$ and $r \ge \big\lceil \log_2 \frac{4}{\Delta_i - \varepsilon - 2\delta_m\sqrt{d}} \big\rceil$, we have
$$\max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \;\ge\; \Delta_i - 2^{-r+1} - 2\delta_m\sqrt{d} \;\ge\; \varepsilon + 2\varepsilon_r, \tag{EC.8.17}$$
which implies that $i \in B_r$ by line 6 of the algorithm. In particular, under event $\mathcal{E}_{1r}$, if $i \notin B_{r-1}$, for all $i \in G_{\varepsilon}^{c}$ and $r \ge \big\lceil \log_2 \frac{4}{\Delta_i - \varepsilon - 2\delta_m\sqrt{d}} \big\rceil$, we have
$$\mathbb{E}_M\Big[ \mathbb{1}\Big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i > 2\varepsilon_r + \varepsilon \Big] \;\Big|\; i \notin B_{r-1},\ \mathcal{E}_{1r} \Big] = 1. \tag{EC.8.18}$$
Deterministically, $\mathbb{1}[i \in B_r] - \mathbb{1}[i \in B_{r-1}] \ge 0$. Therefore, writing $\mathbb{1}[\,\cdot\,]$ for $\mathbb{1}\big[ \max_{j \in \mathcal{X}_I(r-1)} \hat\mu_j - \hat\mu_i \le 2\varepsilon_r + \varepsilon \big]$,
$$\begin{aligned}
\mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,] \mid \mathcal{E}_{1r} \big]
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \in B_{r-1}] \mid \mathcal{E}_{1r} \big] + \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin B_{r-1}] \mid \mathcal{E}_{1r} \big] \\
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin B_{r-1}] \mid \mathcal{E}_{1r} \big] \\
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin B_{r-1}] \mid i \in B_{r-1},\ \mathcal{E}_{1r} \big]\, \mathbb{P}( i \in B_{r-1} \mid \mathcal{E}_{1r} ) \\
&\qquad + \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,]\, \mathbb{1}[i \notin B_{r-1}] \mid i \notin B_{r-1},\ \mathcal{E}_{1r} \big]\, \mathbb{P}( i \notin B_{r-1} \mid \mathcal{E}_{1r} ) \\
&= \mathbb{E}_M\big[ \mathbb{1}[\,\cdot\,] \mid i \notin B_{r-1},\ \mathcal{E}_{1r} \big] \cdot \mathbb{E}_M\big[ \mathbb{1}[i \notin B_{r-1}] \mid \mathcal{E}_{1r} \big] \\
&= 0.
\end{aligned} \tag{EC.8.19}$$
The splitting step follows from the linearity of expectation; the first term is dropped using the deterministic fact that $\mathbb{1}[\,\cdot\,]\,\mathbb{1}[i \in B_{r-1}] = 0$; the conditional decomposition applies the law of total expectation; the expectation of an indicator function equals the corresponding probability; and the final equality follows from equation (EC.8.18). The lemma then follows by combining this result with equation (EC.8.16).
Similarly, we must ensure the positivity of the term inside the logarithm. To guarantee that $\Delta_i - \varepsilon - 2\delta_m\sqrt{d} > 0$ for every $i \in G_{\varepsilon}^{c}$, it is necessary that $\beta_{\varepsilon} > 2\delta_m\sqrt{d}$. \Halmos
Lemma EC.5
Suppose $\delta_m < \min\big\{ \frac{\alpha_{\varepsilon}}{2\sqrt{d}},\ \frac{\beta_{\varepsilon}}{2\sqrt{d}} \big\}$; then, under model misspecification, the round by which all arms are classified is
$$r'_{\text{upper}} = \max\Big\{ \Big\lceil \log_2 \frac{4}{\alpha_{\varepsilon} - 2\delta_m\sqrt{d}} \Big\rceil,\ \Big\lceil \log_2 \frac{4}{\beta_{\varepsilon} - 2\delta_m\sqrt{d}} \Big\rceil \Big\}.$$
Proof EC.6
Proof. Combining the results of Lemmas EC.1 and EC.3, we define an auxiliary round $r_a$ to facilitate the derivation of the upper bound. Specifically, for $i \in G_{\varepsilon}^{c}$ and $r \ge \big\lceil \log_2 \frac{4}{\Delta_i - \varepsilon - 2\delta_m\sqrt{d}} \big\rceil$, we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin B_r] \mid \mathcal{E}_{1r} \big] = 0$. Similarly, for $i \in G_{\varepsilon}$ and $r \ge \big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i - 2\delta_m\sqrt{d}} \big\rceil$, we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin G_r] \mid \mathcal{E}_{1r} \big] = 0$.
Noting that $\alpha_{\varepsilon} = \min_{i \in G_{\varepsilon}} ( \varepsilon - \Delta_i )$ and $\beta_{\varepsilon} = \min_{i \in G_{\varepsilon}^{c}} ( \Delta_i - \varepsilon )$, it follows that for any round $r \ge r'_{\text{upper}}$, all arms have been included in either $G_r$ or $B_r$, marking the termination of the algorithm.
\Halmos
Lemma EC.7
Under model misspecification, for the expected sample complexity conditioned on the high-probability event $\mathcal{E}_{1r}$, we have
$$\begin{aligned}
\mathbb{E}_M[N_G^{\text{mis}} \mid \mathcal{E}_{1r}] \le c \max\Big\{ & \frac{256\, d}{( \alpha_{\varepsilon} - 2\delta_m\sqrt{d} )^2} \log\Big( \frac{2K}{\delta} \log_2 \frac{16}{\alpha_{\varepsilon} - 2\delta_m\sqrt{d}} \Big), \\
& \frac{256\, d}{( \beta_{\varepsilon} - 2\delta_m\sqrt{d} )^2} \log\Big( \frac{2K}{\delta} \log_2 \frac{16}{\beta_{\varepsilon} - 2\delta_m\sqrt{d}} \Big) \Big\} + \frac{d(d+1)}{2}\, r'_{\text{upper}},
\end{aligned} \tag{EC.8.20}$$
where $c$ is a universal constant and $r'_{\text{upper}} = \max\big\{ \big\lceil \log_2 \frac{4}{\alpha_{\varepsilon} - 2\delta_m\sqrt{d}} \big\rceil,\ \big\lceil \log_2 \frac{4}{\beta_{\varepsilon} - 2\delta_m\sqrt{d}} \big\rceil \big\}$.
Proof EC.8
Proof. We can also decompose $N$ as in equation (EC.5.26), where all expectations are conditioned on the high-probability event $\mathcal{E}_{1r}$, as given by
$$\begin{aligned}
\mathbb{E}_M[N_G^{\text{mis}} \mid \mathcal{E}_{1r}]
&\le \sum_{r=1}^{\infty} \mathbb{E}_M\big[ \mathbb{1}\big[ G_r \cup B_r \neq [K] \big] \mid \mathcal{E}_{1r} \big] \sum_{a \in \mathcal{X}(r-1)} T_a(r) \\
&\le \sum_{r=1}^{r'_{\text{upper}}} \Big( d\, 2^{2r+1} \log\Big( \frac{2Kr(r+1)}{\delta} \Big) + \frac{d(d+1)}{2} \Big) \\
&\le \frac{d(d+1)}{2}\, r'_{\text{upper}} + 2d \log\Big( \frac{2K}{\delta} \Big) \sum_{r=1}^{r'_{\text{upper}}} 2^{2r} + 4d \sum_{r=1}^{r'_{\text{upper}}} 2^{2r} \log(r+1) \\
&\le 4 \log\Big[ \frac{2K}{\delta} \big( r'_{\text{upper}} + 1 \big) \Big] \sum_{r=1}^{r'_{\text{upper}}} d\, 2^{2r} + \frac{d(d+1)}{2}\, r'_{\text{upper}} \\
&\le c \max\Big\{ \frac{256\, d}{( \alpha_{\varepsilon} - 2\delta_m\sqrt{d} )^2} \log\Big( \frac{2K}{\delta} \log_2 \frac{16}{\alpha_{\varepsilon} - 2\delta_m\sqrt{d}} \Big),\ \frac{256\, d}{( \beta_{\varepsilon} - 2\delta_m\sqrt{d} )^2} \log\Big( \frac{2K}{\delta} \log_2 \frac{16}{\beta_{\varepsilon} - 2\delta_m\sqrt{d}} \Big) \Big\} + \frac{d(d+1)}{2}\, r'_{\text{upper}}.
\end{aligned} \tag{EC.8.21}$$
Then, letting $H = \min\big( \alpha_{\varepsilon} - 2\delta_m\sqrt{d},\ \beta_{\varepsilon} - 2\delta_m\sqrt{d} \big)/16$, we have
$$\mathbb{E}_M[N_G^{\text{mis}} \mid \mathcal{E}_{1r}] = \mathcal{O}\Big( d\, H^{-2} \log\Big( \frac{K}{\delta} \log\big( H^{-2} \big) \Big) + d^2 \log\big( H^{-1} \big) \Big). \tag{EC.8.22}$$
\Halmos

Appendix EC.9 Proof of Theorem 5.2

EC.9.1 Step 1: Confidence Radius
In equation (EC.8.3), the first term, arising from the unknown model misspecification, is unavoidable without prior knowledge. However, rather than focusing on the distance between the true parameter $\beta$ and its estimator $\hat\beta_r$, we instead compare the orthogonal parameter $\beta_o$ with $\hat\beta_r$ in the direction of $x \in \mathbb{R}^d$.
Building on the definition of the empirically optimal vector $\hat\mu_o(r)$, with $\big( \hat\beta_o(r), \hat{\boldsymbol{\Delta}}_o(r) \big)$ as its associated solution, and the orthogonal parameterization $\big( \beta_o, \boldsymbol{\Delta}_o(\cdot) \big)$, we derive the confidence radius for each arm via the following decomposition. For any $a \in \mathcal{X}_I(r-1)$, let $\hat\mu_o(a) = \hat\mu_{o,a}(r)$ denote the value of the optimal estimator on arm $a$ in round $r$; then we have
$$\begin{aligned}
|\hat\mu_o(a) - \mu_a|
&= \big| \langle \hat\beta_o(r) - \beta,\, x_a \rangle + \hat\Delta_{o,a}(r) - \Delta_{m,a} \big| \\
&\le \big| \langle \hat\beta_o(r) - \beta_o,\, x_a \rangle \big| + \big| \langle \beta_o - \beta,\, x_a \rangle \big| + \big| \hat\Delta_{o,a}(r) - \Delta_{m,a} \big|,
\end{aligned} \tag{EC.9.1}$$
where the third term is bounded by definition as $\big| \hat\Delta_{o,a}(r) - \Delta_{m,a} \big| \le 2\delta_m$, while the first two terms can be bounded using the auxiliary lemmas provided below.
Lemma EC.1
Let $r$ be any round such that $\boldsymbol{V}_r$ is invertible. Consider the orthogonal parameterization $\big( \beta_o, \boldsymbol{\Delta}_o(\cdot) \big)$ for $\mu = X\beta + \boldsymbol{\Delta}$ with $\| \boldsymbol{\Delta} \|_{\infty} \le \delta_m$. Then
$$\|\beta_o - \beta\|_{\boldsymbol{V}_r} \le \delta_m \sqrt{n_r}, \tag{EC.9.2}$$
where $n_r$ is defined in equation (33).
Proof EC.2
Proof. We use the expression $\beta_o = \beta + \boldsymbol{V}_r^{-1} X^{\top} \boldsymbol{\Delta}$ derived above, where $X$ stacks the $n_r$ pulled contexts and $\boldsymbol{\Delta}$ the corresponding misspecification terms. Let $P_X = X \big( X^{\top} X \big)^{\dagger} X^{\top}$ be a projection; we have
$$\|\beta_o - \beta\|_{\boldsymbol{V}_r} = \big\| \boldsymbol{V}_r^{-1} X^{\top} \boldsymbol{\Delta} \big\|_{\boldsymbol{V}_r} = \sqrt{ \boldsymbol{\Delta}^{\top} X \boldsymbol{V}_r^{-1} X^{\top} \boldsymbol{\Delta} } = \| P_X \boldsymbol{\Delta} \|_2 \le \| \boldsymbol{\Delta} \|_2 \le \delta_m \sqrt{n_r}. \tag{EC.9.3}$$
\Halmos
Lemma EC.3
(Réda et al. (2021), Lemma 10). Let $\hat\mu_o(r) = X \hat\beta_o(r) + \hat{\boldsymbol{\Delta}}_o(r)$ in round $r$, where $\big( \hat\beta_o(r), \hat{\boldsymbol{\Delta}}_o(r) \big)$ is the solution of (30). Then the following relationship holds:
$$\|\hat\beta_o(r) - \beta_o\|_{\boldsymbol{V}_r}^2 \le \|\hat\beta_r - \beta_o\|_{\boldsymbol{V}_r}^2. \tag{EC.9.4}$$
Lemma EC.4
(Lattimore and Szepesvári (2020), Section 20). Let $\delta \in (0,1)$. Then, with probability at least $1 - \delta$, it holds that for all $t \in \mathbb{N}$,
$$\|\hat\beta_t - \beta\|_{\boldsymbol{V}_t} < 2\sqrt{2\Big( d \log(6) + \log\Big( \frac{1}{\delta} \Big) \Big)}. \tag{EC.9.5}$$
Equation (EC.9.5) is not directly applicable in our setting due to the deviation term introduced by model misspecification, as shown in equation (EC.8.3). However, by leveraging the orthogonal parameterization, we obtain
$$\|\hat\beta_r - \beta_o\|_{\boldsymbol{V}_r}^2 = \big\| \boldsymbol{V}_r^{-1} S_r \big\|_{\boldsymbol{V}_r}^2 = \| S_r \|_{\boldsymbol{V}_r^{-1}}^2, \tag{EC.9.6}$$
where $S_r = \sum_{i=1}^{n_r} \eta_{A_i}\, x_{A_i}$ is the standard self-normalized quantity in the linear bandit literature, allowing existing techniques, such as various concentration inequalities, to be applied directly in the presence of model misspecification. This observation leads to the conclusion that, under misspecification, for any round $r \in \mathbb{N}$, it holds with probability at least $1 - \delta$ that
$$\|\hat\beta_r - \beta_o\|_{\boldsymbol{V}_r} < 2\sqrt{2\Big( d \log(6) + \log\Big( \frac{1}{\delta} \Big) \Big)}. \tag{EC.9.7}$$
This result serves as the basis for designing the sampling budget and conducting theoretical analyses.
Together with Lemmas EC.1, EC.3, and EC.4, the distance $|\hat\mu_o(a) - \mu_a|$ can be bounded as follows: with probability at least $1 - \delta/\big( Kr(r+1) \big)$, we have
$$\begin{aligned}
|\hat\mu_o(a) - \mu_a|
&\le \big| \langle \hat\beta_o(r) - \beta_o,\, x_a \rangle \big| + \big| \langle \beta_o - \beta,\, x_a \rangle \big| + \big| \hat\Delta_{o,a}(r) - \Delta_{m,a} \big| \\
&\le \|\hat\beta_o(r) - \beta_o\|_{\boldsymbol{V}_r}\, \|x_a\|_{\boldsymbol{V}_r^{-1}} + \|\beta_o - \beta\|_{\boldsymbol{V}_r}\, \|x_a\|_{\boldsymbol{V}_r^{-1}} + 2\delta_m \\
&\le \|\hat\beta_r - \beta_o\|_{\boldsymbol{V}_r}\, \|x_a\|_{\boldsymbol{V}_r^{-1}} + \|\beta_o - \beta\|_{\boldsymbol{V}_r}\, \|x_a\|_{\boldsymbol{V}_r^{-1}} + 2\delta_m \\
&\le \Big( 2\sqrt{2\Big( d \log(6) + \log\Big( \frac{Kr(r+1)}{\delta} \Big) \Big)} + \delta_m \sqrt{n_r} \Big) \sqrt{\frac{d}{n_r}} + 2\delta_m \\
&\le \varepsilon_r + \big( \sqrt{d} + 2 \big)\, \delta_m.
\end{aligned} \tag{EC.9.8}$$
Then, we define a new clean event
$$\mathcal{E}_{1r}' \triangleq \Big\{ \forall\, a \in \mathcal{X}_I(r-1),\ \forall\, r \in \mathbb{N}:\ |\hat\mu_o(a) - \mu_a| \le \varepsilon_r + \big( \sqrt{d} + 2 \big)\, \delta_m \Big\}. \tag{EC.9.9}$$
This mirrors the event defined in equation (EC.8.1), allowing us to derive all corresponding results in Section EC.8.
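The projection bound of Lemma EC.1, $\|\beta_o - \beta\|_{\boldsymbol{V}_r} \le \delta_m\sqrt{n_r}$, admits a quick numerical sanity check. The sketch below is ours (not the paper's code) and assumes the notation $X$ for the $n_r \times d$ matrix of pulled contexts, so that $\boldsymbol{V}_r = X^{\top}X$ and $\beta_o - \beta = \boldsymbol{V}_r^{-1}X^{\top}\boldsymbol{\Delta}$:

```python
import numpy as np

# The deviation of the orthogonal parameter is a projection of the
# misspecification vector, so its V_r-norm is at most delta_m * sqrt(n_r).
rng = np.random.default_rng(0)
n_r, d, delta_m = 200, 6, 0.1
X = rng.standard_normal((n_r, d))                 # pulled contexts
Delta = rng.uniform(-delta_m, delta_m, size=n_r)  # |Delta_i| <= delta_m
V_r = X.T @ X
shift = np.linalg.solve(V_r, X.T @ Delta)         # beta_o - beta
dev = float(np.sqrt(shift @ V_r @ shift))         # ||beta_o - beta||_{V_r}
print(dev, delta_m * np.sqrt(n_r))                # dev is the smaller value
```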
EC.9.2 Step 2: Bound the Expected Sample Complexity
Lemma EC.5
For $i \in G_{\varepsilon}$ and $\delta_m < \frac{\alpha_{\varepsilon}}{2(\sqrt{d}+2)}$, if $r \ge \big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i - 2\delta_m(\sqrt{d}+2)} \big\rceil$, then we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin G_r] \mid \mathcal{E}_{1r}' \big] = 0$.
Lemma EC.6
For $i \in G_{\varepsilon}^{c}$ and $\delta_m < \frac{\beta_{\varepsilon}}{2(\sqrt{d}+2)}$, if $r \ge \big\lceil \log_2 \frac{4}{\Delta_i - \varepsilon - 2\delta_m(\sqrt{d}+2)} \big\rceil$, then we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin B_r] \mid \mathcal{E}_{1r}' \big] = 0$.
Lemma EC.7
For the misspecification magnitude $\delta_m < \min\big\{ \frac{\alpha_{\varepsilon}}{2(\sqrt{d}+2)},\ \frac{\beta_{\varepsilon}}{2(\sqrt{d}+2)} \big\}$, the round $r''_{\text{upper}} = \max\big\{ \big\lceil \log_2 \frac{4}{\alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \big\rceil,\ \big\lceil \log_2 \frac{4}{\beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \big\rceil \big\}$ marks the point at which all classifications are completed and the algorithm terminates under model misspecification when using an estimation procedure based on orthogonal parameterization.
The proofs of Lemmas EC.5, EC.6, and EC.7 follow similar arguments to those presented in Section EC.8 and are therefore omitted for brevity.
Lemma EC.8
For the expected sample complexity given the high-probability event $\mathcal{E}_{1r}'$, we have
$$\begin{aligned}
\mathbb{E}_M[N_G^{\text{op}} \mid \mathcal{E}_{1r}'] \le c \max\Big\{ & \frac{256\, d}{\big( \alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2) \big)^2} \log\Big( \frac{K \cdot 6^d}{\delta} \log_2 \frac{16}{\alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \Big), \\
& \frac{256\, d}{\big( \beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2) \big)^2} \log\Big( \frac{K \cdot 6^d}{\delta} \log_2 \frac{16}{\beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \Big) \Big\} + \frac{d(d+1)}{2}\, r''_{\text{upper}},
\end{aligned} \tag{EC.9.10}$$
where $c$ is a universal constant and $r''_{\text{upper}} = \max\big\{ \big\lceil \log_2 \frac{4}{\alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \big\rceil,\ \big\lceil \log_2 \frac{4}{\beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \big\rceil \big\}$ denotes the round in which all classifications are completed and the algorithm terminates under model misspecification with orthogonal parameterization.
Proof EC.9
Proof. The decomposition of $N$ in equation (EC.5.26) can be reformulated, where the expectations are conditioned on the high-probability event $\mathcal{E}_{1r}'$, given by
$$\begin{aligned}
\mathbb{E}_M[N_G^{\text{op}} \mid \mathcal{E}_{1r}']
&\le \sum_{r=1}^{\infty} \mathbb{E}_M\big[ \mathbb{1}\big[ G_r \cup B_r \neq [K] \big] \mid \mathcal{E}_{1r}' \big] \sum_{a \in \mathcal{X}(r-1)} T_a(r) \\
&\le \sum_{r=1}^{r''_{\text{upper}}} \Big( d\, 2^{2r+3} \Big( d \log(6) + \log\Big( \frac{Kr(r+1)}{\delta} \Big) \Big) + \frac{d(d+1)}{2} \Big) \\
&\le \frac{d(d+1)}{2}\, r''_{\text{upper}} + 8d \log\Big( \frac{K \cdot 6^d}{\delta} \Big) \sum_{r=1}^{r''_{\text{upper}}} 2^{2r} + 16d \sum_{r=1}^{r''_{\text{upper}}} 2^{2r} \log(r+1) \\
&\le 16 \log\Big[ \frac{K \cdot 6^d}{\delta} \big( r''_{\text{upper}} + 1 \big) \Big] \sum_{r=1}^{r''_{\text{upper}}} d\, 2^{2r} + \frac{d(d+1)}{2}\, r''_{\text{upper}} \quad \text{(EC.9.11)} \\
&\le c \max\Big\{ \frac{256\, d}{\big( \alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2) \big)^2} \log\Big( \frac{K \cdot 6^d}{\delta} \log_2 \frac{16}{\alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \Big), \\
&\qquad\quad \frac{256\, d}{\big( \beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2) \big)^2} \log\Big( \frac{K \cdot 6^d}{\delta} \log_2 \frac{16}{\beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2)} \Big) \Big\} + \frac{d(d+1)}{2}\, r''_{\text{upper}}, \quad \text{(EC.9.12)}
\end{aligned}$$
where $c$ is a universal constant. Then, letting $H = \min\big( \alpha_{\varepsilon} - 2\delta_m(\sqrt{d}+2),\ \beta_{\varepsilon} - 2\delta_m(\sqrt{d}+2) \big)/16$, we have
$$\mathbb{E}_M[N_G^{\text{op}} \mid \mathcal{E}] = \mathcal{O}\Big( d\, H^{-2} \log\Big( \frac{K \cdot 6^d}{\delta} \log\big( H^{-1} \big) \Big) + d^2 \log\big( H^{-1} \big) \Big). \tag{EC.9.13}$$
\Halmos

Appendix EC.10 Proof of Proposition 5.4
The proof of this proposition closely follows that of Theorem 5.1 in Section EC.8, with the only modification being the definition of $C_{\delta/K}(a)$, which is now set to $\varepsilon_r + \delta_m\sqrt{d}$ given that $\delta_m$ is known in advance. The conclusions in Section EC.8.1 remain valid. Additionally, the marginal round in Lemma EC.1 is updated from $\big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i - 2\delta_m\sqrt{d}} \big\rceil$ to $\big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i} \big\rceil$. The remainder of the proof follows directly by applying the same reasoning steps.
Appendix EC.11 Proof of Theorem 6.1

EC.11.1 Step 1: Rearrange the Clean Event
Following the derivation approach in Section EC.5.5, the core of the proof is to similarly identify the round in which all classifications are completed under the GLM setting. To this end, we reconstruct the anytime confidence radius for the arms in each round ๐ and define a high-probability event that holds throughout the execution of the algorithm.
Let $\dot{\boldsymbol{V}}_r = \sum_{i=1}^{n_r} \dot g_{\text{link}}\big( x_{A_i}^{\top} \tilde\beta_r \big)\, x_{A_i} x_{A_i}^{\top}$, where $\tilde\beta_r$ is some convex combination of the true parameter $\beta$ and the parameter $\hat\beta_r$ based on maximum likelihood estimation (MLE). It can be checked that the unweighted matrix $\boldsymbol{V}_r = \sum_{i=1}^{n_r} x_{A_i} x_{A_i}^{\top}$ in the standard linear model is the special case of this newly defined matrix $\dot{\boldsymbol{V}}_r$ when the inverse link function is the identity, $g_{\text{link}}(x) = x$.
For each arm $a$, we define the following auxiliary vector
$$w_a = \big( w_{a,1}, w_{a,2}, \ldots, w_{a,n_r} \big) = x_a^{\top} \dot{\boldsymbol{V}}_r^{-1} \big( x_{A_1}, x_{A_2}, \ldots, x_{A_{n_r}} \big) \in \mathbb{R}^{n_r}, \tag{EC.11.1}$$
and thus we have
$$\|w_a\|_2^2 = w_a w_a^{\top} = x_a^{\top} \dot{\boldsymbol{V}}_r^{-1} \boldsymbol{V}_r \dot{\boldsymbol{V}}_r^{-1} x_a. \tag{EC.11.2}$$
To give the confidence radius under the GLM, we have the following statement for any arm $a \in \mathcal{X}_I(r-1)$ in round $r$:
$$|\hat\mu_r(a) - \mu_a| = \big| x_a^{\top} \big( \hat\beta_r - \beta \big) \big| = \Big| x_a^{\top} \dot{\boldsymbol{V}}_r^{-1} \sum_{i=1}^{n_r} x_{A_i}\, \eta_i \Big| = \Big| \sum_{i=1}^{n_r} w_{a,i}\, \eta_{A_i} \Big|, \tag{EC.11.3}$$
where the second equality is established with Lemma 1 of Kveton et al. (2023), and $n_r$, defined in equation (40), is the adjusted sampling budget in each round $r$ for the GLM. Since $(\eta_{A_i})_{i \le n_r}$ are independent, mean-zero, 1-sub-Gaussian random variables, $\sum_{i=1}^{n_r} w_{a,i}\, \eta_{A_i}$ is a $\|w_a\|_2$-sub-Gaussian variable for each arm $a$, so we have
$$\mathbb{P}\big( |\hat\mu_r(a) - \mu_a| \ge \varepsilon_r \big) \le 2\exp\Big( -\frac{\varepsilon_r^2}{2\, \|w_a\|_2^2} \Big). \tag{EC.11.4}$$
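The sub-Gaussian tail in (EC.11.4) can be illustrated with a short Monte Carlo sketch (ours, not the paper's code; Gaussian noise stands in for the 1-sub-Gaussian $\eta_{A_i}$):

```python
import numpy as np

# For fixed weights w, S = sum_i w_i * eta_i with standard normal eta_i
# satisfies P(|S| > t) <= 2 exp(-t^2 / (2 ||w||_2^2)); the empirical tail
# over many draws sits well below the bound.
rng = np.random.default_rng(1)
w = rng.uniform(-1.0, 1.0, size=50)          # plays the role of (w_{a,i})_i
sigma2 = float(w @ w)                        # ||w||_2^2
t = 2.0 * np.sqrt(sigma2)
S = rng.standard_normal((50_000, 50)) @ w    # 50,000 independent copies of S
empirical = float(np.mean(np.abs(S) > t))
bound = 2 * np.exp(-t**2 / (2 * sigma2))
print(empirical, bound)                      # empirical tail below the bound
```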
Since $\dot{\boldsymbol{V}}_r$ is not known in the process, we need to find another way to represent this term. By assumption, we know $\dot g_{\text{link}} \ge \kappa_{\min}$ for some $\kappa_{\min} \in \mathbb{R}_+$ and for all $a \in \mathcal{X}_I(r-1)$. Therefore $\kappa_{\min}^{-1} \boldsymbol{V}_r^{-1} \succeq \dot{\boldsymbol{V}}_r^{-1}$ by the definition of $\dot{\boldsymbol{V}}_r$, and then we have
$$\|w_a\|_2^2 = x_a^{\top} \dot{\boldsymbol{V}}_r^{-1} \boldsymbol{V}_r \dot{\boldsymbol{V}}_r^{-1} x_a \;\le\; x_a^{\top}\, \kappa_{\min}^{-1} \boldsymbol{V}_r^{-1}\, \boldsymbol{V}_r\, \kappa_{\min}^{-1} \boldsymbol{V}_r^{-1}\, x_a \;=\; \kappa_{\min}^{-2}\, \|x_a\|_{\boldsymbol{V}_r^{-1}}^2. \tag{EC.11.5}$$
Furthermore, if the G-optimal design is considered, we have $\|x_a\|_{\boldsymbol{V}_r^{-1}}^2 \le \frac{d}{n_r}$. Together with equation (EC.11.4), we have
$$\mathbb{P}\big( |\hat\mu_r(a) - \mu_a| \ge \varepsilon_r \big) \le 2\exp\Big( -\frac{\varepsilon_r^2}{2\, \kappa_{\min}^{-2}\, \|x_a\|_{\boldsymbol{V}_r^{-1}}^2} \Big) \le 2\exp\Big( -\frac{\varepsilon_r^2\, \kappa_{\min}^2\, n_r}{2d} \Big). \tag{EC.11.6}$$
Finally, considering the definition of $n_r$ in equation (40), with probability at least $1 - \delta/\big( Kr(r+1) \big)$, we have
$$|\hat\mu_r(a) - \mu_a| \le \varepsilon_r. \tag{EC.11.7}$$
Thus, with the standard result of the G-optimal design, we still have
$$C_{\delta/K}(a) = \varepsilon_r, \tag{EC.11.8}$$
with which the events โฐ 1 and โฐ 2 in Section EC.5.1 hold with a probability of at least 1 โ ๐ฟ .
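The matrix step in (EC.11.5) relies on $\dot{\boldsymbol{V}}_r \succeq \kappa_{\min}\boldsymbol{V}_r$ implying $\dot{\boldsymbol{V}}_r^{-1}\boldsymbol{V}_r\dot{\boldsymbol{V}}_r^{-1} \preceq \kappa_{\min}^{-2}\boldsymbol{V}_r^{-1}$, which a short numerical sketch (ours) confirms for weighted design matrices:

```python
import numpy as np

# If Vdot = sum_i g_i x_i x_i^T with every derivative g_i >= kappa_min and
# V = sum_i x_i x_i^T, then a^T Vdot^{-1} V Vdot^{-1} a <= kappa_min^{-2} a^T V^{-1} a.
rng = np.random.default_rng(2)
n, d, kappa_min = 100, 4, 0.2
X = rng.standard_normal((n, d))
g = rng.uniform(kappa_min, 1.0, size=n)      # link-derivative weights
V = X.T @ X
Vdot = (X * g[:, None]).T @ X                # weighted information matrix
a = rng.standard_normal(d)
lhs = float(a @ np.linalg.solve(Vdot, V @ np.linalg.solve(Vdot, a)))
rhs = float(a @ np.linalg.solve(V, a)) / kappa_min**2
print(lhs, rhs)                              # lhs <= rhs
```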
EC.11.2 Step 2: Bound the Expected Sample Complexity
Lemma EC.1
For $i \in G_{\varepsilon}$, if $r \ge \big\lceil \log_2 \frac{4}{\varepsilon - \Delta_i} \big\rceil$, then we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin G_r] \mid \mathcal{E}_1 \big] = 0$.
Lemma EC.2
For $i \in G_{\varepsilon}^{c}$, if $r \ge \big\lceil \log_2 \frac{4}{\Delta_i - \varepsilon} \big\rceil$, then we have $\mathbb{E}_M\big[ \mathbb{1}[i \notin B_r] \mid \mathcal{E}_1 \big] = 0$.
Lemma EC.3
$R_{\text{GLM}} = r_{\text{upper}} = \max\big\{ \big\lceil \log_2 \frac{4}{\alpha_{\varepsilon}} \big\rceil,\ \big\lceil \log_2 \frac{4}{\beta_{\varepsilon}} \big\rceil \big\}$ is the round by which all the classifications have been finished and the answer is returned under the GLM.
The proofs of Lemmas EC.1, EC.2, and EC.3 closely follow those in Section EC.5.5.
Lemma EC.4
For the expected sample complexity with high probability event โฐ 1 , we have
๐ผ ๐ โ [ ๐ GLM โฃ โฐ 1 ]
โค ๐ โ max โก { 256 โ ๐ ๐ผ ๐ 2 โ ๐ min 2 โ log โก ( 2 โ ๐พ ๐ฟ โ log 2 โก 16 ๐ผ ๐ ) , 256 โ ๐ ๐ฝ ๐ 2 โ ๐ min 2 โ log โก ( 2 โ ๐พ ๐ฟ โ log 2 โก 16 ๐ฝ ๐ ) } + ๐ โ ( ๐ + 1 ) 2 โ ๐ GLM ,
(EC.11.9)
where ๐ is a universal constant, ๐ min is the known constant controlling the first-order derivative of the inverse link function, and ๐ GLM
๐ upper
max โก { โ log 2 โก 4 ๐ผ ๐ โ , โ log 2 โก 4 ๐ฝ ๐ โ } is the round where all the classifications have been finished and the answer is returned under GLM.
Proof EC.5
Proof. We also consider the decomposition of $N$ in equation (EC.5.26), where all expectations are conditioned on the high-probability event $\mathcal{E}_1$, given by
$$\begin{aligned}
\mathbb{E}_M[N_{\text{GLM}} \mid \mathcal{E}_1]
&\le \sum_{r=1}^{\infty} \mathbb{E}_M\big[ \mathbb{1}\big[ G_r \cup B_r \neq [K] \big] \mid \mathcal{E}_1 \big] \sum_{a \in \mathcal{X}(r-1)} T_a(r) \\
&\le \sum_{r=1}^{R_{\text{GLM}}} \Big( \frac{d\, 2^{2r+1}}{\kappa_{\min}^2} \log\Big( \frac{2Kr(r+1)}{\delta} \Big) + \frac{d(d+1)}{2} \Big) \\
&\le \frac{d(d+1)}{2}\, R_{\text{GLM}} + \frac{2d}{\kappa_{\min}^2} \log\Big( \frac{2K}{\delta} \Big) \sum_{r=1}^{R_{\text{GLM}}} 2^{2r} + \frac{4d}{\kappa_{\min}^2} \sum_{r=1}^{R_{\text{GLM}}} 2^{2r} \log(r+1) \\
&\le \frac{4}{\kappa_{\min}^2} \log\Big[ \frac{2K}{\delta} \big( R_{\text{GLM}} + 1 \big) \Big] \sum_{r=1}^{R_{\text{GLM}}} d\, 2^{2r} + \frac{d(d+1)}{2}\, R_{\text{GLM}} \quad \text{(EC.11.10)} \\
&\le c \max\Big\{ \frac{256\, d}{\alpha_{\varepsilon}^2\, \kappa_{\min}^2} \log\Big( \frac{2K}{\delta} \log_2 \frac{16}{\alpha_{\varepsilon}} \Big),\ \frac{256\, d}{\beta_{\varepsilon}^2\, \kappa_{\min}^2} \log\Big( \frac{2K}{\delta} \log_2 \frac{16}{\beta_{\varepsilon}} \Big) \Big\} + \frac{d(d+1)}{2}\, R_{\text{GLM}}, \quad \text{(EC.11.11)}
\end{aligned}$$
where $c$ is a universal constant. Then, letting $H = \min( \alpha_{\varepsilon}, \beta_{\varepsilon} )/16$ denote the minimum gap of the problem instance, we have
$$\mathbb{E}[N_{\text{GLM}} \mid \mathcal{E}] = \mathcal{O}\Big( \frac{d}{\kappa_{\min}^2}\, H^{-2} \log\Big( \frac{K}{\delta} \log_2\big( H^{-2} \big) \Big) + d^2 \log\big( H^{-1} \big) \Big). \tag{EC.11.12}$$
\Halmos

Appendix EC.12 Detailed Settings for Synthetic Experiments
We recap the figure here for better clarity.
[Figure EC.12.1: Illustration on Synthetic Settings. (a) Synthetic I - Adaptive Setting; (b) Synthetic II - Static Setting.]

EC.12.1 Synthetic I - Adaptive Setting.
First, we randomly sample $\tilde m$, representing the number of $\epsilon$-best arms, from a distribution with an expected value of $m$ (used as input for a top-$m$ algorithm). Next, we randomly sample $\tilde\epsilon$, representing the best-arm reward minus $\epsilon$, from a distribution with an expected value of $s$ (used as input for a threshold bandit algorithm). We then assign $\tilde m$ $\epsilon$-best arms with expected rewards uniformly distributed between $\tilde\epsilon + \epsilon$ and $\tilde\epsilon$. Additionally, we assign $( 1.5d - \tilde m )$ arms that are not $\epsilon$-best with expected rewards uniformly distributed between $\tilde\epsilon$ and $\tilde\epsilon - \epsilon$, as illustrated in Figure EC.12.1.
Based on these designed arm rewards, we define the linear model parameter as
$$\beta = \big( \tilde\epsilon + \epsilon,\ \tilde\epsilon + (\tilde m - 1)\epsilon/\tilde m,\ \ldots,\ \tilde\epsilon + \epsilon/\tilde m,\ 0,\ \ldots,\ 0 \big)^{\top}.$$
Arms are the $d$-dimensional canonical basis $e_1, e_2, \ldots, e_d$ and $( 1.5d - \tilde m )$ additional disturbing arms
$$x_k = \Bigg( \frac{\tilde\epsilon - ( 1.5d - \tilde m - k )\, \epsilon / ( 1.5d - \tilde m )}{\tilde\epsilon + \epsilon},\ 0,\ \cdots,\ 0,\ \sqrt{1 - \Big( \frac{\tilde\epsilon - ( 1.5d - \tilde m - k )\, \epsilon / ( 1.5d - \tilde m )}{\tilde\epsilon + \epsilon} \Big)^2}\, \Bigg)^{\top}$$
with $k \in [\, 1.5d - \tilde m \,]$.
In the adaptive setting, pulling one arm can provide information about the distributions of other arms. The optimal policy in this setting should adaptively refine its sampling and stopping strategy based on historical data. This allows the algorithm to focus more on the disturbing arms, making adaptive strategies particularly effective as the algorithm progresses. In our experiments, we set $m = 4$, $s = 1$, with $d = 10$ and $\epsilon \in \{0.1, 0.2, 0.3\}$. A total of six different problem instances are evaluated to compare the performance of the algorithms.
EC.12.2 Synthetic II - Static Setting.
We consider a static synthetic setting, similar to the one proposed by Xu et al. (2018), where arms are represented as $d$-dimensional canonical basis vectors $e_1, e_2, \ldots, e_d$. We set the parameter vector $\beta = ( \Delta, \ldots, \Delta, 0, \ldots, 0 )^{\top}$, where $\tilde m$ elements are $\Delta$ and $d - \tilde m$ elements are $0$. In this setting, $\mathbb{E}[\tilde m] = m$, and only the value $m$ is provided as input to top-$m$ algorithms. Consequently, the true mean values consist of some $\Delta$'s and some $0$'s.
If we set $\epsilon = \Delta/2$, then as $\Delta$ approaches $0$, it becomes difficult to distinguish between the $\epsilon$-best arms and the arms that are not $\epsilon$-best. In the static setting, knowledge of the rewards does not alter the sampling strategy, as all arms must be estimated with equal accuracy to effectively differentiate between them. Therefore, a static policy is optimal in this case, and the goal of this setting is to assess the ability of our algorithm to adapt to such static conditions. In our experiment, we set $m = 4$ with $d \in \{8, 12, 16\}$ and $\Delta = 1$. A total of three different problem instances are evaluated to compare the algorithms.
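The static instance above can be written out in a few lines (a sketch consistent with the description; variable names are ours), which also confirms that exactly $\tilde m$ arms are $\epsilon$-best when $\epsilon = \Delta/2$:

```python
import numpy as np

# Synthetic II: canonical-basis arms, beta = (Delta,...,Delta, 0,...,0);
# with eps = Delta/2 the eps-best set is exactly the m_tilde Delta-arms.
d, m_tilde, Delta = 8, 4, 1.0
eps = Delta / 2
beta = np.concatenate([np.full(m_tilde, Delta), np.zeros(d - m_tilde)])
arms = np.eye(d)
rewards = arms @ beta
eps_best = np.flatnonzero(rewards >= rewards.max() - eps)
print(eps_best)  # [0 1 2 3]
```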
Appendix EC.13 Auxiliary Results
The following lemma shows that matrix inversion reverses the order relation.
Lemma EC.1
(Inversion Reverses Loewner Orders) Let $\boldsymbol{A}, \boldsymbol{B} \in \mathbb{R}^{d \times d}$. Suppose that $\boldsymbol{A} \succeq \boldsymbol{B}$ and $\boldsymbol{B}$ is positive definite; then we have
$$\boldsymbol{A}^{-1} \preceq \boldsymbol{B}^{-1}. \tag{EC.13.1}$$
Proof EC.2
Proof. By definition, to show that $\boldsymbol{B}^{-1} - \boldsymbol{A}^{-1}$ is a positive semi-definite matrix, it suffices to show that $\|\boldsymbol{x}\|_{\boldsymbol{B}^{-1}}^2 - \|\boldsymbol{x}\|_{\boldsymbol{A}^{-1}}^2 \ge 0$ for any $\boldsymbol{x} \in \mathbb{R}^d$. Then, by the Cauchy-Schwarz inequality,
$$\|\boldsymbol{x}\|_{\boldsymbol{A}^{-1}}^2 = \langle \boldsymbol{x},\, \boldsymbol{A}^{-1}\boldsymbol{x} \rangle \le \|\boldsymbol{x}\|_{\boldsymbol{B}^{-1}}\, \|\boldsymbol{A}^{-1}\boldsymbol{x}\|_{\boldsymbol{B}} \le \|\boldsymbol{x}\|_{\boldsymbol{B}^{-1}}\, \|\boldsymbol{A}^{-1}\boldsymbol{x}\|_{\boldsymbol{A}} = \|\boldsymbol{x}\|_{\boldsymbol{B}^{-1}}\, \|\boldsymbol{x}\|_{\boldsymbol{A}^{-1}}. \tag{EC.13.2}$$
Hence $\|\boldsymbol{x}\|_{\boldsymbol{A}^{-1}} \le \|\boldsymbol{x}\|_{\boldsymbol{B}^{-1}}$ for all $\boldsymbol{x}$, which completes the lemma. \Halmos
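A quick numerical check of the lemma (ours): build $\boldsymbol{B} \succ 0$, add a PSD perturbation to get $\boldsymbol{A} \succeq \boldsymbol{B}$, and verify that $\boldsymbol{B}^{-1} - \boldsymbol{A}^{-1}$ is positive semi-definite:

```python
import numpy as np

# A >= B > 0 in the Loewner order implies B^{-1} - A^{-1} is PSD.
rng = np.random.default_rng(3)
d = 5
M = rng.standard_normal((d, d))
B = M @ M.T + np.eye(d)              # positive definite
N = rng.standard_normal((d, d))
A = B + N @ N.T                      # A = B + PSD term, so A >= B
diff = np.linalg.inv(B) - np.linalg.inv(A)
min_eig = float(np.linalg.eigvalsh(diff).min())
print(min_eig)                       # nonnegative up to numerical error
```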
The following lemma establishes an upper bound on the ratio between two optimization problems that incorporate instance-specific information from the bandit setting.
Lemma EC.3
Suppose $\mathcal{X}_I = [K]$, i.e., the entire set of arms is under consideration. For any arm $a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}$, with $y_{1,a} \triangleq x_1 - x_a$, we have
$$\frac{ \min_{\lambda \in \mathcal{P}_K} \max_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_{\boldsymbol{V}(\lambda)^{-1}}^2 }{ \min_{\lambda \in \mathcal{P}_K} \min_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_{\boldsymbol{V}(\lambda)^{-1}}^2 } \le \frac{d\, \delta_1}{u_{\mathcal{A}}^2\, \delta_2}. \tag{EC.13.3}$$
Proof EC.4
Proof. For any arm $a \in [K]$, from the perspective of a geometric quantity, let $\operatorname{conv}(\mathcal{X} \cup -\mathcal{X})$ denote the convex hull of the symmetric set $\mathcal{X} \cup -\mathcal{X}$. Then, for any set $\mathcal{A} \subseteq \mathbb{R}^d$, define the gauge of $\mathcal{A}$ as
$$u_{\mathcal{A}} \triangleq \max\big\{ u > 0 :\ u\mathcal{A} \subseteq \operatorname{conv}(\mathcal{X} \cup -\mathcal{X}) \big\}. \tag{EC.13.4}$$
We then provide a natural upper bound for $\min_{\lambda \in \mathcal{P}_K} \max_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_{\boldsymbol{V}(\lambda)^{-1}}^2$, given by
$$\begin{aligned}
\min_{\lambda \in \mathcal{P}_K} \max_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_{\boldsymbol{V}(\lambda)^{-1}}^2
&\le \min_{\lambda \in \mathcal{P}_K} \max_{y \in \mathcal{A}(\mathcal{X}_I)} \| y \|_{\boldsymbol{V}(\lambda)^{-1}}^2 \\
&= \frac{1}{u_{\mathcal{A}}^2} \min_{\lambda \in \mathcal{P}_K} \max_{y \in \mathcal{A}(\mathcal{X}_I)} \| u_{\mathcal{A}}\, y \|_{\boldsymbol{V}(\lambda)^{-1}}^2 \\
&\le \frac{1}{u_{\mathcal{A}}^2} \min_{\lambda \in \mathcal{P}_K} \max_{x \in \operatorname{conv}(\mathcal{X} \cup -\mathcal{X})} \| x \|_{\boldsymbol{V}(\lambda)^{-1}}^2 \\
&= \frac{1}{u_{\mathcal{A}}^2} \min_{\lambda \in \mathcal{P}_K} \max_{x \in \mathcal{X}_I} \| x \|_{\boldsymbol{V}(\lambda)^{-1}}^2 \;\le\; \frac{d}{u_{\mathcal{A}}^2},
\end{aligned} \tag{EC.13.5}$$
where the third line follows from the fact that the maximum value of a convex function on a convex set must occur at a vertex. With the Kiefer-Wolfowitz theorem for the G-optimal design, the last inequality is achieved.
Furthermore, for any arm $a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}$, we have
$$\begin{aligned}
\min_{\lambda \in \mathcal{P}_K} \min_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_{\boldsymbol{V}(\lambda)^{-1}}^2
&\ge \min_{\lambda \in \mathcal{P}_K} \min_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \operatorname{eig}_{\min}\big( \boldsymbol{V}(\lambda)^{-1} \big)\, \| y_{1,a} \|_2^2 \\
&= \min_{\lambda \in \mathcal{P}_K} \min_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \frac{1}{\operatorname{eig}_{\max}\big( \boldsymbol{V}(\lambda) \big)}\, \| y_{1,a} \|_2^2 \\
&\ge \frac{1}{\max_{x \in \mathcal{X}_I} \| x \|_2^2}\; \min_{\lambda \in \mathcal{P}_K} \min_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_2^2,
\end{aligned} \tag{EC.13.6}$$
where $\operatorname{eig}_{\max}(\cdot)$ and $\operatorname{eig}_{\min}(\cdot)$ are respectively the largest and smallest eigenvalues of a matrix. The first line follows from the Rayleigh quotient and the Rayleigh theorem. The last line is derived from the relationship $\operatorname{eig}_{\max}\big( \boldsymbol{V}(\lambda) \big) \le \max_{x \in \mathcal{X}_I} \| x \|_2^2$. Recalling the assumption in Theorem 4.2 that $\min_{a \in G_{\varepsilon} \setminus \{1\}} \| x_1 - x_a \|_2^2 \ge \delta_2$ and the assumption in Section 2 that $\| x_a \|_2^2 \le \delta_1$ for all $a \in [K]$, we have
$$\min_{\lambda \in \mathcal{P}_K} \min_{a \in \mathcal{X}_I \cap G_{\varepsilon} \setminus \{1\}} \| y_{1,a} \|_{\boldsymbol{V}(\lambda)^{-1}}^2 \ge \frac{\delta_2}{\delta_1}. \tag{EC.13.7}$$
Finally, combining inequalities (EC.13.5) and (EC.13.7) completes the lemma. \Halmos
Lemma EC.5 (Kiefer and Wolfowitz (1960))
If the arm vectors $x \in \mathcal{X}$ span $\mathbb{R}^d$, then for a probability distribution $\pi^* \in \mathcal{P}(\mathcal{X})$, the following statements are equivalent:
1. $\pi^*$ minimizes the function $g(\pi) = \max_{a \in \mathcal{X}} \| a \|_{\boldsymbol{V}(\pi)^{-1}}^2$.
2. $\pi^*$ maximizes the function $f(\pi) = \log\det \boldsymbol{V}(\pi)$.
3. $g(\pi^*) = d$.
Additionally, there exists a minimizer $\pi^*$ of $g(\pi)$ such that the size of its support, $|\operatorname{Supp}(\pi^*)|$, is at most $d(d+1)/2$.
Appendix References
Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems 24.
Abe N, Long PM (1999) Associative reinforcement learning using linear probabilistic concepts. ICML, 3-11 (Citeseer).
Ahn D, Shin D (2020) Ordinal optimization with generalized linear model. 2020 Winter Simulation Conference (WSC), 3008-3019 (IEEE).
Ahn D, Shin D, Zeevi A (2024) Feature misspecification in sequential learning problems. Management Science.
Azizi MJ, Kveton B, Ghavamzadeh M (2021a) Fixed-budget best-arm identification in contextual bandits: A static-adaptive algorithm. CoRR abs/2106.04763.
Azizi MJ, Kveton B, Ghavamzadeh M (2021b) Fixed-budget best-arm identification in structured bandits. arXiv preprint arXiv:2106.04763.
Chaloner K, Verdinelli I (1995) Bayesian experimental design: A review. Statistical Science 273-304.
Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems 24.
Fiez T, Jain L, Jamieson KG, Ratliff L (2019) Sequential experimental design for transductive linear bandits. Advances in Neural Information Processing Systems 32.
Filippi S, Cappé O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems 23.
Gabillon V, Ghavamzadeh M, Lazaric A (2012) Best arm identification: A unified approach to fixed budget and fixed confidence. Advances in Neural Information Processing Systems 25.
Garivier A, Kaufmann E (2016) Optimal best arm identification with fixed confidence. Conference on Learning Theory, 998-1027 (PMLR).
Ghosh A, Chowdhury SR, Gopalan A (2017) Misspecified linear bandits. Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
Hoffman M, Shahriari B, Freitas N (2014) On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. Artificial Intelligence and Statistics, 365-374 (PMLR).
Kaufmann E, Cappé O, Garivier A (2016) On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research 17:1-42.
Kaufmann E, Koolen WM (2021) Mixture martingales revisited with applications to sequential tests and confidence intervals. Journal of Machine Learning Research 22(1):11140-11183.
Kiefer J, Wolfowitz J (1960) The equivalence of two extremum problems. Canadian Journal of Mathematics 12:363-366.
Kveton B, Zaheer M, Szepesvari C, Li L, Ghavamzadeh M, Boutilier C (2023) Randomized exploration in generalized linear bandits.
Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press).
Lattimore T, Szepesvari C, Weisz G (2020) Learning with good feature representations in bandits and in RL with a generative model.
Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. Proceedings of the 19th International Conference on World Wide Web, 661-670, WWW '10 (New York, NY, USA: Association for Computing Machinery).
McCullagh P (2019) Generalized Linear Models (Routledge).
Qin C, You W (2025) Dual-directed algorithm design for efficient pure exploration. Operations Research.
Réda C, Tirinzoni A, Degenne R (2021) Dealing with misspecification in fixed-confidence linear top-m identification. Advances in Neural Information Processing Systems 34:25489-25501.
Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z, et al. (2018) A tutorial on Thompson sampling. Foundations and Trends in Machine Learning 11(1):1-96.
Soare M, Lazaric A, Munos R (2014) Best-arm identification in linear bandits. Advances in Neural Information Processing Systems 27.
Wang PA, Tzeng RC, Proutiere A (2021) Fast pure exploration via Frank-Wolfe. Advances in Neural Information Processing Systems 34:5810-5821.
Xu L, Honda J, Sugiyama M (2018) A fully adaptive algorithm for pure exploration in linear bandits. International Conference on Artificial Intelligence and Statistics, 843-851 (PMLR).
Yang J, Tan VY (2021) Minimax optimal fixed-budget best arm identification in linear bandits. arXiv preprint arXiv:2105.13017.