Title: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning

URL Source: https://arxiv.org/html/2301.13287

Milo: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning

Krishnateja Killamsetty1,2,*, Alexandre V. Evfimievski2, Tejaswini Pedapati2, Kiran Kate2, Lucian Popa2, Rishabh Iyer1
1 The University of Texas at Dallas  2 IBM Research
{krishnateja.killamsetty, rishabh.iyer}@utdallas.edu, {evfimi, tejaswinip, kakate, lpopa}@us.ibm.com
* A portion of this work was completed while Krishnateja was an intern at IBM Research.

Abstract

Training deep networks and tuning hyperparameters on large datasets is computationally intensive. One of the primary research directions for efficient training is to reduce training costs by selecting well-generalizable subsets of training data. Compared to simple adaptive random subset selection baselines, existing intelligent subset selection approaches are not competitive due to the time-consuming subset selection step, which involves computing model-dependent gradients and feature embeddings and applying greedy maximization of submodular objectives. Our key insight is that removing the reliance on downstream model parameters enables subset selection as a pre-processing step and enables one to train multiple models at no additional cost. In this work, we propose Milo, a model-agnostic subset selection framework that decouples subset selection from model training while enabling superior model convergence and performance by using an easy-to-hard curriculum. Our empirical results indicate that Milo can train models 3×–10× faster and tune hyperparameters 20×–75× faster than full-dataset training or tuning without compromising performance.

1 Introduction

Deep learning has achieved remarkable success in a multitude of machine learning tasks in recent years, including natural language processing, computer vision, and speech recognition. This success is partially due to the availability of massive training datasets and the capacity to train large-scale neural networks. However, training deep models on extensive datasets is computationally demanding, incurring significant financial costs and generating substantial CO2 emissions (Strubell et al., 2019; Schwartz et al., 2020). Bartoldson et al. (2022) overview several research trajectories aimed at enhancing model convergence and reducing training time and costs, including data subset selection, curriculum learning, model architecture improvements, and optimization algorithm enhancements. Our work specifically focuses on selecting useful, generalizable data subsets for efficient deep neural network training.

Recent research (Toneva et al., 2019; Birodkar et al., 2019; Katharopoulos and Fleuret, 2018) suggests that many current training datasets are redundant, indicating that a non-redundant, informative subset could achieve comparable performance to training on the full dataset. To identify such informative samples, metrics such as prediction uncertainty (Coleman et al., 2020), prediction flips (Toneva et al., 2019; Zhou et al., 2020), loss (Loshchilov and Hutter, 2015), or gradient/gradient-norm (Katharopoulos and Fleuret, 2018; Mirzasoleiman et al., 2020; Killamsetty et al., 2021c, b, d) have been applied. These metrics, calculated using either the downstream machine learning model or a lightweight surrogate (Coleman et al., 2020), require a fully converged model, a requirement that runs counter to the goal of efficient training. To address this, existing subset selection algorithms (Mirzasoleiman et al., 2020; Killamsetty et al., 2021c, b, d, 2022; Pooladzandi et al., 2022) for efficient learning utilize the downstream model during training for heuristic computation and periodically update the subset as the model-dependent metrics for each sample evolve.

Figure 1: Sub-figures (a) and (b) display the convergence of the ResNet18 model trained on a 10% subset selected using Adaptive-Random, CraigPB, and GradMatchPB on the CIFAR100 dataset, with respect to epochs and time, respectively. Each strategy selects a new subset every epoch.

Drawbacks of Model-Dependent Subset Selection: Despite the theoretical advantages of existing subset selection strategies (Killamsetty et al., 2021c, b; Mirzasoleiman et al., 2020), they often fall short in computational efficiency compared to adaptive random subset selection, which selects random subsets periodically. This is primarily because traditional approaches rely on downstream models and typically require the calculation of sample metrics, such as gradients, before each subset selection step. Furthermore, these computationally demanding subset selection steps occur during model training. For instance, Figure 1 compares the convergence rate of the ResNet18 model on the CIFAR100 dataset, in terms of both time and epochs. Here, the model uses 10% subsets chosen every epoch by GradMatchPB (Killamsetty et al., 2021b), a state-of-the-art data subset selection strategy for efficient training, CraigPB (Mirzasoleiman et al., 2020), and Adaptive-Random (where a new 10% subset is randomly selected periodically). A new subset is selected every epoch to demonstrate the maximum performance attainable by GradMatchPB and CraigPB. The results indicate that GradMatchPB achieves faster epoch convergence than both Adaptive-Random and CraigPB when selecting a new subset each epoch. However, due to the necessity of a computationally intensive subset selection step every epoch, both GradMatchPB and CraigPB are considerably inefficient in terms of training time. In their respective studies, Killamsetty et al. (2021b) and Mirzasoleiman et al. (2020) suggested selecting a new subset every 𝑅 epochs to enhance training efficiency; however, this comes at the cost of the model’s convergence rate. Lastly, model-dependent subset selection requires a computationally intensive subset selection step each time a new model is trained.

Model-Agnostic Selection: Our primary insight is that a model-agnostic subset selection framework can circumvent the computationally intensive subset selection steps during model training by conducting subset selection in the preprocessing phase. By pre-selecting subsets and storing them as metadata with each dataset, we can train numerous models without incurring additional costs, effectively spreading out the expense of subset selection. In this work, we endeavor to answer the following question: Is it possible to develop a model-agnostic subset selection method that selects new subsets in a minimal amount of time, yet achieves superior model convergence without significant compromise in test accuracy or generalization performance?

1.1 Contributions

Milo Framework: In this study, we introduce Milo, a model-agnostic subset selection framework designed for efficient model training and tuning. Milo employs submodular measures (Fujishige, 2005; Kaushal et al., 2021), which capture higher-order interactions between data samples for subset selection. We utilize pre-trained large language models (Qiu et al., 2020) and pre-trained vision transformers (Khan et al., 2022) as feature encoders due to their zero-shot feature encoding capabilities. These encoders compute the sample metrics in a nominal amount of time, and this computation is independent of the downstream model, rendering Milo model-agnostic. Figure 3 provides a visual representation of Milo for model training, comprising two steps: a) A pre-processing step that involves the selection of multiple subsets from the training dataset using "Stochastic-Greedy Exploration (SGE)" (refer to Section 3.1.1), and the construction of a probability distribution over the entire dataset for subset sampling through "Weighted Random Exploration (WRE)" (refer to Section 3.1.2); b) A model training scheme that involves training the model on a curriculum of easy-to-hard subsets (refer to Section 3.1.3) using the subsets selected and the probability distribution constructed in the pre-processing step. We also present a class-wise partitioning trick to significantly minimize the memory footprint of Milo (refer to Section 3.2).

Figure 2: Milo’s tradeoff between speedup and accuracy degradation compared to full-data training and tuning: (a) efficient training; (b) efficient tuning. For model training, speedups of 3× to 10× were achieved with less than a 1.5% accuracy drop, and for hyper-parameter tuning, Milo achieves speedups of 20× to 75× with less than a 0.15% accuracy drop.

Effectiveness of Milo: Through extensive experiments on multiple real-world datasets, we empirically demonstrate the effectiveness of the Milo framework for efficient training and hyperparameter tuning. In Figure 2, we provide a summary of the speedup achieved by Milo compared to full data training and tuning, along with the corresponding relative performance. Our results show that Milo can train models 3× to 10× faster and tune hyperparameters 20× to 75× faster, with minimal loss in performance. Furthermore, we consistently outperform the considered subset selection baselines. Additionally, Milo exhibits significantly faster initial model convergence and maintains faster convergence throughout training compared to other baselines. These qualities make it highly effective for applications such as hyperparameter tuning and neural architecture search, where the ability to distinguish good models from bad models early in the tuning process is crucial for enhancing efficiency (Li et al., 2017, 2020).

Related Work:

In recent years, data-subset selection strategies have achieved success in various machine learning applications such as speech recognition (Wei et al., 2014b, a), machine translation (Kirchhoff and Bilmes, 2014), active learning (Sener and Savarese, 2018; Ash et al., 2020; Kothawade et al., 2021), hyper-parameter tuning (Killamsetty et al., 2022), continual learning (Tiwari et al., 2021), domain adaptation (Karanam et al., 2022), and computer vision (Kaushal et al., 2019). A recent empirical study (Birodkar et al., 2019) demonstrated that datasets frequently contain semantic redundancies. These redundancies can be removed during training without affecting performance. In response, data pruning methods (Toneva et al., 2019; Paul et al., 2021; Sorscher et al., 2022) have developed principled selection criteria for pruning redundant samples before model training, with minimal loss in performance. However, both existing data pruning and compute-efficient learning approaches are model-dependent, resulting in them being computationally expensive. In contrast, our method is model-agnostic and capable of identifying high-quality subsets by intelligent data exploration. Furthermore, our approach can achieve superior model convergence by utilizing an easy-to-hard curriculum, transitioning from representative to diverse subsets during training.

2 Preliminaries

Notation: We briefly describe the notation used throughout the remainder of this work. Denote the training dataset as $\mathcal{D} = \{(x_j, y_j)\}_{j=1}^{m}$ with $m$ data points. Let $\mathcal{S}$ be the subset of the training dataset on which the downstream model is trained. Let the feature encoder be denoted as $g: X \rightarrow Z$, which transforms the input from the feature space $X$ to an embedding space $Z$. Let the downstream model parameters be characterized by $\theta$. We subscript the changing variables, such as model parameters $\theta$ and subset $\mathcal{S}$, with the timestep $t$ to denote their specific values at that timestep. Finally, let the subset size be denoted as $k$.

Subset Selection Formulation: The standard subset selection problem can be formulated as the maximization of a set function $f$ subject to a budget constraint $k$:

$$\mathcal{S}^{*} = \underset{\mathcal{S}\,:\,\mathcal{S} \subseteq \mathcal{D},\; |\mathcal{S}| \leq k}{\arg\max}\; f(\mathcal{S}) \qquad (1)$$

Submodular Functions: Let $f: 2^{\mathcal{D}} \rightarrow \mathbb{R}$ be a set function that maps sets to real values; for a set $\mathcal{A} \subseteq \mathcal{D}$, $f(\mathcal{A})$ gives a real-valued score for $\mathcal{A}$. Formally, a function $f$ is submodular (Fujishige, 2005) if the inequality $f(\mathcal{A} \cup \{x\}) - f(\mathcal{A}) \geq f(\mathcal{B} \cup \{x\}) - f(\mathcal{B})$ holds for all $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{D}$ and $x \in \mathcal{D} \setminus \mathcal{B}$. The submodular gain $f(x \mid \mathcal{D})$ is defined as the difference between the set function value of $\mathcal{D} \cup \{x\}$ and that of $\mathcal{D}$, i.e., $f(x \mid \mathcal{D}) = f(\mathcal{D} \cup \{x\}) - f(\mathcal{D})$. A function $f$ is considered monotone if $f(\mathcal{A}) \leq f(\mathcal{B})$ for any $\mathcal{A} \subseteq \mathcal{B}$. Although most general discrete maximization problems are NP-hard, they can be approximately solved if the set function $f$ being maximized is submodular. Furthermore, if $f$ is monotone submodular, the optimization problem in Equation (1) can be approximated to a factor of $1 - \frac{1}{e}$ (Nemhauser et al., 1978) when maximized under a cardinality constraint using a simple greedy algorithm.
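To make the greedy guarantee concrete, the sketch below runs plain greedy maximization of the facility-location function $f(\mathcal{S}) = \sum_{i \in \mathcal{D}} \max_{j \in \mathcal{S}} K_{ij}$, a standard monotone submodular measure, over a precomputed similarity kernel `K`. This is an illustrative stand-in, not the paper's implementation (which relies on the SUBMODLIB library):

```python
def facility_location_gain(K, covered, j):
    """Marginal gain f(S + {j}) - f(S), where covered[i] = max_{s in S} K[i][s]."""
    return sum(max(K[i][j] - covered[i], 0.0) for i in range(len(K)))

def greedy_max(K, k):
    """Plain greedy maximization under a cardinality constraint |S| <= k.

    For monotone submodular f, the result is within a (1 - 1/e) factor
    of the optimum (Nemhauser et al., 1978).
    """
    m = len(K)
    covered = [0.0] * m   # best similarity of each point to the subset so far
    S = []
    for _ in range(k):
        best_j, best_gain = None, -1.0
        for j in range(m):
            if j in S:
                continue
            gain = facility_location_gain(K, covered, j)
            if gain > best_gain:
                best_j, best_gain = j, gain
        S.append(best_j)
        covered = [max(covered[i], K[i][best_j]) for i in range(m)]
    return S
```

On a kernel with two tight clusters, the first two greedy picks land in different clusters, since a second within-cluster point adds almost no coverage; caching `covered` keeps each gain evaluation linear in $m$.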

Figure 3: Block diagram of Milo for model training.

3 Development of Milo

In this section, we discuss the theoretical and empirical considerations that inspired the development of Milo. Our primary objective for this new subset selection approach is to enable the selection of a new subset in a negligible amount of time while ensuring superior model convergence and performance within a specified period.

Choice of Feature Encoders:

The standard subset selection problem as outlined in Equation (1) involves maximizing the set function 𝑓 subject to a budget constraint. To capture interactions between data samples, most set functions 𝑓 necessitate the computation of a similarity kernel 𝒦  (Kaushal et al., 2021). This computation requires informative encodings of samples. Our initial design choice was to utilize pre-existing pre-trained language models or vision transformers as feature encoders 𝑔 . These models enhance contextualization, expressiveness, and generalizability, promoting extrapolation in zero-shot settings (Qiu et al., 2020; Khan et al., 2022). However, this prompts questions regarding their ability to effectively generalize to specialized domain datasets. We address this issue empirically in Appendix H.1, which shows that the pre-trained transformer models used in our study can generalize effectively to such domains. In cases where a pre-trained model underperforms on a specific dataset—determined by linear probing accuracies—one can fine-tune the pre-trained transformer model or train a smaller proxy model (Coleman et al., 2020). Although this adds to pre-processing costs, it boosts performance, and pre-processing only needs to be done once per dataset. The efficacy of using a proxy model is validated in Appendix H.2. By using a pre-trained transformer model or a proxy model, the need for downstream machine-learning models to compute sample representations is eliminated. Further, the effectiveness of various language models or vision transformers as feature encoders for subset selection is assessed in Appendix I.1.
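For illustration, the similarity kernel $\mathcal{K}$ consumed by the set functions can be built from encoder embeddings with cosine similarity; in Milo the embeddings would come from a pre-trained transformer (e.g., Dino-ViT or a sentence-transformers model), but the kernel computation itself reduces to this dependency-free sketch:

```python
import math

def cosine_kernel(Z):
    """Dense similarity kernel K[i][j] = cos(z_i, z_j) from embeddings Z."""
    norms = [math.sqrt(sum(v * v for v in z)) for z in Z]
    m = len(Z)
    return [[sum(a * b for a, b in zip(Z[i], Z[j])) / (norms[i] * norms[j])
             for j in range(m)]
            for i in range(m)]
```

Since this kernel depends only on the frozen encoder, it can be computed once per dataset in the pre-processing step and reused across every downstream model.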

Figure 4: ResNet18 model performance on 10% (a) and 30% (b) subsets of the Cifar100 dataset, selected by maximizing diverse set functions.

Optimal Subset Composition:

The immediate question that arises is the choice of the appropriate set function for subset selection. Essentially, we need to pinpoint the key characteristics a subset should possess for optimal model performance. To identify these characteristics, we assess the empirical performance of various set functions that capture the following subset qualities: Representation: This attribute measures the subset’s effectiveness in representing the entire dataset. For instance, it ensures that the subset includes more samples from dense regions than sparse ones, typically resulting in "easier" data samples. We used facility location and graph-cut set functions to model representation; Diversity: This characteristic evaluates the degree of distinctiveness among the data samples within the subset. For instance, it ensures that the subset covers highly diverse samples, which are generally "harder" data samples. To model diversity, we considered the disparity-sum and disparity-min set functions. We empirically support the claim that representation-based functions select easier samples, and diversity-based functions select harder samples by presenting the mean EL2N scores (Paul et al., 2021) for the subsets chosen by their respective functions in Appendix E. All set functions considered, except disparity-sum and disparity-min, are submodular. Despite the non-submodular nature of disparity-min and disparity-sum functions, it has been established that they can be effectively maximized using the conventional greedy approach, leading to 1/4 and 1/2 approximations respectively (Dasgupta et al., 2013). Therefore, we have deemed it appropriate to include these functions in our evaluation. We provide instantiations of the considered set functions in Appendix D. Figure 4 demonstrates the performance of a ResNet18 model trained on 10% and 30% fixed subsets of the Cifar100 dataset, respectively, by maximizing different set functions. 
The results show that when using larger subset sizes of 30% or more, subsets selected with diversity functions disparity-min and disparity-sum lead to superior model performance. Conversely, for smaller subset sizes of 10% or less, subsets chosen with representation functions graph-cut and facility location yield better model performance. This observation is consistent with the recent findings of Sorscher et al. (2022).
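The paper defers its exact instantiations to Appendix D; the sketch below uses the standard forms of these measures from the submodular-functions literature (Kaushal et al., 2021), written over a similarity kernel `K` with entries in $[0, 1]$ (the graph-cut weight `lam` is an assumed default, not a value from the paper):

```python
def facility_location(K, S):
    """Representation: how well S covers every point in the dataset."""
    return sum(max(K[i][j] for j in S) for i in range(len(K)))

def graph_cut(K, S, lam=1.0):
    """Representation: dataset-to-subset similarity minus internal redundancy."""
    total = sum(K[i][j] for i in range(len(K)) for j in S)
    internal = sum(K[i][j] for i in S for j in S)
    return total - lam * internal

def disparity_sum(K, S):
    """Diversity: total pairwise dissimilarity inside S."""
    return sum(1.0 - K[i][j] for i in S for j in S if i < j)

def disparity_min(K, S):
    """Diversity: dissimilarity of the closest pair inside S."""
    return min(1.0 - K[i][j] for i in S for j in S if i < j)
```

On a two-cluster kernel, both families prefer a one-point-per-cluster subset over two points from the same cluster, but for different reasons: facility location and graph-cut for coverage, the disparity functions for spread.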

Issue with using fixed data subsets:

If the goal is to achieve the best performance within a certain timeframe, the model must also engage in data exploration instead of relying solely on a fixed data subset. One significant disadvantage of training models on fixed data subsets is the requirement for large subsets, approximately 70% or more, to achieve accuracy comparable to full data training, which leads to extended training times. For instance, when trained on a fixed 10% random subset of the Cifar10 dataset for 200 epochs, the ResNet101 model only reached a test accuracy of 66.9%. Conversely, the same model achieved a test accuracy of 87.54% when trained on an adaptive 10% subset of Cifar10 for the same number of epochs, with a new subset randomly selected after each epoch. While random exploration is simple and empirically effective, it may not be the most efficient strategy due to potential redundancy in the randomly selected subsets. Therefore, it is vital to develop a strategy that balances Subset Exploration and Subset Exploitation.

3.1 Informative Data Exploration

To balance exploration and exploitation, models should train on small, informative subsets that encourage exploration. A performant subset selection approach would combine training samples from all parts of the dataset without undue bias. An ideal formulation for informative data subset exploration is as follows:

$$P(\mathcal{S}) \propto \exp\big(\beta \cdot f(\mathcal{S})\big) \quad \text{subject to } |\mathcal{S}| = k \qquad (2)$$

In Equation (2), $P(\mathcal{S})$ represents the probability of sampling subset $\mathcal{S}$, $f$ is our set quality measure, and $\beta > 0$ is the inverse temperature (Gotovos et al., 2015). This formula ensures that higher-quality subsets are explored more frequently, without bias in the points sampled. Ideally, a new subset is selected every epoch from this probability distribution. However, constructing a simple sampler according to Equation (2) requires a combinatorial number of set function evaluations. Although Gotovos et al. (2015) proposed a Gibbs sampling approach for marginal inference with a polynomial mixing time in $|\mathcal{D}|$, enforcing $|\mathcal{S}| = k$ requires modifying it to swap data points at each step, which leads to considerable mixing times, especially when the current subset has a close-to-optimal $f(\mathcal{S})$ value. We propose two scalable alternatives for data exploration with varying exploration-to-exploitation ratios and leave the extension of Gotovos et al. (2015) to future work.

3.1.1 Stochastic-Greedy Exploration (SGE)

The first method we use for data exploration involves identifying multiple subsets with high function values. We then train the downstream model on these selected subsets, changing the subsets every 𝑅 epochs. This approach emphasizes exploitation over exploration, as it primarily focuses on subsets with high function values. To select 𝑛 subsets 𝒮 1 , 𝒮 2 , ⋯ , 𝒮 𝑛 from the dataset 𝒟 with high set function values, we employ the stochastic greedy algorithm (Mirzasoleiman et al., 2015) to maximize the set function 𝑓 and repeat the maximization 𝑛 times.

$$\mathcal{S}_1, \mathcal{S}_2, \cdots, \mathcal{S}_n \leftarrow \operatorname{SGE}(f, \mathcal{D}, k) \qquad (3)$$

The randomness of the stochastic greedy algorithm allows us to choose a different subset every time, each with an approximation guarantee of $1 - \frac{1}{e} - \epsilon$ in expectation. Due to space constraints, detailed pseudocode for SGE is provided in Algorithm 2 in Appendix C. In our experiments, we use an $\epsilon$ value of 0.01 for stochastic greedy maximization.
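A minimal sketch of one stochastic-greedy run (the paper's Algorithm 2 in Appendix C is the authoritative version), instantiated here with the facility-location function: each step scores marginal gains only on a random candidate sample of size $\frac{m}{k}\log\frac{1}{\epsilon}$, which makes repeated runs cheap and naturally non-deterministic:

```python
import math
import random

def stochastic_greedy(K, k, eps=0.01, rng=random):
    """One stochastic-greedy run for facility location (Mirzasoleiman et
    al., 2015); repeating it n times yields the n distinct near-optimal
    subsets that SGE pre-selects."""
    m = len(K)
    sample_size = min(m, max(1, int(m / k * math.log(1.0 / eps))))
    covered = [0.0] * m   # best similarity of each point to the subset so far
    S = []
    for _ in range(k):
        remaining = [j for j in range(m) if j not in S]
        candidates = rng.sample(remaining, min(sample_size, len(remaining)))
        best_j = max(candidates,
                     key=lambda j: sum(max(K[i][j] - covered[i], 0.0)
                                       for i in range(m)))
        S.append(best_j)
        covered = [max(covered[i], K[i][best_j]) for i in range(m)]
    return S
```

Seeding the `rng` differently for each of the $n$ runs produces the distinct subsets $\mathcal{S}_1, \ldots, \mathcal{S}_n$ of Equation (3).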

3.1.2 Weighted Random Exploration (WRE)

In the WRE approach, we explore data by creating a multinomial probability distribution 𝒑 across the entire dataset 𝒟 . Every 𝑅 epochs, we sample a subset 𝒮 of size 𝑘 from this distribution, without replacement. We employ a weighted random sampling method (Efraimidis and Spirakis, 2016), whereby each data sample is assigned a weight equal to its normalized set function gain during greedy maximization. Specifically, we greedily maximize the set function 𝑓 over the whole dataset 𝒟 and designate the set function gains associated with each data sample 𝑒 at its point of greedy inclusion as its importance score 𝑔 𝑒 . Here, if 𝒮 is the subset selected thus far and 𝑒 is the next optimal data sample to be added greedily, the set function gain value of 𝑒 is computed as 𝑓 ⁢ ( 𝒮 ∪ 𝑒 ) − 𝑓 ⁢ ( 𝒮 ) .

$$\boldsymbol{g} = [g_1, g_2, \cdots, g_m] \leftarrow \operatorname{GreedySampleImportance}(f, \mathcal{D}) \qquad (4)$$

As illustrated in Equation (5), we normalize the importance scores 𝒈 and construct the probability distribution 𝒑 over the training set 𝒟 by employing the second-order Taylor-Softmax function (de Brébisson and Vincent, 2016) over the importance scores.

$$\boldsymbol{p} = \operatorname{Taylor\text{-}Softmax}(\boldsymbol{g}) = \left[ \frac{1 + g_i + 0.5\, g_i^2}{\sum_{j=1}^{m} \left( 1 + g_j + 0.5\, g_j^2 \right)} \right]_{i=1}^{m} \qquad (5)$$

When 𝑓 is submodular, the diminishing returns property of submodular functions means elements chosen in early iterations have greater set function gain than those selected later. The probability distribution 𝒑 generated assigns higher probability to more informative samples. Depending on the set function, this could mean better representativeness or diversity. Importantly, sampling from 𝒑 allows the exploration of less informative samples while frequently selecting more informative ones. Once 𝒑 is established, selecting new subsets from the multinomial distribution is as quick as random subset selection. We use 𝒑 to sample new subsets of size 𝑘 every 𝑅 epochs (without replacement). For the detailed pseudocode of the greedy sample importance estimation, please refer to Algorithm 3 in Appendix C, as space constraints prevent inclusion here.
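The two WRE ingredients can be sketched as follows; the importance scores `g` are assumed to come from a greedy pass such as Algorithm 3 (not reproduced here), and the sampler uses the Efraimidis–Spirakis weighted-key scheme for sampling without replacement:

```python
import random

def taylor_softmax(g):
    """Second-order Taylor-Softmax (Equation 5): positive, normalized weights."""
    w = [1.0 + gi + 0.5 * gi * gi for gi in g]
    z = sum(w)
    return [wi / z for wi in w]

def wre_sample(p, k, rng=random):
    """Sample a size-k subset without replacement, proportional to p
    (Efraimidis and Spirakis, 2016: keep the k largest keys u^(1/p_i))."""
    keys = [(rng.random() ** (1.0 / pi), i) for i, pi in enumerate(p)]
    return [i for _, i in sorted(keys, reverse=True)[:k]]
```

Because $1 + g + 0.5g^2 = 0.5(g+1)^2 + 0.5 > 0$, every sample keeps a nonzero probability, so less informative points are still occasionally explored while high-gain points dominate.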

Figure 5: (a) Comparison of data exploration approaches: performance of the ResNet18 model on various subset sizes and set functions using SGE and WRE on Cifar100. (b) ResNet18 convergence on a 5% Cifar100 subset, employing SGE with Graph-Cut and WRE with the Disparity-Min function.

SGE vs. WRE: Sub-figure 5(a) demonstrates that training the ResNet18 model using the Weighted Random Exploration (WRE) approach, which emphasizes exploration, yields better performance than both the Stochastic-Greedy Exploration (SGE) approach, which prioritizes exploitation, and fixed data subsets (pure exploitation). Furthermore, using Disparity-Min as a set function outperforms other set functions in both exploration strategies. However, we made an important empirical observation: SGE with the Graph-Cut function achieves superior initial model convergence in the early iterations, as shown in sub-figure 5(b). This sub-figure highlights the early-stage convergence benefit of SGE with Graph-Cut, which focuses on easier samples, compared to WRE with Disparity-Min (focusing on more challenging samples) and both WRE with Graph-Cut and SGE with Facility Location (which also target easier samples) on the Cifar100 dataset. Early convergence is particularly beneficial in hyper-parameter tuning or neural-architecture search, where early-stage performance assessment is critical. This trend of superior early convergence with SGE using Graph-Cut is consistent across multiple datasets and subset sizes, as shown in Figures 12 and 13 in the Appendix. In Appendix sections I.3 and I.4, we explain in detail why SGE with Graph-Cut results in superior initial convergence compared to SGE with Facility Location and WRE with Graph-Cut, even though all these approaches aim to select easy or representative samples.

3.1.3 Developing a Curriculum of Easy-to-Hard Subsets:

Sub-figure 5(b) suggests that for optimal model convergence with Milo throughout training, we should construct an easy-to-hard curriculum, building on the empirical success of such approaches (Lee and Grauman, 2011; Hacohen and Weinshall, 2019; Zhou et al., 2020). This is accomplished by initially using SGE with graph-cut, followed by WRE with disparity-min in later iterations. The curriculum design involves training the model for a fraction $\kappa$ of the total number of epochs with SGE and the graph-cut function, followed by WRE with disparity-min for the remaining epochs. The hyper-parameter $\kappa$ specifies the fraction of epochs dedicated to stochastic exploration with the graph-cut function. Our experiments determined that setting $\kappa = \frac{1}{6}$ yielded optimal results following hyper-parameter tuning. The use of disparity-min in WRE ensures the selection of subsets that include both easy and difficult samples, with a higher probability of selecting difficult samples, and mitigates the model’s catastrophic forgetting of easy samples. The results of hyper-parameter tuning for $\kappa$ are presented in Appendix I.5.1. Furthermore, we demonstrate the advantage of curriculum-based data exploration in enhancing model convergence and performance through an ablation study, as detailed in Appendix I.5.
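Operationally, the curriculum is just a schedule over epochs. The hypothetical helper below (not the paper's Algorithm 1) maps each epoch to the strategy and set function in force, with `R` controlling how often a fresh subset is drawn:

```python
def milo_schedule(total_epochs, kappa=1/6, R=1):
    """Easy-to-hard curriculum: SGE with graph-cut for the first kappa
    fraction of epochs, then WRE with disparity-min for the rest.
    Returns one (strategy, refresh_subset) pair per epoch."""
    switch = round(kappa * total_epochs)
    return [("SGE/graph-cut" if epoch < switch else "WRE/disparity-min",
             epoch % R == 0)
            for epoch in range(total_epochs)]
```

At each epoch flagged for refresh, SGE rotates to the next pre-selected subset, while WRE draws a new subset from the pre-computed distribution $\boldsymbol{p}$; neither touches the downstream model.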

3.2 Milo Framework

As mentioned earlier, the Milo training process follows a curriculum of subsets, transitioning from representative to diverse subsets. Figure 3 illustrates the Milo training pipeline, while Figure 8 in the Appendix depicts the hyper-parameter tuning pipeline with Milo. Detailed pseudocode for the Milo algorithm can be found in Algorithm 1 in Appendix C, as space constraints preclude its inclusion in the main paper. For hyper-parameter search and scheduling algorithms, we used Wandb (Biewald, 2020), SUBMODLIB (Kaushal et al., 2022) for submodular optimization, and CORDS (Killamsetty et al., 2021a) for baseline subset selection methods.

Class-wise Data Partitioning:

The set functions we experiment with in this study necessitate a similarity kernel 𝒦 of size 𝑚 × 𝑚 , which must be computed over the entire dataset of size 𝑚 . This could lead to substantial memory requirements for computation and storage as the dataset size 𝑚 increases, making it computationally prohibitive in some instances. To address this issue, we partition the data based on class labels and select multiple subsets from each class (for SGE) or create a probability distribution over each class (for WRE), considering only data instances from each class separately. For example, in a balanced dataset of size 𝑚 with 𝑐 classes, class-wise partitioning reduces memory requirements by a factor of 𝑐 2 . By default, we use class-wise partitioning and curriculum-based data exploration with Milo.
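The savings are easy to check back-of-the-envelope: a full kernel stores $m^2$ entries, while each class-wise kernel of a balanced $c$-class dataset stores only $(m/c)^2$, so the kernel held in memory at any one time shrinks by a factor of $c^2$ (and the total across all $c$ classes by a factor of $c$). A small sketch with an assumed 4-byte entry size:

```python
def kernel_bytes(m, bytes_per_entry=4):
    """Memory for one dense m x m similarity kernel."""
    return m * m * bytes_per_entry

def classwise_kernel_bytes(m, c, bytes_per_entry=4):
    """Total memory across the c per-class kernels of a balanced dataset."""
    per_class = m // c
    return c * per_class * per_class * bytes_per_entry
```

For example, with m = 1000 and c = 10 classes, each per-class kernel is 100× smaller than the full kernel, and even the sum of all ten per-class kernels is 10× smaller.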

4 Experimental Results

Our experimental objective is to emphasize the efficiency and stability of Milo in both model training and hyper-parameter tuning. Each experiment is repeated five times, and for clarity, we present only the mean test accuracies in our plots. A comprehensive table, including both the mean test accuracy and standard deviations, can be found in Appendices H.4 and H.6. To ensure a fair comparison, we use the same random seed for all trials across all methods. Implementation specifics, datasets, and baselines utilized in each scenario will be detailed in the following subsections.

Figure 6: Comparison of Milo with baselines for model training using various subset sizes on seven different datasets with different models: (a) CIFAR10 (ResNet18), (b) CIFAR100 (ResNet101), (c) TinyImagenet (ResNet101), (d) ImageNet (ResNet50), (e) Rotten Tomatoes (LSTM), (f) IMDB (BERT+MLP), (g) CIFAR100 convergence, (h) Trec6 (LSTM) convergence. The x-axis shows the accuracy degradation compared to full data training, and the y-axis shows the speedup achieved by using the subset selection approach. Smaller subsets are on the right, and larger ones are on the left. Milo significantly outperforms existing baselines in terms of the accuracy degradation and speedup tradeoff compared to full data training. The bottom-right corner of each plot shows the best speedup-accuracy tradeoff region. Plots (g) and (h) show model convergence over time; again, Milo achieves much faster convergence than all baselines and full training.

Subset Selection Baselines:

Our experiments aim to showcase the performance of Milo in model training and hyperparameter tuning scenarios. In single model training experiments, we compare Milo with several strategies. These include: Random, which randomly samples a subset of the same size as Milo from the training data; Adaptive-Random, which adaptively samples a random subset of the same size as Milo from the training data every 𝑅 epochs; Full, which uses the entire training data for model training and tuning; Full-EarlyStop, where early stopping is applied to full training to match the time or energy consumption of Milo; and adaptive gradient-based subset selection strategies for efficient learning, where a new subset is selected every 𝑅 epochs. These include CraigPB, a faster per-batch version of Craig (Mirzasoleiman et al., 2020) as discussed in Killamsetty et al. (2021b), Glister (Killamsetty et al., 2021c), and Grad-MatchPB, the per-batch version of Grad-Match (Killamsetty et al., 2021b). In hyper-parameter tuning experiments, we adopt the experimental setup of Automata (Killamsetty et al., 2022), an efficient hyperparameter tuning framework that uses Grad-MatchPB, replacing the subset selection strategy with Milo. To assess the efficacy of Milo for hyperparameter tuning, we compare it with Random, Full, Adaptive-Random, and Automata (Grad-MatchPB) as subset selection baselines. We also present results from a variant of Milo, denoted Milo (Fixed), which uses a fixed subset for model training, selected by maximizing the disparity-min function.

Datasets, Model Architecture, and Experimental Setup: In our experiments, we utilized various vision and text datasets, namely Cifar100, Cifar10 (Krizhevsky and Hinton, 2009), TinyImageNet (Le and Yang, 2015), ImageNet (Russakovsky et al., 2015), Trec6 (Li and Roth, 2002; Hovy et al., 2001), Imdb (Maas et al., 2011), and Rotten Tomatoes (Pang and Lee, 2005). For datasets without pre-specified validation sets, we created a new validation set by splitting the original training set into a 90% training set and a 10% validation set. Appendix F provides detailed information regarding dataset sizes and splits. For the text datasets, we employed the LSTM model sourced from PyTorch, using trainable GloVe embeddings of 300 dimensions as input. We also utilized the BERT+MLP model, which combines a Bert-Base model with a two-layer MLP for classification. Regarding the vision datasets, we used the ResNet18, ResNet50, and ResNet101 models. We trained the ResNet models for 200 epochs with a batch size of 128, except for ImageNet, where we trained for 90 epochs. The LSTM model was trained for 24 epochs with a batch size of 16, while the BERT model was trained for 12 epochs with a batch size of 16. For image datasets, both Milo and the baselines employed Nesterov’s accelerated SGD optimizer with a learning rate of 0.05, weight decay of 5e-4, momentum of 0.9, and a cosine annealing learning rate scheduler, except for ImageNet where a cyclic learning rate scheduler was used. For text datasets with the LSTM model, we utilized the Adam optimizer with a learning rate of 0.001. For fine-tuning the BERT model, we employed the AdamW optimizer with a learning rate of 5e-5. The hyperparameter search spaces used for the hyperparameter tuning experiments are outlined in Appendix G. All experiments were conducted on 80GB A100 GPUs.

Pre-trained Transformer Models as Feature Encoders: We utilized the Dino-ViTb16 model (Caron et al., 2021) from HuggingFace (Wolf et al., 2019) as the feature encoder for vision datasets. The final layer CLS token embedding output served as the feature representation. For text datasets, we employed the pre-trained all-distilroberta-v1 model from the sentence transformers package (Reimers and Gurevych, 2019), computing the average of the final layer embeddings of all words in the sentence for the feature representation. An evaluation of the performance of pre-trained vision transformers and language models as feature encoders for subset selection can be found in Appendix I.1. Furthermore, we provide empirical evidence in Appendix H.1 demonstrating that the pre-trained transformer models used in this study can effectively generalize to specialized or unseen domain datasets.
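The two pooling strategies can be sketched as follows. This is a minimal numpy illustration of the pooling step only, assuming the encoder's final-layer output is already available as a token-embedding matrix; the actual encoders are loaded from HuggingFace and the sentence-transformers package.

```python
import numpy as np

def cls_feature(token_embeddings):
    """Vision-encoder style: take the final-layer [CLS] token (index 0)."""
    return token_embeddings[0]

def mean_pooled_feature(token_embeddings):
    """Text-encoder style: average the final-layer embeddings of all tokens."""
    return token_embeddings.mean(axis=0)

# Toy "final layer" output: 12 tokens, each a 768-dimensional embedding.
tokens = np.random.randn(12, 768)
cls_vec = cls_feature(tokens)           # shape (768,)
mean_vec = mean_pooled_feature(tokens)  # shape (768,)
```

Either way, each data sample is reduced to a single fixed-length vector, from which the pairwise similarity kernels used by the submodular objectives are computed.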

Efficacy of Milo for Efficient Model Training: Figure 6 illustrates the accuracy-efficiency tradeoff for the various subset selection methods used in model training. Performance is compared across subset sizes of 1%, 5%, 10%, and 30% of the training data. In our experiments, both Milo and Adaptive-Random used an 𝑅 value of 1, i.e., subset selection at every epoch; we empirically observed that an 𝑅 value of 1 enhances Milo's performance (an ablation study on 𝑅 appears in Appendix I.6). To keep the adaptive baselines (CraigPB, GradMatchPB, and Glister) comparably efficient, we used an 𝑅 value of 10 for the vision experiments and 3 for the text experiments. The sub-figures of Figure 6 plot accuracy degradation versus speedup, both relative to full training. Our results show that Milo achieves the best speedup-accuracy tradeoff, making it a greener choice in terms of CO2 emissions. Using ResNet18 on CIFAR10, Milo achieves 3.34x and 10.69x speedups with 1.03% and 4.07% performance losses, respectively. Similarly, it achieves a 3.34x speedup with only a 0.9% performance loss using ResNet50 on ImageNet. With ResNet101 on TinyImageNet and CIFAR100, Milo yields approximately 3.2x speedup with 1.30% and 3.18% performance losses, respectively. Milo achieves even higher speedups of around 10x with performance losses of 2.30% and 1.23% on the TREC6 and Rotten Tomatoes datasets, respectively. For fine-tuning the BERT+MLP model on the IMDB dataset, Milo attains a remarkable 24.94x speedup with just a 1.20% performance loss. Furthermore, Milo notably surpasses the Adaptive-Random baseline, an advantage that is particularly evident on text datasets; on more complex vision datasets, the gap between Milo and Adaptive-Random widens further, indicating Milo's superior performance.
Sub-figures of Figure 6 also show that Milo achieves faster convergence than all other methods on the CIFAR100 and TREC6 datasets using 30% subsets. Appendix H provides further efficient-training results on additional datasets and model architectures, demonstrating Milo's generalizability.

Figure 7: Comparison of Milo with baselines for hyper-parameter tuning using subset sizes of 1%, 5%, 10%, and 30%, shown in sub-figures (a) TREC6 (Random, HB), (b) TREC6 (TPE, HB), (c) CIFAR10 (Random, HB), and (d) CIFAR10 (TPE, HB), for combinations of Random Search with the Hyperband scheduler and TPE with the Hyperband scheduler. The scatter plots show that Milo achieves the best speedup-accuracy tradeoff in almost every case (the bottom-right corner of each plot indicates the best speedup-accuracy tradeoff region).

Efficacy of Milo for Hyper-parameter Tuning: The effectiveness of Milo in preserving the original hyper-parameter ordering is evaluated in Appendix H.5; we found that Milo retains the hyper-parameter ordering better than the baseline methods. Figure 7 illustrates the accuracy-efficiency trade-off among the subset selection methods for hyper-parameter tuning. We evaluated performance for subset sizes of 1%, 5%, 10%, and 30% of the training data, using combinations of Random Search (Pinto et al., 2009) with the Hyperband scheduler (Li et al., 2017) and TPE (Bergstra et al., 2011) with the Hyperband scheduler. The sub-figures of Figure 7 display accuracy degradation vs. speedup, both with respect to full-data tuning. Milo consistently outperforms the baselines, even with smaller subset sizes, and achieves the best tradeoff between speedup and accuracy for hyper-parameter tuning. Specifically, Milo achieves 75× and 20× speedups on the CIFAR10 and TREC6 datasets, respectively, with a negligible performance loss of about 0.1%.
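The Hyperband scheduler used above is built around successive halving, which can be sketched as follows. This is an illustrative simplification with a toy objective, not the tuning code used in the paper; `eta=3` is Hyperband's customary elimination rate.

```python
def successive_halving(configs, evaluate, budget=1, eta=3):
    """Repeatedly keep the top 1/eta of configurations, multiplying the
    per-configuration budget by eta, until one configuration remains
    (the core elimination loop behind Hyperband's scheduler)."""
    while len(configs) > 1:
        scores = [(evaluate(c, budget), c) for c in configs]
        scores.sort(reverse=True)  # higher validation score is better
        keep = max(1, len(configs) // eta)
        configs = [c for _, c in scores[:keep]]
        budget *= eta
    return configs[0]

# Toy search over learning rates: the "validation score" peaks at lr = 0.05.
evaluate = lambda lr, budget: -abs(lr - 0.05)
best = successive_halving([0.5, 0.1, 0.05, 0.01, 0.005, 0.001], evaluate)
```

Because most configurations are evaluated only at small budgets, replacing full-data evaluations with subset-based ones, as Milo does, compounds the savings of the scheduler.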

5 Conclusion

We introduce Milo, a novel, model-agnostic subset selection framework for efficient model training and tuning. Milo not only rivals the efficiency of random subset selection but also outperforms current state-of-the-art strategies in model convergence. Empirically, Milo expedites model training and hyper-parameter tuning by 3x to 10x and 20x to 75x, respectively, with minimal performance degradation. Consequently, Milo offers a significant societal advantage by facilitating faster, energy-efficient model training and tuning, thus reducing CO2 emissions. Although Milo employs pre-trained transformer models, it does not rely heavily on them. These models can be replaced with smaller proxy models, though the preprocessing times will be slightly longer. However, Milo does present some challenges, most notably the requirement for a large amount of memory to construct similarity kernels, even with class-wise partitioning strategies. In the future, we will investigate feature-based submodular functions to avoid the need for similarity kernel construction, as well as potential biases introduced when training on data subsets, to further improve the Milo framework.

References Ash et al. [2020] Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In ICLR, 2020. Bartoldson et al. [2022] Brian R. Bartoldson, Bhavya Kailkhura, and Davis Blalock. Compute-efficient deep learning: Algorithmic trends and opportunities, 2022. URL https://arxiv.org/abs/2210.06640. Bergstra et al. [2011] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011. URL https://proceedings.neurips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf. Biewald [2020] Lukas Biewald. Experiment tracking with weights and biases, 2020. URL https://www.wandb.com/. Software available from wandb.com. Birodkar et al. [2019] Vighnesh Birodkar, Hossein Mobahi, and Samy Bengio. Semantic redundancies in image-classification datasets: The 10% you don’t need. CoRR, abs/1901.11409, 2019. URL http://arxiv.org/abs/1901.11409. Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Herv’e J’egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9630–9640, 2021. Coleman et al. [2020] Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJg2b0VYDr. Dasgupta et al. [2013] Anirban Dasgupta, Ravi Kumar, and Sujith Ravi. Summarization through submodularity and dispersion. 
In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1014–1022, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. URL https://aclanthology.org/P13-1100. de Brébisson and Vincent [2016] Alexandre de Brébisson and Pascal Vincent. An exploration of softmax alternatives belonging to the spherical loss family. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.05042. Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019. Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy. Efraimidis and Spirakis [2016] Pavlos Efraimidis and Paul (Pavlos) Spirakis. Weighted Random Sampling, pages 2365–2367. Springer New York, New York, NY, 2016. ISBN 978-1-4939-2864-4. doi: 10.1007/978-1-4939-2864-4_478. URL https://doi.org/10.1007/978-1-4939-2864-4_478. Fujishige [2005] Satoru Fujishige. Submodular functions and optimization. Elsevier, 2005. Gotovos et al. [2015] Alkis Gotovos, Hamed Hassani, and Andreas Krause. Sampling from probabilistic submodular models. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/160c88652d47d0be60bfbfed25111412-Paper.pdf. 
Gurulingappa et al. [2012] Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics, 45(5):885 – 892, 2012. ISSN 1532-0464. doi: https://doi.org/10.1016/j.jbi.2012.04.008. URL http://www.sciencedirect.com/science/article/pii/S1532046412000615. Text Mining and Natural Language Processing in Pharmacogenomics. Hacohen and Weinshall [2019] Guy Hacohen and Daphna Weinshall. On the power of curriculum learning in training deep networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2535–2544. PMLR, 2019. URL http://proceedings.mlr.press/v97/hacohen19a.html. He et al. [2016] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. Hovy et al. [2001] Eduard Hovy, Laurie Gerber, Ulf Hermjakob, Chin-Yew Lin, and Deepak Ravichandran. Toward semantics-based answer pinpointing. In Proceedings of the First International Conference on Human Language Technology Research, 2001. URL https://www.aclweb.org/anthology/H01-1069. Howard et al. [2017] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017. Karanam et al. [2022] Athresh Karanam, Krishnateja Killamsetty, Harsha Kokel, and Rishabh K Iyer. ORIENT: Submodular mutual information measures for data subset selection under distribution shift. In Alice H. 
Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=mhP6mHgrg1c. Katharopoulos and Fleuret [2018] Angelos Katharopoulos and Francois Fleuret. Not all samples are created equal: Deep learning with importance sampling. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2525–2534. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/katharopoulos18a.html. Kaushal et al. [2019] Vishal Kaushal, Rishabh Iyer, Suraj Kothawade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakrishnan. Learning from less data: A unified data subset selection and active learning framework for computer vision. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1289–1299. IEEE, 2019. Kaushal et al. [2021] Vishal Kaushal, Suraj Kothawade, Ganesh Ramakrishnan, Jeff Bilmes, and Rishabh Iyer. Prism: A unified framework of parameterized submodular information measures for targeted data subset selection and summarization. arXiv preprint arXiv:2103.00128, 2021. Kaushal et al. [2022] Vishal Kaushal, Ganesh Ramakrishnan, and Rishabh Iyer. Submodlib: A submodular optimization library, 2022. URL https://arxiv.org/abs/2202.10680. Khan et al. [2022] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. In ACM Computing Surveys, volume 54, New York, NY, USA, sep 2022. Association for Computing Machinery. doi: 10.1145/3505244. URL https://doi.org/10.1145/3505244. Killamsetty et al. [2021a] Krishnateja Killamsetty, Dheeraj Bhat, Ganesh Ramakrishnan, and Rishabh Iyer. CORDS: COResets and Data Subset selection for Efficient Learning, March 2021a. URL https://github.com/decile-team/cords. Killamsetty et al. 
[2021b] Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. Grad-match: Gradient matching based data subset selection for efficient deep model training. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5464–5474. PMLR, 18–24 Jul 2021b. URL https://proceedings.mlr.press/v139/killamsetty21a.html. Killamsetty et al. [2021c] Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer. Glister: Generalization based data subset selection for efficient and robust learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):8110–8118, May 2021c. URL https://ojs.aaai.org/index.php/AAAI/article/view/16988. Killamsetty et al. [2021d] Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, and Rishabh K Iyer. RETRIEVE: Coreset selection for efficient and robust semi-supervised learning. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021d. URL https://openreview.net/forum?id=jSz59N8NvUP. Killamsetty et al. [2022] Krishnateja Killamsetty, Guttu Sai Abhishek, Aakriti Lnu, Ganesh Ramakrishnan, Alexandre V. Evfimievski, Lucian Popa, and Rishabh K Iyer. AUTOMATA: Gradient based data subset selection for compute-efficient hyper-parameter tuning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=ajH17-Pb43A. Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980. Kirchhoff and Bilmes [2014] Katrin Kirchhoff and Jeff Bilmes. 
Submodularity for data selection in machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 131–141, 2014. Kothawade et al. [2021] Suraj Kothawade, Vishal Kaushal, Ganesh Ramakrishnan, Jeff Bilmes, and Rishabh Iyer. Submodular mutual information for targeted data subset selection. arXiv preprint arXiv:2105.00043, 2021. Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report 0, University of Toronto, Toronto, Ontario, 2009. Le and Yang [2015] Ya Le and Xuan S. Yang. Tiny imagenet visual recognition challenge, 2015. Lee and Grauman [2011] Yong Jae Lee and Kristen Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR 2011, pages 1721–1728, 2011. doi: 10.1109/CVPR.2011.5995523. Li et al. [2020] Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Jonathan Ben-tzur, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. A system for massively parallel hyperparameter tuning. In I. Dhillon, D. Papailiopoulos, and V. Sze, editors, Proceedings of Machine Learning and Systems, volume 2, pages 230–246, 2020. URL https://proceedings.mlsys.org/paper/2020/file/f4b9ec30ad9f68f89b29639786cb62ef-Paper.pdf. Li et al. [2017] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res., 18(1):6765–6816, jan 2017. ISSN 1532-4435. Li and Roth [2002] Xin Li and Dan Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002. URL https://www.aclweb.org/anthology/C02-1150. Liaw et al. [2018] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E Gonzalez, and Ion Stoica. Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118, 2018. Liu et al. 
[2020] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Ro{bert}a: A robustly optimized {bert} pretraining approach, 2020. URL https://openreview.net/forum?id=SyxS0T4tvS. Loshchilov and Hutter [2015] Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015. Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Skq89Scxx. Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7. Maas et al. [2011] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015. Mirzasoleiman et al. [2015] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. Lazier than lazy greedy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015. Mirzasoleiman et al. [2020] Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models, 2020. Nemhauser et al. [1978] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—i. Mathematical programming, 14(1):265–294, 1978. Pang and Lee [2005] Bo Pang and Lillian Lee. 
Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL, 2005. Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. Paul et al. [2021] Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 20596–20607. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/ac56f8fe9eea3e4a365f29f0f1957c55-Paper.pdf. Pennington et al. [2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://aclanthology.org/D14-1162. Pinto et al. [2009] Nicolas Pinto, David Doukhan, James J. DiCarlo, and David D. Cox. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Computational Biology, 5, 2009. Pooladzandi et al. [2022] Omead Pooladzandi, David Davini, and Baharan Mirzasoleiman. Adaptive second order coresets for data-efficient machine learning. 
In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 17848–17869. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/pooladzandi22a.html. Qiu et al. [2020] Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. Pre-trained models for natural language processing: A survey. CoRR, abs/2003.08271, 2020. URL https://arxiv.org/abs/2003.08271. Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/radford21a.html. Reimers and Gurevych [2019] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL http://arxiv.org/abs/1908.10084. Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015. Schwartz et al. [2020] Roy Schwartz, Jesse Dodge, Noah Smith, and Oren Etzioni. Green ai. Communications of the ACM, 63:54 – 63, 2020. Sener and Savarese [2018] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. 
In International Conference on Learning Representations, 2018. Song et al. [2020] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mpnet: Masked and permuted pre-training for language understanding. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546. Sorscher et al. [2022] Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari S. Morcos. Beyond neural scaling laws: beating power law scaling via data pruning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=UmvSlP-PyV. Strubell et al. [2019] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1355. URL https://aclanthology.org/P19-1355. Tan and Le [2020] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks, 2020. Tiwari et al. [2021] Rishabh Tiwari, Krishnateja Killamsetty, Rishabh K. Iyer, and Pradeep Shenoy. Gcr: Gradient coreset based replay buffer selection for continual learning. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 99–108, 2021. Toneva et al. [2019] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJlxm30cKm. Wei et al. [2014a] Kai Wei, Yuzong Liu, Katrin Kirchhoff, Chris Bartels, and Jeff Bilmes. Submodular subset selection for large-scale speech training data. 
In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3311–3315. IEEE, 2014a. Wei et al. [2014b] Kai Wei, Yuzong Liu, Katrin Kirchhoff, and Jeff Bilmes. Unsupervised submodular subset selection for speech data. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4107–4111. IEEE, 2014b. Wolf et al. [2019] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Huggingface’s transformers: State-of-the-art natural language processing, 2019. URL https://arxiv.org/abs/1910.03771. Yang et al. [2021] Jiancheng Yang, Rui Shi, and Bingbing Ni. Medmnist classification decathlon: A lightweight automl benchmark for medical image analysis. In IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 191–195, 2021. Yang et al. [2023] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 10(1):41, 2023. Zhou et al. [2020] Tianyi Zhou, Shengjie Wang, and Jeffrey Bilmes. Curriculum learning by dynamic instance hardness. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 8602–8613. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/62000dee5a05a6a71de3a6127a68778a-Paper.pdf.

Supplementary Material

Appendix \parttoc Appendix A Code

The code of Milo is available at the following link: https://anonymous.4open.science/r/MILO-282B/.

Appendix B Licenses

We release the code repository of Milo under the MIT license, freely available for everybody to use. We use the popular deep learning framework PyTorch [Paszke et al., 2019] for the implementation of the Milo framework, wandb [Biewald, 2020] for hyper-parameter search and scheduling algorithms, SUBMODLIB [Kaushal et al., 2022] for submodular functions and submodular maximization, and CORDS [Killamsetty et al., 2021a] for subset selection strategies. We use the TREC6 [Li and Roth, 2002, Hovy et al., 2001], IMDB [Maas et al., 2011], Rotten Tomatoes [Pang and Lee, 2005], Cifar10 [Krizhevsky and Hinton, 2009], Cifar100 [Krizhevsky and Hinton, 2009], TinyImageNet [Le and Yang, 2015], ImageNet [Russakovsky et al., 2015], OrganCMNIST [Yang et al., 2021, 2023], DermaMNIST [Yang et al., 2021, 2023], and AdeCorpusV2 [Gurulingappa et al., 2012] datasets. The TREC6 dataset is licensed CC0: Public Domain. The IMDB dataset is released with a non-commercial license. The licenses of the Rotten Tomatoes and AdeCorpusV2 datasets are unknown; however, we used the publicly available versions from the "HuggingFace Datasets" package. Cifar10, Cifar100, and TinyImageNet are released under the MIT license. ImageNet is released under a non-commercial license. The OrganCMNIST and DermaMNIST datasets are released under the Creative Commons Attribution 4.0 International license (CC BY 4.0). Furthermore, all the datasets and pre-trained models used in this work are publicly available, and the datasets used do not contain any personally identifiable information.

Appendix C Milo Algorithm Pseudocode

We give the pseudo-code of the Milo algorithm for model training in Algorithm 1.

Algorithm 1: Milo

Input: Train set $\mathcal{D}$; initial parameters $\theta_0$; learning rate $\alpha$; total epochs $T$; selection interval $R$; SGE fraction $\kappa$; training loss $L_T$; mini-batch size $B$; subset size $k$.

1. Initialize the Graph-Cut function: $f_1(\mathcal{S}) = \sum_{i \in \mathcal{D}} \sum_{j \in \mathcal{S}} s_{ij} - 0.4 \sum_{i \in \mathcal{S}} \sum_{j \in \mathcal{S}} s_{ij}$.
2. Initialize the Disparity-Min function: $f_2(\mathcal{S}) = \min_{i, j \in \mathcal{S}, i \neq j} (1 - s_{ij})$.
3. If $\mathrm{is\_preprocessed}(\mathcal{D})$, retrieve the pre-selected subsets and pre-constructed sampling probabilities from metadata: $\mathcal{S}_0, \mathcal{S}_R, \ldots, \mathcal{S}_{\kappa T - R}, \boldsymbol{p} \leftarrow \mathrm{loadmetadata}(\mathcal{D})$.
4. Else:
   (a) Stochastic Greedy Exploration with Graph-Cut: $\mathcal{S}_0, \mathcal{S}_R, \ldots, \mathcal{S}_{\kappa T - R} \leftarrow \mathrm{SGE}(f_1, \mathcal{D}, k)$.
   (b) Weighted Random Exploration with Disparity-Min: $\boldsymbol{g} = [g_0, g_1, \ldots, g_{|\mathcal{D}|}] \leftarrow \mathrm{GreedySampleImportance}(f_2, \mathcal{D})$.
   (c) Construct the sampling probabilities: $\boldsymbol{p} \leftarrow \left[\frac{1 + g_i + 0.5 g_i^2}{\sum_{j=1}^{|\mathcal{D}|} \left(1 + g_j + 0.5 g_j^2\right)}\right]_{i=1}^{|\mathcal{D}|}$.
   (d) Store the selected subsets and constructed probabilities as metadata: $\mathrm{storemetadata}(\mathcal{D}, \mathcal{S}_0, \mathcal{S}_R, \ldots, \mathcal{S}_{\kappa T - R}, \boldsymbol{p})$.
5. Initialize the subset: $\mathcal{S} \leftarrow \mathcal{S}_0$.
6. Model training on representative/easy subsets: for epochs $t = 0, \ldots, \kappa T - 1$:
   (a) If $t \bmod R = 0$, update the subset: $\mathcal{S} \leftarrow \mathcal{S}_t$.
   (b) $\theta_{t+1} \leftarrow \text{mini-BatchSGD}(\mathcal{S}, \alpha, L_T, B, \mathrm{Epochs} = 1)$.
7. Model training on diverse/hard subsets: for epochs $t = \kappa T, \ldots, T - 1$:
   (a) If $(t - \kappa T) \bmod R = 0$, sample the subset $\mathcal{S}_t$ from $\mathcal{D}$ using the probabilities $\boldsymbol{p}$ without replacement, and update the subset: $\mathcal{S} \leftarrow \mathcal{S}_t$.
   (b) $\theta_{t+1} \leftarrow \text{mini-BatchSGD}(\mathcal{S}, \alpha, L_T, B, \mathrm{Epochs} = 1)$.
8. Return the final model parameters $\theta_T$.

Algorithm 2: SGE

Input: Train set $\mathcal{D}$; subset size $k$; set function $f$; $\epsilon = 10^{-2}$; number of subsets $n$.

1. For $i = 0, \ldots, n - 1$ (sampling $n$ subsets using stochastic-greedy):
   (a) Initialize an empty subset: $\mathcal{S}_i \leftarrow \emptyset$.
   (b) Set the random candidate-pool size for stochastic-greedy: $s \leftarrow \frac{|\mathcal{D}|}{k} \log\left(\frac{1}{\epsilon}\right)$.
   (c) For $j = 0, \ldots, k - 1$:
       i. Sample a random candidate pool of $s$ elements: $R \leftarrow \mathrm{sample}(\mathcal{D} \setminus \mathcal{S}_i, s)$.
       ii. $e \leftarrow \arg\max_{e \in R} \left[f(\mathcal{S}_i \cup \{e\}) - f(\mathcal{S}_i)\right]$.
       iii. $\mathcal{S}_i \leftarrow \mathcal{S}_i \cup \{e\}$.
2. Return $\mathcal{S}_0, \ldots, \mathcal{S}_{n-1}$.

Algorithm 3: GreedySampleImportance

Input: Train set $\mathcal{D}$; subset size $k$; set function $f$.

1. Initialize an empty subset: $\mathcal{S} \leftarrow \emptyset$.
2. Initialize the gains vector of dimension $|\mathcal{D}|$ to zeros: $\boldsymbol{g} \leftarrow \vec{0}$.
3. For $j = 0, \ldots, |\mathcal{D}| - 1$ (calculating the greedy marginal gain of each element):
   (a) $g \leftarrow \max_{e \in \mathcal{D} \setminus \mathcal{S}} \left[f(\mathcal{S} \cup \{e\}) - f(\mathcal{S})\right]$.
   (b) $e \leftarrow \arg\max_{e \in \mathcal{D} \setminus \mathcal{S}} \left[f(\mathcal{S} \cup \{e\}) - f(\mathcal{S})\right]$.
   (c) $\boldsymbol{g}[e] \leftarrow g$; $\mathcal{S} \leftarrow \mathcal{S} \cup \{e\}$.
4. Return the gain vector $\boldsymbol{g}$.

Appendix D Instantiations of Different Submodular Functions

D.1 Representation Functions

D.1.1 Facility Location

Given a dataset 𝒟 = {(x_i, y_i)}_{i=1}^{n} with n data samples, the facility location objective is given by:

f(𝒮) = Σ_{i ∈ 𝒟} max_{j ∈ 𝒮} s_ij   (6)

where s_ij denotes the similarity between data samples i and j. For each data point i in the ground set 𝒟, we find the most representative sample j in the subset 𝒮, i.e., the one closest to i, and sum these similarities over all data points. Note that due to the sum-max formulation of facility location, a single representative sample from each cluster in the dataset suffices to maximize the objective value. This prevents the subset from accumulating additional samples from dense regions.
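As a concrete illustration of Eq. (6), here is a minimal pure-Python sketch; the toy similarity matrix below is ours, chosen only to exhibit the one-representative-per-cluster behavior described above:

```python
def facility_location(similarity, subset):
    """Facility-location value f(S) = sum_i max_{j in S} s_ij (Eq. 6)."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in similarity)

# Toy 4-sample similarity matrix with two clusters: {0, 1} and {2, 3}.
S = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]

# One representative per cluster nearly saturates f; two picks from the
# same cluster are largely redundant.
spread = facility_location(S, {0, 2})     # covers both clusters -> 3.7
redundant = facility_location(S, {0, 1})  # one cluster only -> 2.4
```

The gap between `spread` and `redundant` is exactly the sum-max effect: once a cluster has one representative, adding a second member of that cluster contributes almost nothing.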

D.1.2 Graph-Cut

Given a dataset 𝒟 = {(x_i, y_i)}_{i=1}^{n} with n data samples, the graph-cut objective is given by:

f(𝒮) = Σ_{i ∈ 𝒟} Σ_{j ∈ 𝒮} s_ij − λ Σ_{i ∈ 𝒮} Σ_{j ∈ 𝒮} s_ij   (7)

The λ parameter in the above equation controls the trade-off between diversity and representation: the larger λ is, the more the graph-cut function rewards diversity. In our experiments, we set λ = 0.4, which biases the graph-cut function toward representation while keeping it monotone submodular. Note that due to the sum-sum formulation of graph-cut, selecting more samples from dense central regions maximizes the objective value. Thus, graph-cut can select subsets consisting of easy samples from very dense regions of the dataset.
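A minimal sketch of Eq. (7); the toy similarity matrix below is illustrative (not from the paper) and is built so that samples 0–2 form a dense cluster while sample 3 is isolated, showing the dense-region preference at small λ:

```python
def graph_cut(similarity, subset, lam=0.4):
    """Graph-cut objective (Eq. 7): coverage of the full set minus
    lambda times intra-subset redundancy."""
    n = len(similarity)
    coverage = sum(similarity[i][j] for i in range(n) for j in subset)
    redundancy = sum(similarity[i][j] for i in subset for j in subset)
    return coverage - lam * redundancy

# Samples 0-2 are a dense cluster; sample 3 is an outlier.
S = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.1],
    [0.9, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]

dense = graph_cut(S, {0, 1})    # two picks from the dense cluster
diverse = graph_cut(S, {0, 3})  # one dense pick plus the outlier
```

With λ = 0.4 the redundancy penalty is too weak to overcome the coverage term, so `dense` exceeds `diverse`, matching the text's observation that graph-cut favors dense regions at this setting.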

D.2 Diversity Functions

D.2.1 Disparity-Sum

Given a dataset 𝒟 = {(x_i, y_i)}_{i=1}^{n} with n data samples, the disparity-sum objective is given by:

f(𝒮) = Σ_{i ∈ 𝒮} Σ_{j ∈ 𝒮} (1 − s_ij)   (8)

Note that in the above equation, disparity-sum focuses purely on selecting samples that are maximally different from each other, without regard for how representative the subset is of the dataset.

D.2.2 Disparity-Min

Given a dataset 𝒟 = {(x_i, y_i)}_{i=1}^{n} with n data samples, the disparity-min objective is given by:

f(𝒮) = min_{i, j ∈ 𝒮, i ≠ j} (1 − s_ij)   (9)

In the above equation, 1 − s_ij can be viewed as a distance measure, so maximizing the disparity-min objective maximizes the minimum pairwise distance between the samples in the selected subset. Even though the disparity-min objective is not submodular, it has been shown to perform well empirically with conventional greedy approaches [Dasgupta et al., 2013].
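To show how a diversity objective such as disparity-sum plugs into the stochastic-greedy selection of Algorithm 2, here is a hypothetical minimal sketch; the function names, pool-size rounding, and toy matrix are illustrative and not the paper's implementation:

```python
import math
import random

def disparity_sum(similarity, subset):
    """Disparity-sum (Eq. 8): total pairwise distance 1 - s_ij over
    ordered pairs of distinct indices within the subset."""
    return sum(1 - similarity[i][j] for i in subset for j in subset if i != j)

def stochastic_greedy(similarity, f, k, eps=1e-2, seed=0):
    """Stochastic-greedy maximization in the spirit of Algorithm 2 (SGE):
    each greedy step evaluates marginal gains only on a random candidate
    pool of size ~(n/k) * log(1/eps) instead of the full remaining set."""
    rng = random.Random(seed)
    n = len(similarity)
    pool_size = max(1, int((n / k) * math.log(1 / eps)))
    selected, remaining = [], set(range(n))
    for _ in range(k):
        pool = rng.sample(sorted(remaining), min(pool_size, len(remaining)))
        base = f(similarity, selected)
        best = max(pool, key=lambda e: f(similarity, selected + [e]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy similarity matrix: two tight clusters {0, 1} and {2, 3}.
S = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
subset = stochastic_greedy(S, disparity_sum, k=2)
```

Because disparity-sum rewards pairwise distance, the second greedy pick always lands in the opposite cluster from the first, regardless of the random pool order.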

Appendix E Assessing EL2N Scores of Subset Selection with Different Set Functions

In this section, we assess the difficulty of the data samples in subsets selected using different set functions, as measured by the EL2N metric [Paul et al., 2021]. Tables 1 and 2 present the mean and median EL2N scores for the selected subsets.

EL2N scores for CIFAR100 (ResNet18):

| Subset Fraction | Set Function | EL2N (mean) | EL2N (median) |
| --- | --- | --- | --- |
| 0.01 | Graph Cut | 0.31932 | 0.25196815 |
| | Facility Location | 0.5005 | 0.48838884 |
| | Disparity Min | 0.9512 | 0.998251547 |
| | Disparity Sum | 0.9432 | 0.9945051 |
| 0.05 | Graph Cut | 0.39428785 | 0.35422683 |
| | Facility Location | 0.6082416 | 0.6329421 |
| | Disparity Min | 0.8688732 | 0.94560076 |
| | Disparity Sum | 0.8725095 | 0.947182 |
| 0.1 | Graph Cut | 0.44762444 | 0.4188764 |
| | Facility Location | 0.67536277 | 0.74037623 |
| | Disparity Min | 0.8125279 | 0.8964435 |
| | Disparity Sum | 0.8283565 | 0.90549386 |
| 0.3 | Graph Cut | 0.534334 | 0.5372032 |
| | Facility Location | 0.68066484 | 0.73706484 |
| | Disparity Min | 0.79167026 | 0.89193624 |
| | Disparity Sum | 0.7429611 | 0.81957316 |

Table 1: Mean and median EL2N scores [Paul et al., 2021] for CIFAR100 (ResNet18) subsets of different sizes selected using various set functions.

EL2N scores for CIFAR10 (ResNet18):

| Subset Fraction | Set Function | EL2N (mean) | EL2N (median) |
| --- | --- | --- | --- |
| 0.01 | Graph Cut | 0.053498093 | 0.093978405 |
| | Facility Location | 0.24479946 | 0.32504702 |
| | Disparity Min | 0.3455502 | 0.34030083 |
| | Disparity Sum | 0.34296998 | 0.3437602 |
| 0.05 | Graph Cut | 0.093978405 | 0.039304152 |
| | Facility Location | 0.32504702 | 0.21770108 |
| | Disparity Min | 0.34030083 | 0.24009448 |
| | Disparity Sum | 0.3437602 | 0.24489668 |
| 0.1 | Graph Cut | 0.13287988 | 0.060230725 |
| | Facility Location | 0.33706713 | 0.2388236 |
| | Disparity Min | 0.37538758 | 0.24573568 |
| | Disparity Sum | 0.36741473 | 0.24983436 |
| 0.3 | Graph Cut | 0.2021475 | 0.11052186 |
| | Facility Location | 0.33232126 | 0.23153576 |
| | Disparity Min | 0.38120695 | 0.26727863 |
| | Disparity Sum | 0.35034057 | 0.25104147 |

Table 2: Mean and median EL2N scores [Paul et al., 2021] for CIFAR10 (ResNet18) subsets of different sizes selected using various set functions.

The results demonstrate that subsets selected using Facility Location and GraphCut exhibit lower hardness scores compared to those chosen by Disparity Min and Disparity Sum. This finding provides strong support for the assertion that Facility Location and GraphCut tend to prefer subsets that are easier, while Disparity Min and Disparity Sum tend to select subsets that are harder. Moreover, the results reveal an expected pattern in the disparity of EL2N scores between diversity-based set functions and representation-based set functions. Specifically, this disparity is more pronounced for smaller subset sizes and gradually diminishes as the subset size increases.

Another noteworthy finding is that the hardness scores for samples in the CIFAR10 dataset are lower than those in the CIFAR100 dataset. This observation offers a plausible explanation for why a straightforward baseline method like Adaptive-Random performs comparably well to Milo on the CIFAR10 dataset. This finding further bolsters our claim, as stated in Section 4, that Milo outperforms Adaptive-Random as the dataset complexity increases. These additional insights contribute to the overall strength of our arguments and provide a more comprehensive understanding of the interplay between set functions, hardness scores, and dataset complexity.
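A minimal sketch of the per-sample EL2N computation, assuming softmax outputs and integer labels; this is an illustrative reimplementation of the metric from Paul et al. [2021], not the authors' code:

```python
import math

def el2n(probs, label):
    """EL2N score (Paul et al., 2021): L2 norm of the error between the
    model's softmax output and the one-hot encoding of the true label."""
    return math.sqrt(sum((p - (1.0 if c == label else 0.0)) ** 2
                         for c, p in enumerate(probs)))

# A confidently correct prediction is "easy" (low EL2N); an uncertain or
# wrong prediction is "hard" (high EL2N).
easy = el2n([0.9, 0.05, 0.05], label=0)
hard = el2n([0.2, 0.5, 0.3], label=0)
```

In practice the score is averaged over several early-training checkpoints, which is why lower table values above indicate easier subsets.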

Appendix F Additional Dataset Details

| Dataset | #Classes | #Train | #Validation | #Test |
| --- | --- | --- | --- | --- |
| TREC6 | 6 | 4907 | 545 | 500 |
| IMDB | 2 | 22500 | 25000 | 25000 |
| Rotten Tomatoes | 2 | 8530 | 1066 | 1066 |
| AdeCorpusV2 | 2 | 23516 | | |

Table 3: Number of classes and number of instances in the train, validation, and test splits of the text datasets.

F.1 Details of Text Datasets

We conducted experiments on three text datasets: TREC6 [Li and Roth, 2002, Hovy et al., 2001], IMDB [Maas et al., 2011], and Rotten Tomatoes [Pang and Lee, 2005]. TREC6 is a question classification dataset comprising open-domain, fact-based questions categorized into broad semantic groups: ABBR (Abbreviation), DESC (Description and Abstract Concepts), ENTY (Entities), HUM (Human Beings), LOC (Locations), and NYM (Numeric Values). Both IMDB and Rotten Tomatoes datasets are sentiment classification datasets, with reviews sourced from their respective platforms. For specialized domain datasets, we utilized the AdeCorpusV2 dataset [Gurulingappa et al., 2012], which contains data on adverse drug reactions.

The AdeCorpusV2 dataset was split such that 20% of the original data was used for testing, and 10% of the remaining data served as a validation set. The validation data for TREC6 and IMDB datasets were obtained by using 10% of the training data. A seed value of 42 was used in the generator argument for the random_split function in PyTorch. The number of classes and instances in each split of the text datasets are summarized in Table 3.
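The splitting protocol above can be sketched deterministically. This is a minimal pure-Python analogue of the described procedure (the paper uses PyTorch's random_split with a seeded generator; the helper name and shuffle-based scheme here are illustrative, not the exact implementation):

```python
import random

def split_indices(n, test_frac=0.2, val_frac=0.1, seed=42):
    """Shuffle indices with a fixed seed, carve out a test split, then take
    a validation split from the remainder (mirrors the AdeCorpusV2 protocol)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_frac)
    test, rest = indices[:n_test], indices[n_test:]
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

# AdeCorpusV2 has 23,516 examples before splitting.
train, val, test = split_indices(23516)
```

Fixing the seed makes the split reproducible across runs, which matters when comparing subset selection strategies on identical validation and test sets.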

F.2 Details of Vision Datasets

We performed experiments on the Cifar10 [Krizhevsky and Hinton, 2009], Cifar100 [Krizhevsky and Hinton, 2009], and TinyImageNet [Le and Yang, 2015] datasets. The Cifar10 dataset contains 60,000 colored images of size 32 × 32 divided into ten classes, each with 6000 images. Cifar100 images are also of size 32 × 32, but the dataset contains 100 classes with 600 images per class. Both Cifar10 and Cifar100 have 50,000 training samples and 10,000 test samples distributed equally across all classes. The TinyImageNet [Le and Yang, 2015] dataset contains 120,000 colored images of size 64 × 64 from 200 classes, with 600 images per class. The ImageNet [Russakovsky et al., 2015] dataset contains 1,281,167 colored images of size 224 × 224 from 1000 classes. For the ImageNet dataset, we report the accuracy on the validation set as test accuracy in this work. For experiments on datasets from specialized domains, we use the OrganCMNIST [Yang et al., 2021, 2023] and DermaMNIST [Yang et al., 2021, 2023] datasets. The OrganCMNIST dataset contains abdominal CT grayscale images of size 32 × 32 from 11 classes. The DermaMNIST dataset contains dermatoscope images of size 32 × 32 from 7 classes.

| Dataset | #Classes | #Train | #Validation | #Test |
| --- | --- | --- | --- | --- |
| Cifar10 | 10 | 45000 | 5000 | 10000 |
| Cifar100 | 100 | 45000 | 5000 | 10000 |
| TinyImageNet | 200 | 100000 | 10000 | 10000 |
| ImageNet | 1000 | 1281167 | 50000 | |
| OrganCMNIST | 11 | 13000 | 2392 | 8268 |
| DermaMNIST | 7 | 7007 | 1003 | 2005 |

Table 4: Number of classes and number of instances in the train, validation, and test splits of the image datasets.

For both the Cifar10 and Cifar100 datasets, 10% of the training data is used for validation (seed value = 42). Table 4 summarizes the number of classes and the number of instances in each split of the image datasets.

Appendix G Additional Experimental Details

G.1 GPU Resources

We performed experiments on an 80GB A100 GPU cluster. For fair timing comparisons, we ran Milo and all other baselines for a given setting on the same GPU server.

Figure 8: Block diagram of Milo for hyper-parameter tuning, where each individual configuration is trained using Milo instead of training on the full dataset.

G.2 Experimental Details for Model Training

We compare Milo with the following baselines:

- Random: randomly sample a fixed subset (of the same size as used by Milo) from the training data.
- Adaptive-Random: adaptively sample a random subset of the same size from the training data every R epochs.
- Full: use the entire training data for model training and tuning.
- Full-EarlyStop: stop full training early so as to match the time taken (or energy used) by Milo.
- Adaptive gradient-based subset selection strategies for efficient learning, where a new subset is selected every R epochs: CraigPB, the faster per-batch version of Craig [Mirzasoleiman et al., 2020] shown in Killamsetty et al. [2021b]; Glister [Killamsetty et al., 2021c]; and Grad-MatchPB, the per-batch version of Grad-Match [Killamsetty et al., 2021b].

Our experiments use R = 1 (i.e., subset selection every epoch) for Milo and Adaptive-Random, since we empirically observed that R = 1 yields better model performance with Milo. To achieve comparable efficiency, the adaptive gradient-based baselines (CraigPB, GradMatchPB, and Glister) use R = 10 for vision experiments and R = 3 for text experiments.
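The refresh schedules above can be made concrete with a small sketch (illustrative helper, not part of the paper's code):

```python
def refresh_epochs(total_epochs, R):
    """Epochs at which a new subset is selected when refreshing every R epochs."""
    return [t for t in range(total_epochs) if t % R == 0]

# Milo and Adaptive-Random refresh every epoch (R = 1), while the more
# expensive gradient-based baselines refresh every 10 epochs on vision tasks.
milo_schedule = refresh_epochs(20, 1)
baseline_schedule = refresh_epochs(20, 10)
```

Because Milo's selection step is model-agnostic and precomputed, the per-epoch refresh adds essentially no overhead, whereas the gradient-based methods must amortize their selection cost over R epochs.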

Specific Details of Text Experiments For text datasets, we use the LSTM model (from PyTorch) with trainable GloVe embeddings [Pennington et al., 2014] of dimension 300 as input, and the BERT+MLP model consisting of the Bert-Base [Devlin et al., 2019] model with an added two-layer MLP for classification. We train the LSTM model for 24 epochs and the BERT model for 12 epochs using a batch size of 16. For text datasets with the LSTM model, we use the Adam optimizer [Kingma and Ba, 2015] with a learning rate of 0.001. For BERT fine-tuning, we use the AdamW optimizer [Loshchilov and Hutter, 2019] with a learning rate of 5e-5.

Specific Details of Image Experiments For vision datasets, we use the ResNet18 [He et al., 2016] and ResNet101 [He et al., 2016] models. We train the ResNet [He et al., 2016] models for 200 epochs using a batch size of 128. For training ResNet models with Milo and baselines, we use Nesterov’s accelerated SGD optimizer with a learning rate of 0.05, weight decay of 5e-4, momentum of 0.9, and a cosine annealing [Loshchilov and Hutter, 2017] learning rate scheduler for all the experiments.

G.3 Experimental Details for Hyper-Parameter Tuning

We follow the experimental setup of Automata [Killamsetty et al., 2022], an efficient hyperparameter tuning framework using Grad-MatchPB, and replace its subset selection strategy with Milo. Figure 8 gives a pictorial depiction of the hyper-parameter tuning pipeline with Milo. As discussed in Automata [Killamsetty et al., 2022], the hyper-parameter tuning pipeline contains three major components: a) hyper-parameter search algorithms, which return the list of hyper-parameter configurations to be evaluated; b) configuration evaluations involving subset-based model training runs, where for each configuration we train a model on the subsets selected by Milo; and c) hyper-parameter schedulers, which decide the resources allocated to each configuration evaluation and which configurations to discard early. We evaluate the effectiveness of Milo for hyperparameter tuning by performing subset-based configuration evaluations with Random, Full, Adaptive-Random, and Automata (Grad-MatchPB) as baselines. For tuning with Full, the entire dataset is used to train the model during hyperparameter tuning; for the other baselines, a subset of the dataset is used to train the various models during tuning. In addition, the subset selection techniques used are adaptive, meaning that the model is trained on a different subset every few epochs. We empirically observed that R = 1 results in better model performance with Milo. To achieve comparable efficiency, the adaptive baselines (CraigPB and GradMatchPB) use R = 10 for vision experiments and R = 3 for text experiments. We experiment with the combination of a) Random Search [Pinto et al., 2009] with the Hyperband [Li et al., 2017] scheduler, and b) TPE Search [Bergstra et al., 2011] with Hyperband [Li et al., 2017]. We test only the Hyperband scheduler because we use Wandb [Biewald, 2020] to run the hyper-parameter optimization algorithms, and Wandb supports only Hyperband. We use Wandb for hyper-parameter tuning to accurately compute tuning times, which we found difficult with Ray [Liaw et al., 2018] due to its high internal communication overhead.

Specific Details of Text Experiments For text datasets, we use the LSTM model (from PyTorch) with trainable GloVe embeddings [Pennington et al., 2014] of dimension 300 as input. The hyper-parameter space for experiments on text datasets includes the learning rate, the hidden size and number of layers of the LSTM, and the training batch size. In the experiments (with the TPE search algorithm) where the best configuration among 108 configurations is found, the hyper-parameter space is learning rate: [0.001, 0.1], LSTM hidden size: {64, 128, 256, 512}, number of LSTM layers: {1, 2}, batch size: {16, 32, 64}.

Specific Details of Image Experiments For vision datasets, we use the ResNet18 [He et al., 2016] model. The hyper-parameter search space for tuning experiments on image datasets includes a choice between the Momentum method and the Nesterov Accelerated Gradient method, a choice of learning rate scheduler with its corresponding parameters, and four group-wise learning rates: lr_1 for the layers of the first group, lr_2 for the layers of the intermediate groups, lr_3 for the layers of the last group of the ResNet model, and lr_4 for the final fully connected layer. For the learning rate scheduler, we change the learning rates during training using either a cosine annealing schedule or a linear decay by 𝛾 after every 20 epochs. The best configuration for most experiments is selected from 108 configurations, where the hyper-parameter space is lr_1: [0.001, 0.01], lr_2: [0.001, 0.01], lr_3: [0.001, 0.01], lr_4: [0.001, 0.01], Nesterov: {True, False}, learning rate scheduler: {Cosine Annealing, Linear Decay}, 𝛾: [0.05, 0.5].
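A search space like the one above can be encoded and sampled as follows; the dictionary keys and the sampler are a hypothetical sketch of the described space, not the exact configuration objects used with Wandb:

```python
import random

# Illustrative encoding of the image-tuning search space described above:
# (low, high) tuples are continuous ranges, lists are categorical choices.
SPACE = {
    "lr1": (0.001, 0.01),
    "lr2": (0.001, 0.01),
    "lr3": (0.001, 0.01),
    "lr4": (0.001, 0.01),
    "nesterov": [True, False],
    "lr_scheduler": ["cosine_annealing", "linear_decay"],
    "gamma": (0.05, 0.5),
}

def sample_config(space, rng):
    """Draw one configuration: uniform in a (low, high) range, choice from a list."""
    return {
        name: rng.uniform(*spec) if isinstance(spec, tuple) else rng.choice(spec)
        for name, spec in space.items()
    }

cfg = sample_config(SPACE, random.Random(0))
```

Random search simply draws configurations like `cfg` repeatedly, while TPE biases later draws toward regions that scored well earlier.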

Appendix H Additional Experimental Results

H.1 Results on Specialized Domain Datasets

In Milo, pre-trained models play a key role in capturing pairwise interactions between data samples in a zero-shot setting. The resulting pairwise similarity matrix is then used for subset selection and for constructing the sampling probability distribution. Notably, these pre-trained models (all-distilroberta-v1, DINO (CLS)), previously trained on natural images or a general text corpus, have demonstrated a certain degree of success in recognizing these interactions even within specialized domain samples. This can be attributed to the training approach of the pre-trained transformer models used in this work: they are trained with either a contrastive or a denoising objective, which enables them to identify low-level features in examples from specialized domains and thereby capture interactions between data samples.

Empirically, we observed that pre-trained transformer models generalize well to datasets from different domains for subset selection. This was demonstrated through a series of experiments on specialized domain datasets. For vision datasets, we conducted experiments on the OrganCMNIST dataset using the EfficientNetB0 model [Tan and Le, 2020], and the DermaMNIST dataset using the MobileNet model [Howard et al., 2017], for subset sizes of 5% and 10%. The EfficientNet model was suitably adapted for single-channel images. For text datasets, we carried out experiments on the AdeCorpusV2 dataset using an LSTM model for subset sizes of 5% and 10%. We adhered to the training details outlined in Appendix G.

Figure 9: Comparison of Milo to baselines using 5% and 10% subsets of specialized domain datasets for model training with general pre-trained transformer models as feature encoders. Panels: (a) AdeCorpusV2 (LSTM), (b) OrganCMNIST (EfficientNetB0), (c) DermaMNIST (MobileNet).

Figure 9 presents scatter plots that demonstrate the trade-off between accuracy degradation and computational speed-up for Milo. These results were obtained using general pre-trained transformer models as feature encoders and compared against baseline models on specialized domain datasets. The subsets utilized for these models constituted 5% and 10% of the full dataset, using all-distilroberta-v1 as the text feature encoder and DINO (CLS) as the image feature encoder. The results clearly show that Milo consistently outperforms the baselines in achieving an optimal balance between efficiency and performance, even when general pre-trained models are used.

Moreover, the risk of model overfitting to a small subset and not generalizing well is significantly higher when a fixed data subset is used. However, Milo allows the downstream model to explore the entire dataset through Stochastic-Greedy Exploration (SGE) and Weighted Random Exploration (WRE). This approach reduces the risk of model overfitting to a small data subset and improves generalization performance. Our empirical findings suggest that Milo closely approximates optimal full data training performance and experiences minimal performance degradation compared to the baselines. We argue, therefore, that Milo offers a viable approach for training downstream models that exhibit strong generalization, even when using pre-trained encoders.
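The WRE sampling step mentioned above can be sketched as follows; the weight formula mirrors the probability construction in Algorithm 1, while the sequential without-replacement sampler is a simple illustrative scheme, not necessarily the authors' implementation:

```python
import random

def importance_probs(gains):
    """Second-order (Taylor-softmax style) weights over greedy gains,
    mirroring the probability construction in Algorithm 1."""
    weights = [1 + g + 0.5 * g * g for g in gains]
    total = sum(weights)
    return [w / total for w in weights]

def weighted_sample_without_replacement(probs, k, rng):
    """Draw k distinct indices with probability proportional to `probs`,
    renormalizing over the remaining pool after each draw."""
    remaining = list(range(len(probs)))
    weights = [probs[i] for i in remaining]
    chosen = []
    for _ in range(k):
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return chosen

p = importance_probs([0.0, 1.0, 2.0])
subset = weighted_sample_without_replacement(p, 2, random.Random(0))
```

Samples with higher greedy gains get proportionally higher weights, so repeated refreshes let the model explore the whole dataset while still favoring informative samples.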

H.2 Results using Proxy Model

As demonstrated in Appendix H.1, the pre-trained models (all-distilroberta-v1, DINO (CLS)) utilized in this work have exhibited generalization capabilities to datasets from unseen or specialized domains. However, there might be cases where these models’ performance is suboptimal, which can be conveniently assessed through linear probing accuracies. In these instances, a smaller proxy model, such as ResNet18, can be trained to convergence and used as a feature encoder for Milo, potentially improving performance.

Although training a proxy model increases pre-processing costs, it’s worth noting that Milo’s data pre-processing only needs to be performed once per dataset. Thus, despite an initial increase in pre-processing costs, these can be amortized over time when training multiple downstream models.

To demonstrate Milo’s effectiveness when using proxy models, we conducted a series of experiments on the Cifar100 and OrganCMNIST datasets. Using ResNet18 as a proxy model, and the inputs to the final fully connected layer as sample features, we generated subsets and probability distributions as part of the pre-processing using Milo’s data exploration strategies. To evaluate the efficiency and generalizability of Milo in the context of using a proxy model for training, we conducted experiments with ResNet101, EfficientNetB0, and DenseNet121 models. These models were trained using subsets of the Cifar100 and OrganCMNIST datasets, selected by Milo and the considered baselines, with subset sizes of 1%, 5%, 10%, and 30%.

Figure 10: Comparison of Milo to baselines using 1%, 5%, 10%, and 30% subsets of the Cifar100 and OrganCMNIST datasets for model training with the ResNet18 proxy model as the feature encoder. Panels: (a)–(c) CIFAR100 with ResNet101, EfficientNetB0, and DenseNet121; (d)–(f) OrganCMNIST with the same three models.

Figure 10 presents scatter plots illustrating the trade-off between accuracy degradation and computational speed-up for Milo, using the ResNet18 proxy model as a feature encoder, and comparing it to baseline models on the Cifar100 and OrganCMNIST datasets. The results clearly show that Milo consistently outperforms the baselines in achieving an optimal balance between efficiency and performance, even when utilizing proxy models. These results indicate that Milo is not heavily dependent on pre-trained transformer models and can effectively utilize any informative feature encoders. The primary conclusion drawn from these experiments is that the Milo framework is highly adaptable and can be readily applied across a wide range of scenarios for efficient model training.

H.3 Data Pre-processing Time

The preprocessing time for Milo, utilizing pre-trained transformer models as feature encoders, on a machine equipped with 500 GB RAM and an A100 GPU with 80GB memory, for several datasets, is as follows:

Cifar10 Dataset: 5 minutes

Cifar100 Dataset: 5 minutes

ImageNet Dataset: 20 minutes

The preprocessing times for Milo, using pre-trained transformer models as feature encoders, are relatively minimal compared to the full training times for models of varying sizes. Specifically, the preprocessing times amount to between 3% and 5% of the full training times for a medium-sized model like ResNet18, and less than 3% of the full training time for a larger model like ResNet101 on the datasets considered. Importantly, this preprocessing step is independent of the model used for training and can be shared across any model being trained on these datasets. Therefore, it only needs to be done once per dataset and subset size.

However, when using proxy models, the preprocessing costs do increase because the training time of the proxy model also needs to be taken into account. For instance, when using a ResNet18 as a proxy model, the preprocessing times can equate to between 103% and 105% of the full training times. Despite this increase, these preprocessing costs can be amortized over multiple model trainings, maintaining the time-efficiency of Milo for training models on large datasets.
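The amortization argument can be made concrete with a small calculation. The numbers below are illustrative, based on the ~105% proxy-preprocessing figure above and a hypothetical subset run costing 10% of a full training run:

```python
def effective_speedup(full_time, subset_time, preprocess_time, num_runs):
    """Overall speedup over `num_runs` training runs once the one-time
    preprocessing cost is amortized across them."""
    return (num_runs * full_time) / (preprocess_time + num_runs * subset_time)

# Times expressed as fractions of one full training run.
single = effective_speedup(1.0, 0.1, 1.05, num_runs=1)   # one run: slower than full
many = effective_speedup(1.0, 0.1, 1.05, num_runs=20)    # twenty runs: large speedup
```

For a single run the proxy's training cost dominates, but across many downstream trainings (e.g., during hyper-parameter tuning) the one-time cost fades and the speedup approaches the pure subset-training ratio.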

H.4 Test-Accuracies, Training times and Standard deviations for Model Training Experiments

Tables 5 and 7 show the top-1 test accuracies and training times of Milo and the other considered baselines for single-model training on the Cifar10, Cifar100, TinyImageNet, TREC6, IMDB, and Rotten Tomatoes datasets for subset sizes of 1%, 5%, 10%, and 30%. Furthermore, Tables 6 and 8 give the corresponding standard deviations of Milo and the other baselines on the same datasets and subset sizes.

Model Training Results on Vision Datasets (top-1 test accuracy in %, training time in hours):

| Dataset | Model | Selection Strategy | Acc. 1% | Acc. 5% | Acc. 10% | Acc. 30% | Time 1% | Time 5% | Time 10% | Time 30% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR10 | ResNet18 | Full (skyline for test accuracy) | 95.19 | 95.19 | 95.19 | 95.19 | 1.72736889 | 1.72736889 | 1.72736889 | 1.72736889 |
| | | Random (skyline for training time) | 39.91 | 63.52 | 77.47 | 89.62 | 0.01646028 | 0.08030361 | 0.1722375 | 0.50628667 |
| | | Adaptive-Random (skyline for training time) | 63.71 | 88.2 | 91.09 | 94.05 | 0.01632191 | 0.08136123 | 0.1776311 | 0.5068371 |
| | | Glister | 38.2 | 79.02 | 90.67 | 93.04 | 0.08538972 | 0.16018167 | 0.23015611 | 0.57614889 |
| | | CraigPB | 64.11 | 84.35 | 88.97 | 92.99 | 0.13940694 | 0.19342472 | 0.26544833 | 0.60756111 |
| | | GradMatchPB | 61.59 | 85.34 | 89.67 | 93.71 | 0.08093056 | 0.13492361 | 0.21210222 | 0.52081639 |
| | | Milo (Fixed) | 38.17 | 64.17 | 78.06 | 89.21 | 0.01667131 | 0.0821212 | 0.172141 | 0.50281731 |
| | | Milo | 68.56 | 88.37 | 91.11 | 94.12 | 0.0163121 | 0.0808317 | 0.1702131 | 0.50645219 |
| CIFAR10 | ResNet101 | Full (skyline for test accuracy) | 95.01 | 95.01 | 95.01 | 95.01 | 2.607 | 2.607 | 2.607 | 2.607 |
| | | Random (skyline for training time) | 31.49 | 54.35 | 69.99 | 88.51 | 0.02820806 | 0.14013056 | 0.27079694 | 0.85268889 |
| | | Adaptive-Random (skyline for training time) | 21.72 | 80.25 | 88.13 | 94.2 | 0.0270875 | 0.13807778 | 0.2795975 | 0.82271222 |
| | | Glister | 10.46 | 62.41 | 83.79 | 92.47 | 0.12346444 | 0.24328889 | 0.38844556 | 1.05459139 |
| | | CraigPB | 33.74 | 74.07 | 85.4 | 92.44 | 0.16501306 | 0.27757139 | 0.41062028 | 0.9307625 |
| | | GradMatchPB | 27.23 | 71.71 | 86.35 | 93.39 | 0.13171083 | 0.23819667 | 0.36952333 | 0.89125889 |
| | | Milo (Fixed) | 31.14 | 52.73 | 66.53 | 87.87 | 0.0269 | 0.1321025 | 0.28677639 | 0.850435 |
| | | Milo | 43.37 | 81.63 | 88.72 | 94.24 | 0.026376 | 0.1321091 | 0.28635171 | 0.8503878 |
| CIFAR100 | ResNet18 | Full (skyline for test accuracy) | 77.03 | 77.03 | 77.03 | 77.03 | 1.52 | 1.52 | 1.52 | 1.52 |
| | | Random (skyline for training time) | 8.937 | 21.74 | 35.03 | 61.93 | 0.0152 | 0.0836 | 0.1554 | 0.449 |
| | | Adaptive-Random (skyline for training time) | 29.99 | 61.86 | 69.14 | 74.82 | 0.016 | 0.0757 | 0.1532 | 0.4491 |
| | | Glister | 2.72 | 52.17 | 65.72 | 72.59 | 0.0872 | 0.1553 | 0.2362 | 0.5717 |
| | | CraigPB | 23.38 | 51.56 | 59.6 | 70.25 | 0.1344 | 0.193 | 0.2679 | 0.5568 |
| | | GradMatchPB | 24.4 | 54.42 | 65.93 | 73.57 | 0.07873 | 0.1384 | 0.2125 | 0.4876 |
| | | Milo (Fixed) | 9.133 | 22.57 | 36.03 | 61.97 | 0.0183 | 0.0797 | 0.1576 | 0.449 |
| | | Milo | 35.34 | 64.4 | 69.28 | 74.95 | 0.0173 | 0.082 | 0.1506 | 0.4442 |
| CIFAR100 | ResNet101 | Full (skyline for test accuracy) | 78.46 | 78.46 | 78.46 | 78.46 | 2.6034 | 2.6034 | 2.6034 | 2.6034 |
| | | Random (skyline for training time) | 5.18 | 17.81 | 30.17 | 57.93 | 0.02952 | 0.14524 | 0.2879 | 0.825375 |
| | | Adaptive-Random (skyline for training time) | 6.199 | 39.11 | 60.02 | 73.21 | 0.03112 | 0.14523 | 0.2879 | 0.825175 |
| | | Glister | 1.634 | 36.76 | 56.37 | 73.7 | 0.12156 | 0.24377 | 0.39149 | 1.17212 |
| | | CraigPB | 6.8 | 36.68 | 50.92 | 69.02 | 0.1697 | 0.2727 | 0.40612 | 0.912357 |
| | | GradMatchPB | 6.653 | 38.23 | 55.82 | 74.35 | 0.136 | 0.246 | 0.3665 | 0.91235 |
| | | Milo (Fixed) | 6.064 | 17.17 | 31.73 | 59.05 | 0.02952 | 0.14523 | 0.28797 | 0.82537 |
| | | Milo | 12.99 | 49.89 | 66.11 | 75.28 | 0.030778 | 0.145234 | 0.28797 | 0.8253 |
| TinyImageNet | ResNet18 | Full (skyline for test accuracy) | 52.44 | 52.44 | 52.44 | 52.44 | 15.411 | 15.411 | 15.411 | 15.411 |
| | | Random (skyline for training time) | 3.21 | 13 | 19.61 | 35.68 | 0.171 | 0.874 | 1.82 | 4.99 |
| | | Adaptive-Random (skyline for training time) | 0.62 | 27.34 | 38.73 | 50.3 | 0.168 | 0.832 | 1.82 | 5.12 |
| | | Glister | 0.9667 | 25.53 | 36.41 | 46.86 | 1.375 | 2.061 | 2.823 | 5.968 |
| | | CraigPB | 1.31 | 22.31 | 34.21 | 4.31 | 1.98 | 2.56 | 3.241 | 6.42 |
| | | GradMatchPB | 3.21 | 28.41 | 35.64 | 50.34 | 1.68 | 2.02 | 2.674 | 5.954 |
| | | Milo (Fixed) | 4.517 | 13.83 | 20.03 | 35.79 | 0.1795 | 0.79 | 1.82 | 4.89 |
| | | Milo | 16.33 | 31.7 | 39.99 | 50.56 | 0.1701 | 0.81 | 1.81 | 4.97 |
| TinyImageNet | ResNet101 | Full (skyline for test accuracy) | 56.32 | 56.32 | 56.32 | 56.32 | 17.2387 | 17.2387 | 17.2387 | 17.2387 |
| | | Random (skyline for training time) | 2.84 | 10.52 | 17.57 | 35.68 | 0.150245 | 0.91737 | 1.826483 | 5.437941 |
| | | Adaptive-Random (skyline for training time) | 0.6067 | 15.99 | 33.19 | 53.3 | 0.1710 | 0.91737 | 1.82648 | 5.43794 |
| | | Glister | 0.6255 | 16.3 | 30.47 | 50.11 | 1.47556 | 2.18824 | 3.17468 | 7.099 |
| | | CraigPB | 2.54 | 16.23 | 28.43 | 46.42 | 1.7016 | 2.348 | 3.32867 | 7.42342 |
| | | GradMatchPB | 2.15 | 16.84 | 33.25 | 52.31 | 1.53448 | 2.1209 | 3.0624 | 7.03117 |
| | | Milo (Fixed) | 2.76 | 11.25 | 18.97 | 35.79 | 0.150245 | 0.91737 | 1.82648 | 5.4379 |
| | | Milo | 4.64 | 24.39 | 35.16 | 55.02 | 0.19173 | 0.91737 | 1.82648 | 5.4379 |

Table 5: Data selection results for the Cifar10, Cifar100, and TinyImageNet datasets (top-1 test accuracy in %, training time in hours).

Test Accuracy Standard Deviation Results on Vision Datasets (over 5 runs):

| Dataset | Model | Selection Strategy | Std 1% | Std 5% | Std 10% | Std 30% |
| --- | --- | --- | --- | --- | --- | --- |
| CIFAR10 | ResNet18 | Full (skyline for test accuracy) | 0.43 | 0.43 | 0.43 | 0.43 |
| | | Random (skyline for training time) | 0.187 | 0.127 | 0.173 | 0.154 |
| | | Adaptive-Random (skyline for training time) | 1.981 | 0.218 | 1.87 | 0.0837 |
| | | Glister | 0.76 | 1.872 | 5.8721 | 0.863 |
| | | CraigPB | 9.31 | 4.831 | 4.313 | 0.2841 |
| | | GradMatchPB | 2.041 | 1.0841 | 0.7631 | 0.0736 |
| | | Milo (Fixed) | 1.471 | 4.31 | 4.12 | 2.191 |
| | | Milo | 0.762 | 1.712 | 0.0821 | 0.037 |
| CIFAR10 | ResNet101 | Full (skyline for test accuracy) | 0.3606 | 0.3606 | 0.3606 | 0.3606 |
| | | Random (skyline for training time) | 0.141 | 0.1527 | 0.1424 | 0.077 |
| | | Adaptive-Random (skyline for training time) | 1.295 | 0.198 | 1.28 | 0.063 |
| | | Glister | 0.51 | 10.27 | 5.56 | 0.37 |
| | | CraigPB | 3.804 | 5.996 | 2.729 | 0.2051 |
| | | GradMatchPB | 2.086 | 0.5374 | 0.02 | 0.16 |
| | | Milo (Fixed) | 3.84 | 5.64 | 6.13 | 1.01 |
| | | Milo | 2.093 | 3.691 | 0.03 | 0.02 |
| CIFAR100 | ResNet18 | Full (skyline for test accuracy) | 0.3986 | 0.3986 | 0.3986 | 0.3986 |
| | | Random (skyline for training time) | 0.2848 | 0.9871 | 0.637 | 0.7168 |
| | | Adaptive-Random (skyline for training time) | 0.539 | 0.2546 | 0.176 | 0.3915 |
| | | Glister | 0.516 | 1.32 | 0.1473 | 0.08 |
| | | CraigPB | 0.5798 | 0.8132 | 0.3323 | 0.1626 |
| | | GradMatchPB | 0.7637 | 0.3536 | 1.039 | 0.6293 |
| | | Milo (Fixed) | 0.049 | 0.56 | 0.69 | 0.84 |
| | | Milo | 0.8211 | 0.3353 | 0.1041 | 0.39 |
| CIFAR100 | ResNet101 | Full (skyline for test accuracy) | 0.31 | 0.31 | 0.31 | 0.31 |
| | | Random (skyline for training time) | 0.2716 | 1.213 | 1.273 | 0.891 |
| | | Adaptive-Random (skyline for training time) | 0.831 | 0.628 | 0.583 | 0.846 |
| | | Glister | 0.631 | 1.172 | 0.31 | 1.23 |
| | | CraigPB | 0.831 | 1.731 | 0.347 | 0.541 |
| | | GradMatchPB | 0.642 | 0.24 | 0.53 | 0.62 |
| | | Milo (Fixed) | 0.192 | 0.0931 | 0.041 | 0.47 |
| | | Milo | 1.013 | 0.764 | 0.453 | 0.036 |
| TinyImageNet | ResNet18 | Full (skyline for test accuracy) | 0.51 | 0.51 | 0.51 | 0.51 |
| | | Random (skyline for training time) | 0.53 | 0.872 | 0.41 | 0.73 |
| | | Adaptive-Random (skyline for training time) | 0.41 | 0.763 | 0.31 | 0.63 |
| | | Glister | 0.125 | 0.45 | 0.45 | 0.17 |
| | | CraigPB | 2.31 | 0.47 | 0.51 | 0.21 |
| | | GradMatchPB | 0.73 | 0.23 | 0.46 | 0.53 |
| | | Milo (Fixed) | 0.32 | 0.43 | 0.24 | 0.53 |
| | | Milo | 0.53 | 0.41 | 0.32 | 0.41 |
| TinyImageNet | ResNet101 | Full (skyline for test accuracy) | 0.34 | 0.34 | 0.34 | 0.34 |
| | | Random (skyline for training time) | 0.65 | 0.43 | 0.41 | 0.53 |
| | | Adaptive-Random (skyline for training time) | 0.64 | 0.31 | 0.24 | 0.41 |
| | | Glister | 0.53 | 0.741 | 0.452 | 0.371 |
| | | CraigPB | 0.872 | 0.735 | 0.472 | 0.531 |
| | | GradMatchPB | 0.764 | 0.31 | 0.42 | 0.53 |
| | | Milo (Fixed) | 0.736 | 0.24 | 0.43 | 0.45 |
| | | Milo | 0.41 | 0.24 | 0.41 | 0.431 |

Table 6: Standard deviation results for the Cifar10, Cifar100, and TinyImageNet datasets.

Model Training Results on Text Datasets (top-1 test accuracy in %, training time in seconds):

| Dataset | Model | Selection Strategy | Acc. 1% | Acc. 5% | Acc. 10% | Acc. 30% | Time 1% | Time 5% | Time 10% | Time 30% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TREC6 | LSTM | Full (skyline for test accuracy) | 86.6 | 86.6 | 86.6 | 86.6 | 101.835 | 101.835 | 101.835 | 101.835 |
| | | Random (skyline for training time) | 40.2 | 62.93 | 72.27 | 80.93 | 1.61 | 5.83 | 10.185 | 31.23 |
| | | Adaptive-Random (skyline for training time) | 28.13 | 61.0 | 81.53 | 85.03 | 1.623 | 5.82 | 10.171 | 31.46 |
| | | Glister | 24.52 | 65.45 | 84.21 | 86.21 | 12.621 | 17.825 | 19.452 | 46.742 |
| | | CraigPB | 25.53 | 52.87 | 83.47 | 85.43 | 33.83 | 39.841 | 43.957 | 60.742 |
| | | GradMatchPB | 23.04 | 69.93 | 85.0 | 87.0 | 8.621 | 14.363 | 18.553 | 43.848 |
| | | Milo (Fixed) | 34.2 | 66.07 | 69.26 | 81.05 | 1.614 | 5.72 | 10.174 | 31.22 |
| | | Milo | 46.87 | 76.13 | 84.3 | 87.97 | 1.613 | 5.81 | 10.182 | 31.21 |
| IMDB | LSTM | Full (skyline for test accuracy) | 89.03 | 89.03 | 89.03 | 89.03 | 613.553 | 613.553 | 613.553 | 613.553 |
| | | Random (skyline for training time) | 50.01 | 50.01 | 79.14 | 84.42 | 6.789 | 37.07 | 63.86 | 192.587 |
| | | Adaptive-Random (skyline for training time) | 49.99 | 60.04 | 88.13 | 88.29 | 6.673 | 37.14 | 63.88 | 193.632 |
| | | Glister | 50.02 | 61.25 | 85.21 | 86.81 | 110.242 | 137.256 | 166.839 | 300.512 |
| | | CraigPB | 50.02 | 66.73 | 87.54 | 87.7 | 118.553 | 137.256 | 166.839 | 295.567 |
| | | GradMatchPB | 50.02 | 58.99 | 84.66 | 85.75 | 99.116 | 137.256 | 166.839 | 347.057 |
| | | Milo (Fixed) | 50.03 | 66.54 | 81.3 | 85.06 | 6.791 | 37.06 | 63.86 | 192.59 |
| | | Milo | 50.02 | 69.15 | 88.45 | 88.56 | 6.81 | 37.1 | 63.89 | 192.57 |
| Rotten Tomatoes | LSTM | Full (skyline for test accuracy) | 79.54 | 79.54 | 79.54 | 79.54 | 177.656 | 177.656 | 177.656 | 177.656 |
| | | Random (skyline for training time) | 52.43 | 67.39 | 69.08 | 73.82 | 2.305 | 9.274 | 18.159 | 56.883 |
| | | Adaptive-Random (skyline for training time) | 49.91 | 68.44 | 78.22 | 79.47 | 1.8730 | 9.258 | 18.163 | 56.914 |
| | | Glister | 50.01 | 50.54 | 65.43 | 78.13 | 10.23 | 14.23 | 26.24 | 110.31 |
| | | CraigPB | 50.04 | 50.81 | 62.45 | 77.21 | 12.84 | 16.94 | 29.31 | 124.21 |
| | | GradMatchPB | 49.99 | 50.0 | 63.24 | 78.08 | 9.641 | 13.184 | 25.623 | 102.789 |
| | | Milo (Fixed) | 50.7 | 67.13 | 72.43 | 74.38 | 1.87 | 9.28 | 18.2 | 56.90 |
| | | Milo | 50.93 | 73.76 | 78.31 | 79.52 | 1.89 | 9.74 | 18.34 | 56.781 |
| IMDB | BERT+MLP | Full (skyline for test accuracy) | 93.27 | 93.27 | 93.27 | 93.27 | 5695.329 | 5695.329 | 5695.329 | 5695.329 |
| | | Random (skyline for training time) | 87.362 | 89.58 | 90.33 | 91.91 | 51.25 | 228.316 | 601.866 | 1672.21 |
| | | Adaptive-Random (skyline for training time) | 89.813 | 92.18 | 92.41 | 92.93 | 51.25 | 228.316 | 601.866 | 1672.21 |
| | | Glister | 86.25 | 91.21 | 92.01 | 92.43 | 721.43 | 789.21 | 898.31 | 1802.74 |
| | | CraigPB | 86.31 | 90.46 | 91.85 | 91.99 | 868.42 | 925.32 | 981.321 | 1831.31 |
| | | GradMatchPB | 86.54 | 90.42 | 91.99 | 91.83 | 682.987 | 758.215 | 892.21 | 1792.43 |
| | | Milo (Fixed) | 86.15 | 89.27 | 90.87 | 91.458 | 51.25 | 228.316 | 601.866 | 1672.21 |
| | | Milo | 90.76 | 92.06 | 92.6 | 93.144 | 51.25 | 228.316 | 601.866 | 1672.21 |

Table 7: Model training results for the TREC6, IMDB, and Rotten Tomatoes datasets.

Test Accuracy Standard Deviation on Text Datasets (over 5 runs):

| Dataset | Model | Selection Strategy | Std 1% | Std 5% | Std 10% | Std 30% |
| --- | --- | --- | --- | --- | --- | --- |
| TREC6 | LSTM | Full (skyline for test accuracy) | 0.77 | 0.77 | 0.77 | 0.77 |
| | | Random (skyline for training time) | 3.41 | 1.23 | 0.57 | 1.34 |
| | | Adaptive-Random (skyline for training time) | 3.2 | 0.43 | 1.31 | 0.43 |
| | | Glister | 2.45 | 0.74 | 0.31 | 0.54 |
| | | CraigPB | 0.791 | 0.462 | 1.821 | 0.861 |
| | | GradMatchPB | 1.7265 | 0.8762 | 0.472 | 0.876 |
| | | Milo (Fixed) | 0.352 | 0.62 | 0.454 | 0.726 |
| | | Milo | 0.25 | 0.36 | 0.21 | 0.31 |
| IMDB | LSTM | Full (skyline for test accuracy) | 0.3 | 0.3 | 0.3 | 0.3 |
| | | Random (skyline for training time) | 0.21 | 0.31 | 0.41 | 1.2 |
| | | Adaptive-Random (skyline for training time) | 1.241 | 0.21 | 0.41 | 0.21 |
| | | Glister | 0.862 | 0.21 | 0.21 | 0.1 |
| | | CraigPB | 0.731 | 1.21 | 0.451 | 0.41 |
| | | GradMatchPB | 0.817 | 0.441 | 0.41 | 1.22 |
| | | Milo (Fixed) | 1.51 | 0.42 | 0.34 | 0.51 |
| | | Milo | 0.45 | 0.21 | 0.104 | 0.05 |
| Rotten Tomatoes | LSTM | Full (skyline for test accuracy) | 0.34 | 0.34 | 0.34 | 0.34 |
| | | Random (skyline for training time) | 1.21 | 1.41 | 1.97 | 0.31 |
| | | Adaptive-Random (skyline for training time) | 0.0871 | 0.16 | 0.31 | 0.21 |
| | | Glister | 0.761 | 0.43 | 0.34 | 0.31 |
| | | CraigPB | 0.826 | 0.12 | 0.43 | 0.41 |
| | | GradMatchPB | 0.52 | 0.2 | 0.21 | 0.12 |
| | | Milo (Fixed) | 0.7761 | 0.31 | 0.1 | 0.31 |
| | | Milo | 0.71 | 0.41 | 0.1 | 0.21 |
| IMDB | BERT+MLP | Full (skyline for test accuracy) | 0.14 | 0.14 | 0.14 | 0.14 |
| | | Random (skyline for training time) | 0.38 | 0.981 | 1.31 | 0.21 |
| | | Adaptive-Random (skyline for training time) | 0.21 | 0.24 | 0.31 | 0.51 |
| | | Glister | 0.871 | 0.31 | 0.53 | 0.72 |
| | | CraigPB | 0.87 | 0.66 | 0.2 | 0.731 |
| | | GradMatchPB | 0.54 | 0.53 | 0.23 | 0.61 |
| | | Milo (Fixed) | 0.21 | 0.21 | 0.12 | 0.3 |
| | | Milo | 0.43 | 0.12 | 0.3 | 0.12 |

Table 8: Standard deviation results for the TREC6, IMDB, and Rotten Tomatoes datasets.

H.5 Hyper-parameter Ordering Retention Analysis

We evaluate the effectiveness of Milo and the considered subset selection baselines in preserving the hyper-parameter ordering obtained by full-data tuning. In a nutshell, we determine whether the original ordering of hyper-parameter configurations is maintained despite using small subsets for the model training runs involved in hyper-parameter tuning. To examine this, we experiment on the TREC6 [Li and Roth, 2002, Hovy et al., 2001] dataset using an LSTM model. The hyper-parameter search for the TREC6 dataset is a grid search over 108 configurations of the learning rate, optimizer, LSTM hidden size, training batch size, and the number of final fully connected layers. Table 9 shows the Kendall Tau correlation between the hyper-parameter orderings obtained using 1%, 5%, and 10% subsets selected by Milo, Random, Adaptive-Random, Automata, and CraigPB and the full-data hyper-parameter ordering on the TREC6 dataset. Results in Table 9 demonstrate that Milo is more effective than the considered baselines at preserving the hyper-parameter ordering even when using small subsets.
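The retention metric above can be made concrete with a minimal, self-contained Kendall tau computation over two hypothetical lists of validation accuracies for the same configurations. The scores below are illustrative only (not from the paper), and in practice a library routine such as `scipy.stats.kendalltau` would be used instead of this hand-rolled version:

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant) / total pairs (no ties assumed)."""
    n = len(a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1   # pair ranked the same way by both score lists
        elif s < 0:
            discordant += 1   # pair ranked in opposite ways
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical validation accuracies of 5 hyper-parameter configurations,
# once from full-data tuning and once from a small-subset run.
full_data = [86.8, 84.4, 82.1, 79.5, 75.2]
subset    = [85.5, 84.9, 80.2, 81.0, 74.8]  # one adjacent pair swapped

print(kendall_tau(full_data, subset))  # → 0.8 (9 of 10 pairs concordant)
```

A tau of 1 means the subset preserves the full-data ordering exactly; values near 0 mean the ordering is essentially lost.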

Hyper-parameter Ordering Retention Capability (Kendall Tau values), TREC6 dataset, LSTM model.

| Strategy | 1% Budget | 5% Budget | 10% Budget |
|---|---|---|---|
| Milo | 0.4321 | 0.521 | 0.6342 |
| Random | 0.0679 | 0.199 | 0.2605 |
| Adaptive-Random | 0.313 | 0.440 | 0.5313 |
| Automata | 0.3484 | 0.4764 | 0.4742 |
| CraigPB | 0.325 | 0.4521 | 0.4531 |

Table 9: Kendall Tau correlation between the hyper-parameter orderings obtained using 1%, 5%, and 10% subsets (TREC6 dataset, LSTM model) and the full-data hyper-parameter ordering.

H.6 Test-Accuracies, Training times, and Standard deviations for Hyper-Parameter Tuning Experiments

Table 10 shows the top-1 test accuracies and tuning times of Milo and the other considered baselines for hyper-parameter tuning on the Cifar10 and TREC6 datasets with subset sizes of 1%, 5%, 10%, and 30%.

Hyper-parameter Tuning Results. Columns give top-1 test accuracy of the model (%) and tuning time (in hrs) at budgets of 1%, 5%, 10%, and 30%.

| Dataset | Model | Search+Scheduler | Selection Strategy | Acc. 1% | Acc. 5% | Acc. 10% | Acc. 30% | Time 1% | Time 5% | Time 10% | Time 30% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TREC6 | LSTM | Random+HB | Full (skyline for test accuracy) | 86.8 | 86.8 | 86.8 | 86.8 | 0.64 | 0.64 | 0.64 | 0.64 |
| | | | Random (skyline for training time) | 40.8 | 60.2 | 78.0 | 85.5 | 0.01 | 0.03 | 0.07 | 0.18 |
| | | | Adaptive-Random (skyline for training time) | 45.76 | 49.43 | 74.32 | 86.6 | 0.01 | 0.03 | 0.06 | 0.18 |
| | | | CraigPB | 41.33 | 64.33 | 77.0 | 82.0 | 0.02 | 0.07 | 0.08 | 0.19 |
| | | | Automata | 47.3 | 66.2 | 82.0 | 86.6 | 0.01 | 0.06 | 0.09 | 0.21 |
| | | | Milo (Fixed) | 54.32 | 79.83 | 85.51 | 86.43 | 0.01 | 0.04 | 0.06 | 0.18 |
| | | | Milo | 76.2 | 86.6 | 86.54 | 86.6 | 0.01 | 0.03 | 0.06 | 0.18 |
| TREC6 | LSTM | TPE+HB | Full (skyline for test accuracy) | 84.4 | 84.4 | 84.4 | 84.4 | 0.64 | 0.64 | 0.64 | 0.64 |
| | | | Random (skyline for training time) | 40.8 | 60.2 | 78 | 85.5 | 0.01 | 0.03 | 0.06 | 0.18 |
| | | | Adaptive-Random (skyline for training time) | 52.32 | 64.32 | 79.2 | 83.42 | 0.01 | 0.04 | 0.06 | 0.18 |
| | | | CraigPB | 41.33 | 64.33 | 77 | 82 | 0.02 | 0.07 | 0.08 | 0.19 |
| | | | Automata | 47.2 | 66.2 | 82 | 86.6 | 0.01 | 0.06 | 0.09 | 0.21 |
| | | | Milo (Fixed) | 54.32 | 79.83 | 82.51 | 85.43 | 0.01 | 0.03 | 0.06 | 0.18 |
| | | | Milo | 68.2 | 84.26 | 84.34 | 86.46 | 0.01 | 0.03 | 0.06 | 0.18 |
| CIFAR10 | ResNet18 | Random+HB | Full (skyline for test accuracy) | 95.35 | 95.35 | 95.35 | 95.35 | 36.11 | 36.11 | 36.11 | 36.11 |
| | | | Random (skyline for training time) | 87.21 | 88.95 | 90.21 | 94.21 | 0.4779 | 1.8681 | 3.7917 | 10.900 |
| | | | Adaptive-Random (skyline for training time) | 93.1 | 94.32 | 94.78 | 95.21 | 0.47790 | 1.86814 | 3.79172 | 10.9002 |
| | | | CraigPB | 92.43 | 93.5 | 94.2 | 94.3 | 0.5434 | 1.92925 | 3.83061 | 11.06829 |
| | | | Automata | 93.7 | 93.8 | 94.21 | 95.31 | 0.67548 | 1.91481 | 3.81394 | 11.0127 |
| | | | Milo (Fixed) | 86.21 | 88.95 | 94.21 | 95.21 | 0.47790 | 1.868144 | 3.791 | 10.909 |
| | | | Milo | 94.34 | 95.34 | 95.21 | 95.34 | 0.4778 | 1.8644 | 3.796 | 10.9889 |
| CIFAR10 | ResNet18 | TPE+HB | Full (skyline for test accuracy) | 95.24 | 95.24 | 95.24 | 95.24 | 36.12 | 36.12 | 36.12 | 36.12 |
| | | | Random (skyline for training time) | 86.38 | 89.32 | 92.21 | 94.21 | 0.48 | 1.87 | 3.79 | 10.9 |
| | | | Adaptive-Random (skyline for training time) | 93.12 | 93.72 | 94.78 | 95.21 | 0.49 | 1.89 | 3.78 | 11.1 |
| | | | CraigPB | 92.23 | 93.5 | 94.2 | 94.3 | 0.54 | 1.93 | 3.83 | 11.07 |
| | | | Automata | 93.7 | 93.98 | 94.21 | 95.31 | 0.68 | 1.91 | 3.81 | 11.01 |
| | | | Milo (Fixed) | 87.11 | 90.21 | 94.21 | 95.21 | 0.48 | 1.87 | 3.79 | 10.9 |
| | | | Milo | 93.24 | 94.28 | 95.1 | 95.21 | 0.52 | 1.89 | 3.77 | 10.89 |

Table 10: Hyper-parameter tuning results.

Appendix I Ablation Studies

I.1 Performance Comparison of Existing Pre-trained Transformer Models as Feature Encoders

(a) Comparison of Vision Transformers for Subset Selection (b) Comparison of Language Models for Subset Selection

Figure 11: Sub-figure (a) shows the performance of the ResNet18 model trained on a fixed 5% subset of the Cifar100 dataset selected by maximizing the Facility Location function using different pre-trained vision transformer models. Sub-figure (b) shows the performance of the LSTM model trained on a fixed 5% subset of the TREC6 dataset selected by maximizing the Facility Location function using different pre-trained language models.

In this ablation study, we evaluate the effectiveness of various existing pre-trained transformer models as feature encoders for subset selection.

Optimal Feature Encoders for Vision Datasets: For vision datasets, we test with the Dino-ViTb16 [Caron et al., 2021], vit-large-patch16-224-in21k [Dosovitskiy et al., 2021], and clip-ViT-L-14 [Radford et al., 2021] models available from HuggingFace [Wolf et al., 2019] as the feature encoder. For the CLIP model, we use the output embedding of CLIP ViT's final projection layer as the feature representation. For both the DINO and ViT models, we test with two different types of embeddings as feature vectors: a) the final-layer CLS token output embedding as the feature representation; b) the average of the final-layer output embeddings of the constituent patches as the feature representation. We denote the DINO model using the CLS token output as the feature representation vector by DINO (CLS), and similarly denote the ViT model using the CLS token output by ViT (CLS). Sub-figure 11(a) shows the performance of the ResNet18 model trained on a fixed 5% subset of the Cifar100 dataset selected by maximizing the facility location function using different pre-trained vision transformer models as the feature encoder. In this experiment, we used the facility location set function because, for fixed subsets of small sizes, facility location yielded optimal results (see Sub-figure 3(a)). The results in Sub-figure 11(a) indicate that the DINO model using the final-layer CLS token embedding as the feature representation gave the best model performance. Hence, in our experiments, we use the DINO model and compute feature representations for images from the final-layer CLS token output embedding.
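The two embedding variants (CLS token vs. patch-mean) can be illustrated on a stand-in hidden-state tensor. The shapes below mimic a ViT-B/16-style encoder (197 tokens, 768 dimensions for 224x224 inputs), but the values are random placeholders, not actual DINO outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a transformer's final-layer output: (batch, 1 + num_patches, dim).
# Position 0 holds the CLS token; positions 1.. hold the patch tokens.
hidden_states = rng.normal(size=(4, 197, 768))

# Option (a): CLS token embedding as the feature representation.
cls_features = hidden_states[:, 0, :]                 # shape (4, 768)

# Option (b): mean of the patch-token embeddings as the feature representation.
mean_features = hidden_states[:, 1:, :].mean(axis=1)  # shape (4, 768)

print(cls_features.shape, mean_features.shape)
```

Either pooling yields one fixed-size vector per image, which is what the similarity computations in the subset selection step consume.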

Optimal Feature Encoders for Text Datasets: For text datasets, we test with the all-distilroberta-v1 [Liu et al., 2020] and all-mpnet-base-v2 [Song et al., 2020] models available from the Sentence Transformers package [Reimers and Gurevych, 2019] as the feature encoder. We employ pre-trained models from the Sentence Transformers package because they have been fine-tuned to produce representations that accurately capture the similarity between phrases for natural language inference (NLI) tasks. As in the SBERT [Reimers and Gurevych, 2019] work, we compute sentence representations by averaging the final-layer output embeddings of all constituent tokens. Sub-figure 11(b) shows the performance of the LSTM model trained on a fixed 5% subset of the TREC6 dataset selected by maximizing the facility location function using different pre-trained language models as the feature encoder. In this experiment, we used the facility location set function because, for fixed subsets of small sizes, facility location yielded optimal results (see Sub-figure 3(a)). Results in Sub-figure 11(b) show that all-distilroberta-v1 resulted in better model performance than all-mpnet-base-v2. Hence, in our experiments, we use all-distilroberta-v1 as the feature encoder for text datasets.

I.2 Optimal Similarity Metric Analysis

In this experiment, we aim to analyze the performance of different similarity metrics for the computation of similarity between data samples. As part of our experiment, we evaluate the following similarity metrics:

Cosine Similarity: Given two data samples $e_1, e_2$ with feature representations $r_1, r_2$, the cosine similarity between them is:

$$\operatorname{Cosine-Similarity}(r_1, r_2) = 0.5 + 0.5 \cdot \frac{r_1 \cdot r_2}{\|r_1\|\,\|r_2\|} \tag{10}$$

We perform additive scaling of the cosine similarity to obtain non-negative similarity values.
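A minimal sketch of the shifted cosine similarity of Equation (10), on two toy vectors:

```python
import numpy as np

def scaled_cosine_similarity(r1, r2):
    """Cosine similarity shifted into [0, 1] as in Equation (10)."""
    cos = np.dot(r1, r2) / (np.linalg.norm(r1) * np.linalg.norm(r2))
    return 0.5 + 0.5 * cos

a = np.array([1.0, 0.0])
b = np.array([-1.0, 0.0])
print(scaled_cosine_similarity(a, a))  # identical vectors → 1.0
print(scaled_cosine_similarity(a, b))  # opposite vectors  → 0.0
```

The additive scaling matters because the submodular objectives used for selection assume non-negative pairwise similarities.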

Dot Product: Given two data samples $e_1, e_2$ with feature representations $r_1, r_2$, the similarity between them is the dot product $r_1 \cdot r_2$. We again apply additive scaling to the obtained dot-product values to ensure that all pairwise similarity values are non-negative.

RBF Kernel: Given two data samples $e_1, e_2$ with feature representations $r_1, r_2$, the RBF-kernel similarity between them is:

$$\operatorname{RBF-Kernel}(r_1, r_2) = \exp\!\left(-\frac{\|r_1 - r_2\|^2}{kw \cdot mean\_dist}\right) \tag{11}$$

In the above equation, $kw$ is a hyper-parameter that controls how quickly the similarity between data samples saturates, and $mean\_dist$ is the mean of the pairwise distances between all samples in the dataset. For small $kw$ values, the similarity between data samples drops dramatically with even a small increase in the distance between them; for large $kw$ values, the similarity drops slowly as the distance increases. We test with $kw$ values of 0.01, 0.05, 0.1, 0.5, and 1 in this experiment.
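A sketch of the RBF similarity of Equation (11) and the effect of $kw$. The convention used for `mean_dist` below (mean pairwise squared distance) is an assumption for illustration, since the text does not pin down squared vs. unsquared distances:

```python
import numpy as np

def rbf_kernel(r1, r2, kw, mean_dist):
    """RBF similarity of Equation (11); kw controls how fast similarity decays."""
    return np.exp(-np.sum((r1 - r2) ** 2) / (kw * mean_dist))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))

# mean_dist: mean pairwise (squared) distance over the dataset.
diffs = X[:, None, :] - X[None, :, :]
mean_dist = (diffs ** 2).sum(-1).mean()

s_small_kw = rbf_kernel(X[0], X[1], kw=0.01, mean_dist=mean_dist)
s_large_kw = rbf_kernel(X[0], X[1], kw=1.0, mean_dist=mean_dist)

# Smaller kw → similarity drops faster with distance, so it is smaller here.
print(s_small_kw < s_large_kw)
```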

Similarity Metric for Vision Datasets: Table 11 shows the test accuracies of a ResNet18 model trained on a fixed 5% Cifar100 subset selected by maximizing the facility location function, using the DINO model with final-layer CLS token outputs as feature representations and different similarity metrics. The results demonstrate that cosine similarity yields a better-performing model than the other similarity metrics; hence, for vision experiments, we use cosine similarity as the similarity metric. We do not report dot-product results in Table 11 because the CLS token embedding outputs of the DINO model are normalized, so dot product and cosine similarity give the same similarity values.

| Similarity Metric | Test Accuracy |
|---|---|
| Cosine Similarity | 28.68 ± 0.1103 |
| RBF Kernel ($kw$ = 0.01) | 27.14 ± 0.2051 |
| RBF Kernel ($kw$ = 0.05) | 27.88 ± 0.8132 |
| RBF Kernel ($kw$ = 0.1) | 28.11 ± 0.1499 |
| RBF Kernel ($kw$ = 0.5) | 27.61 ± 0.601 |
| RBF Kernel ($kw$ = 1) | 27.62 ± 0.6576 |

Table 11: Comparison of accuracies of the ResNet18 model trained on a fixed 5% Cifar100 subset selected by maximizing the facility location function, using the DINO model with CLS token outputs as feature representations and different similarity metrics.

Similarity Metric for Text Datasets: Table 12 shows the test accuracies of the LSTM model trained on a fixed 5% TREC6 subset selected by maximizing the facility location function, using all-distilroberta-v1 as the feature encoder and different similarity metrics. The results demonstrate that cosine similarity yields a better-performing model than the other similarity metrics; hence, for text experiments, we use cosine similarity as the similarity metric.

| Similarity Metric | Test Accuracy |
|---|---|
| Cosine Similarity | 72.1 ± 2.1508 |
| Dot Product | 70.38 ± 4.5514 |
| RBF Kernel ($kw$ = 0.01) | 63.58 ± 9.8051 |
| RBF Kernel ($kw$ = 0.05) | 67.58 ± 2.561 |
| RBF Kernel ($kw$ = 0.1) | 66.92 ± 2.715 |
| RBF Kernel ($kw$ = 0.5) | 70.7 ± 2.291 |
| RBF Kernel ($kw$ = 1) | 70.54 ± 4.518 |

Table 12: Comparison of accuracies of the LSTM model trained on a fixed 5% TREC6 subset selected by maximizing the facility location function, using all-distilroberta-v1 as the feature encoder and different similarity metrics.

I.3 SGE (Graph-Cut) vs SGE (Facility Location)

Figure 12 shows the convergence curves of models trained using SGE with Graph-Cut and SGE with Facility Location on various datasets using different subset sizes. Results show that SGE with Graph-Cut achieves faster initial convergence than SGE with Facility Location across all the datasets considered and for different subset sizes. The reason is that the sum-sum formulation in graph-cut (Equation (7)), as opposed to the sum-max formulation in facility location (Equation (6)), results in the selection of a larger number of easy samples. More specifically, with the sum-max formulation of facility location, a single representative sample from a cluster in the dataset is sufficient to achieve a high facility location value for that cluster, and having a representative sample from every cluster maximizes the facility location value further. This prevents facility location from selecting samples only from densely populated regions and yields broader coverage of the dataset. In contrast, with the sum-sum formulation of the graph-cut function, selecting more samples from dense regions leads to high graph-cut values. Thus, graph-cut tends to select subsets containing a larger number of easy samples from very dense regions of the dataset than facility location does.
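The sum-max vs. sum-sum distinction can be made concrete on a toy similarity matrix. The graph-cut form below keeps only the representation (sum-sum) term and omits the usual redundancy penalty, so it is a simplified sketch of the objectives, not the paper's full Equations (6) and (7):

```python
import numpy as np

def facility_location(S, sim):
    """Sum-max: each point contributes its best similarity to the selected set S."""
    return sim[:, S].max(axis=1).sum()

def graph_cut_gain(S, sim):
    """Sum-sum representation term: each point contributes its total similarity
    to the selected set (the -lambda penalty over pairs within S is omitted)."""
    return sim[:, S].sum(axis=1).sum()

rng = np.random.default_rng(1)
feats = rng.normal(size=(100, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
sim = 0.5 + 0.5 * feats @ feats.T   # scaled cosine similarities in [0, 1]

S = [0, 25, 50, 75]                 # hypothetical 4-element subset
print(facility_location(S, sim), graph_cut_gain(S, sim))
```

With non-negative similarities the sum-sum value dominates the sum-max value for the same subset, which is why adding near-duplicate (easy, dense-region) samples keeps paying off under graph-cut but not under facility location.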

(a) Cifar100 (5%) ResNet18 (b) Cifar10 (5%) ResNet18 (c) TinyImageNet (5%) ResNet18 (d) TREC6 (30%) LSTM (e) IMDB (30%) LSTM (f) Rotten Tomatoes (10%) LSTM

Figure 12: Comparison of the initial convergence of SGE with Graph-Cut and SGE with Facility Location on a variety of datasets using different subset sizes. Results show that SGE with Graph-Cut achieves faster initial convergence compared to SGE with Facility Location across all the datasets considered and for different subset sizes.

I.4 SGE (Graph-Cut) vs. WRE (Graph-Cut)

Figure 13 shows the convergence curves of models trained using SGE with Graph-Cut and WRE with Graph-Cut on various datasets using different subset sizes. Results show that SGE with Graph-Cut achieves faster initial convergence than WRE with Graph-Cut across all the datasets considered and for different subset sizes. The reason is that SGE emphasizes exploitation more than WRE. More specifically, SGE selects subsets with high set function values because of the approximation guarantees of the stochastic-greedy algorithm [Mirzasoleiman et al., 2015], whereas with WRE there is no guarantee that subsets sampled from the constructed probability distribution 𝒑 have high set function values. Since subsets with higher Graph-Cut function values correspond to subsets with more easy samples, SGE with Graph-Cut achieves superior initial model convergence compared to WRE with Graph-Cut.
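A minimal sketch of stochastic-greedy maximization [Mirzasoleiman et al., 2015] applied to the facility-location function. The per-step candidate-sample size of (n/k)·log(1/ε) follows the original algorithm; the efficient bookkeeping of real implementations is simplified here:

```python
import numpy as np

def stochastic_greedy(sim, k, eps=0.1, seed=0):
    """Stochastic-greedy maximization of facility location over a similarity
    matrix sim (n x n, non-negative). Each step evaluates marginal gains only
    on a random candidate sample instead of the whole ground set."""
    rng = np.random.default_rng(seed)
    n = sim.shape[0]
    sample_size = max(1, int(np.ceil(n / k * np.log(1.0 / eps))))
    selected = []
    best = np.zeros(n)          # best[i] = max similarity of i to selected set
    remaining = set(range(n))
    for _ in range(k):
        cand = rng.choice(sorted(remaining),
                          size=min(sample_size, len(remaining)), replace=False)
        # Marginal gain of candidate c: sum_i max(0, sim[i, c] - best[i]).
        gains = np.maximum(sim[:, cand] - best[:, None], 0).sum(axis=0)
        c = int(cand[int(np.argmax(gains))])
        selected.append(c)
        best = np.maximum(best, sim[:, c])
        remaining.discard(c)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
sim = 0.5 + 0.5 * X @ X.T
subset = stochastic_greedy(sim, k=10)
print(len(subset))  # → 10
```

Because each step only looks at a random candidate pool, the cost per step is O(n · sample_size) rather than O(n²), while retaining a (1 − 1/e − ε) approximation guarantee in expectation.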

(a) Cifar100 (5%) ResNet18 (b) Cifar10 (5%) ResNet18 (c) TinyImageNet (5%) ResNet18 (d) TREC6 (30%) LSTM (e) IMDB (30%) LSTM (f) Rotten Tomatoes (10%) LSTM

Figure 13: Comparison of the initial convergence of SGE with Graph-Cut and WRE with Graph-Cut on a variety of datasets using different subset sizes. Results show that SGE with Graph-Cut achieves faster initial convergence compared to WRE with Graph-Cut across all the datasets considered and for different subset sizes.

Tuning of the hyper-parameter: κ. Mean test accuracy of the model (for 5 runs) for different graph-cut intervals κ.

| Dataset | Model | Budget | κ = 0 | κ = 1/12 | κ = 1/10 | κ = 1/8 | κ = 1/6 | κ = 1/4 | κ = 1/2 | κ = 1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Cifar100 | ResNet101 | 1% | 4.116% | 9.27% | 11.795% | 12.135% | 12.985% | 10.68% | 11.11% | 3.48% |
| | | 5% | 45.19% | 45.7% | 45.43% | 47.15% | 49.885% | 49.01% | 48.77% | 35.76% |
| | | 10% | 60.4% | 64.08% | 64.07% | 64.84% | 66.11% | 64.97% | 62.43% | 53.41% |
| | | 30% | 74.28% | 74.39% | 75.24% | 74.27% | 75.28% | 74.68% | 73.18% | 71.48% |
| Cifar10 | ResNet101 | 1% | 38.18% | 39.23% | 34.61% | 38.25% | 43.37% | 43.92% | 38.885% | 38.76% |
| | | 5% | 71.77% | 74.13% | 77.98% | 75.14% | 81.63% | 77.87% | 75.18% | 63.9% |
| | | 10% | 87.21% | 85.74% | 87.32% | 87.96% | 88.72% | 88.17% | 86.61% | 83.25% |
| | | 30% | 93.79% | 93.82% | 93.52% | 93.75% | 94.24% | 93.73% | 93.5% | 91.23% |

Table 13: Mean test set accuracy of ResNet101 trained on the Cifar10 and Cifar100 datasets with different subset sizes (of 1%, 5%, 10%, and 30%) for 200 epochs using curriculum-based data exploration with various curriculum intervals.

(a) ResNet18 Convergence on Cifar10 using 5% subsets (b) ResNet18 Convergence on TinyImageNet using 5% subsets

Figure 14: Sub-figure (a) shows the convergence rate of the ResNet18 model on the Cifar10 dataset using 5% subsets selected by SGE with Graph-Cut, WRE with Disparity-Min, and Milo with curriculum exploration. Sub-figure (b) shows the convergence rate of the ResNet18 model on the TinyImageNet dataset using 5% subsets selected by SGE with Graph-Cut, WRE with Disparity-Min, and Milo with curriculum exploration.

I.5 Advantages of using a Curriculum for Data Exploration

Results given in Table 13 show that κ = 0 (WRE with disparity-min) and κ = 1 (SGE with graph-cut) perform poorly compared to using a curriculum with an intermediate value 0 < κ < 1, thereby proving the efficiency of curriculum-based data exploration over WRE in achieving better-performing models. Further, we showcase the efficiency of curriculum-based data exploration in achieving faster convergence by showing the convergence curves of the ResNet18 model trained on the TinyImageNet and Cifar10 datasets using 5% subsets in Figure 14. The convergence curves show that curriculum-based data exploration converges faster than WRE with disparity-min while achieving better performance than SGE with the graph-cut function.

I.5.1 Finding Optimal Curriculum for Data Exploration

In order to find the optimal value of κ, we test a number of different values, where κ represents the fraction of the total number of epochs for which the model is initially trained using SGE with the graph-cut function. We present the mean test accuracies of the ResNet101 model trained for 200 epochs on different subset sizes of the Cifar10 and Cifar100 datasets in Table 13. Based on our observations, κ = 1/6 is the optimal value and results in the highest model accuracy. In Table 13, the results given in the column κ = 0 correspond to the ResNet101 model trained simply using WRE with the disparity-min set function, without any curriculum. At the same time, very large values of κ prioritize SGE for longer and result in poor performance, as SGE is shown to be less effective than WRE based on the results given in Sub-figure 5.

I.6 Optimal R Analysis

In order to find the optimal R value (R signifies how frequently we select a new subset using Milo), we experiment with R values of [1, 2, 5, 10] on the Cifar100 dataset using the ResNet18 model for subset sizes of 10% and 30%. The mean test accuracy of the ResNet18 model obtained using different R values is presented in Table 14. Results show that using R = 1, i.e., selecting a new subset every epoch, results in a better-performing model than using higher values of R. Further, with small subset sizes, the gap between the model's performance obtained using R = 1 and other values of R is even more significant. This shows that it is paramount to select new subsets as frequently as possible when using small subset sizes to achieve the best possible performance.

Tuning of the hyper-parameter: R. Mean test accuracy of the model (for 5 runs).

| Dataset | Model | Budget | R = 1 | R = 2 | R = 5 | R = 10 |
|---|---|---|---|---|---|---|
| Cifar100 | ResNet18 | 10% | 69.28% ± 0.1041 | 69.21% ± 3.82 | 66.63% ± 3.38 | 64.07% ± 3.32 |
| | | 30% | 74.95% ± 0.39 | 74.13% ± 0.36 | 73.55% ± 0.16 | 72.72% ± 0.01 |

Table 14: Mean test set accuracy of ResNet18 trained on the Cifar100 dataset for subset sizes of 10% and 30% selected using Milo for 200 epochs, for different values of R.

I.7 Effectiveness of WRE

In our previous discussions, we delved into how Stochastic Greedy Exploration (SGE) tends to favor subsets with near-optimal values, providing room for a modest degree of exploration. In stark contrast, Weighted Random Exploration (WRE) advocates for a more thorough exploration of data while simultaneously prioritizing the inclusion of highly informative samples.
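WRE's exploration behaviour can be sketched as weighted sampling without replacement from a distribution p over the training set. The importance scores below are random placeholders; Milo's actual construction of p is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 100

# Hypothetical per-sample importance scores (e.g., submodular marginal gains),
# normalized into a probability distribution p over the ground set.
scores = rng.random(n)
p = scores / scores.sum()

# Weighted sampling without replacement: informative samples are favored,
# but every sample retains some chance of being explored.
subset = rng.choice(n, size=k, replace=False, p=p)
print(len(set(subset)))  # → 100
```

This is the sense in which WRE trades a little exploitation for broader coverage, relative to SGE's near-deterministic maximization.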

To reinforce the effectiveness of WRE, we carry out an empirical comparison with a variant of SGE that promotes increased exploration. This SGE variant operates as follows: we designate a subset size 𝑘′ to be selected by SGE, where 𝑘′ ≤ 𝑘 and 𝑘 denotes the overall budget. The remaining samples are selected at random. As training progresses, the size of the subset chosen by SGE (𝑘′) is steadily reduced, thereby facilitating a broader scope for exploration.

| Dataset | Model | Fraction | Strategy | Test Accuracy |
|---|---|---|---|---|
| CIFAR100 | ResNet18 | 0.05 | MILO | 64.4 |
| | | 0.05 | SGE Variant (more exploration) | 59.02 |
| | | 0.1 | MILO | 69.28 |
| | | 0.1 | SGE Variant (more exploration) | 66.24 |

Table 15: Comparison of test accuracy between MILO and the SGE variant (that favors more exploration) on CIFAR100 with the ResNet18 model, validating the effectiveness of WRE in MILO.

| Dataset | Model | Fraction | Strategy | Test Accuracy |
|---|---|---|---|---|
| CIFAR10 | ResNet18 | 0.05 | MILO | 88.37 |
| | | 0.05 | SGE Variant (more exploration) | 84.38 |
| | | 0.1 | MILO | 91.11 |
| | | 0.1 | SGE Variant (more exploration) | 88.88 |

Table 16: Comparison of test accuracy between MILO and the SGE variant (that favors more exploration) on CIFAR10 with the ResNet18 model, validating the effectiveness of WRE in MILO.

We contrast the performance of WRE with the aforementioned SGE variant, which mixes samples chosen by SGE with randomly selected samples. For the SGE variant, we use a cosine decay schedule, allowing the ratio of SGE-chosen samples to total samples to decay from 1 to 0 over the course of training.
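The cosine decay of the SGE budget 𝑘′ can be sketched as follows (assuming, as an illustration, that the decay is applied per epoch and rounded to an integer budget):

```python
import math

def sge_budget(epoch, total_epochs, k):
    """Cosine decay of the SGE-selected portion k' from k down to 0;
    the remaining k - k' samples are drawn uniformly at random."""
    ratio = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return int(round(k * ratio))

k = 100
# Start of training: all k samples via SGE; midpoint: half; end: none.
print(sge_budget(0, 200, k), sge_budget(100, 200, k), sge_budget(200, 200, k))
# → 100 50 0
```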

The results obtained on the CIFAR100 and CIFAR10 datasets for both WRE and the SGE variant are presented in Tables 15 and 16. The outcomes suggest that WRE is more effective than SGE even when the latter allows for increased exploration. Consequently, we provide empirical evidence that WRE serves as a critical component of Milo, contributing significantly to enhanced convergence.

I.8 Comparison with Self-supervised Data Pruning Metric

The self-supervised data pruning metric [Sorscher et al., 2022] is applied to prune a static subset from the entire dataset. As stated previously in Section 3, the utilization of such fixed data subsets necessitates selecting larger proportions in order to approach optimal performance levels, akin to full data training. This requirement, unfortunately, compromises efficiency.

| Dataset | Model | Fraction | Strategy | Test Accuracy | SpeedUp |
|---|---|---|---|---|---|
| CIFAR100 | ResNet18 | 0.3 | MILO | 74.95 | 3.4 |
| | | 0.3 | Self-Supervised Metric | 60.41 | 3.4 |
| | | 0.7 | Self-Supervised Metric | 74.31 | 1.4 |

Table 17: Comparison of MILO with the self-supervised pruning metric [Sorscher et al., 2022] on the CIFAR100 dataset using the ResNet18 model.

We proceed to detail the results obtained from the CIFAR100 dataset using the ResNet18 model, in which we compare MILO and the self-supervised data pruning metric [Sorscher et al., 2022]. The findings, presented in Table 17, indicate a more substantial loss in performance when utilizing a 30% fixed subset as determined by the self-supervised data pruning metric, as compared to using MILO.

Moreover, to attain performance levels similar to those achieved by Milo using a 30% subset, it becomes necessary to select a 70% subset via the self-supervised data pruning metric. This, in turn, results in forfeiting the speedups achieved.
