Title: Ingenious: Using Informative Data Subsets for Efficient Pre-Training of Language Models

URL Source: https://arxiv.org/html/2305.06677

H S V N S Kowndinya Renduchintala¹, Krishnateja Killamsetty², Sumit Bhatia³

Milan Aggarwal³, Ganesh Ramakrishnan¹, Rishabh Iyer², Balaji Krishnamurthy³

¹ Indian Institute of Technology Bombay, India

² University of Texas at Dallas, USA

³ Media and Data Science Research (MDSR) Lab, Adobe Inc., India

Abstract

A salient characteristic of pre-trained language models (PTLMs) is a remarkable improvement in their generalization capability and the emergence of new capabilities with increasing model capacity and pre-training dataset size. Consequently, we are witnessing the development of enormous models pushing the state-of-the-art. It is, however, imperative to realize that this inevitably leads to prohibitively long training times, extortionate computing costs, and a detrimental environmental impact. Significant efforts are underway to make PTLM training more efficient through innovations in model architectures, training pipelines, and loss function design, with scant attention being paid to optimizing the utility of training data. The key question we ask is whether it is possible to train PTLMs using only highly informative subsets of the training data while maintaining downstream performance. Building upon recent progress in informative data subset selection, we show how submodular optimization can be employed to select highly representative subsets of the training corpora, and demonstrate that the proposed framework can be applied to efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a fraction of the data. Further, we perform a rigorous empirical evaluation to show that the resulting models achieve up to ~99% of the performance of fully-trained models. We make our framework publicly available at https://github.com/Efficient-AI/ingenious.

Ingenious: Using Informative Data Subsets for Efficient Pre-Training of Language Models

1 Introduction

Pre-trained language models (PTLMs) (Devlin et al., 2019; Radford et al., 2019; Yang et al., 2020; Brown et al., 2020; Raffel et al., 2020) have revolutionized the field of natural language processing (NLP), becoming the default choice for a wide array of NLP tasks. The versatility of PTLMs, however, comes at a significant cost. For instance, training GPT-3 (Brown et al., 2020) cost an estimated $12 million and produced roughly 1.2 million pounds of CO₂ emissions (Kahn, 2021). Megatron-Turing NLG (Smith et al., 2022), a 530-billion-parameter PTLM roughly three times the size of GPT-3, was trained on 4480 A100 GPUs yet yields performance improvements of only about 1% over GPT-3. Continually increasing the size of PTLMs and pre-training corpora to improve generalization consumes significant additional resources and energy, with dire environmental consequences (Sharir et al., 2020). Further, such large-scale resource utilization and the costs associated with PTLMs create an uneven playing field for small organizations and universities, which operate under significant resource constraints. Hence, a crucial step towards developing responsible, fair, and Green AI (Schwartz et al., 2020) involves minimizing the inefficiencies and costs of training these models.

Significant efforts toward improving the efficiency of PTLMs have ventured in directions such as optimizing the model architecture (Chen et al., 2020; Gordon et al., 2020; Zafrir et al., 2021), modifications to the training pipeline (Izsak et al., 2021; Shen et al., 2022) and task (Schick and Schütze, 2021), sample-efficient masking techniques for improved convergence (Bitton et al., 2021), and leveraging contextual knowledge to reduce model size (Kaur et al., 2022). In this work, driven by the observation that the scale of the pre-training corpus contributes significantly to the training costs of PTLMs, we explore the feasibility of training PTLMs using highly informative subsets of the corpus. Recent studies have demonstrated the feasibility of informative data subset selection for efficient deep model training on images (Mirzasoleiman et al., 2020; Killamsetty et al., 2021a, b, c; Pooladzandi et al., 2022) in both supervised and semi-supervised settings. In light of this, the key question we attempt to answer is: Can we efficiently pre-train language models using highly informative subsets of the training corpus without compromising performance?

The first step in answering the above question is identifying informative (or representative) subsets of the underlying training corpus such that they maximize the representation of the remaining samples in the corpus. Intuitively, given a set of sentences, the subsequent addition of sentences similar to existing sentences in the set yields diminishing returns. More information can be gained by adding diverse, dissimilar sentences. While the classical subset selection problem is NP-hard, we can leverage the diminishing-gains property of submodular functions (Fujishige, 2005) and frame subset selection as a submodular maximization problem. Several recent works (Wei et al., 2015; Mirzasoleiman et al., 2020; Kothawade et al., 2021; Karanam et al., 2022; Maheshwari et al., 2020) have formulated the subset selection problem as that of maximizing a submodular objective. However, applying existing subset selection frameworks to PTLMs is non-trivial given the scale of corpora typically used for pre-training (e.g., Wikipedia and Common Crawl, consisting of hundreds of millions of sequences and billions of tokens). Most existing methods rely on per-sample gradients, which are expensive to compute, and to the best of our knowledge, none of the previous works have considered subset selection for such large datasets.

Our contributions: We propose the informative data subset selection task for efficient pre-training of PTLMs and present Ingenious, a framework for subset selection using submodular optimization (Section 3). We show how to overcome the scalability challenge for typical large-scale pre-training corpora and employ scalable sentence feature encoders to obtain individual data sample features relevant for subset selection. We also employ various engineering techniques to scalably select subsets from large-scale datasets (Section 3). We use Ingenious to pre-train BERT and GPT-2 and evaluate the performance of the resulting models on downstream tasks (Section 4). A rigorous empirical evaluation reveals that the models pre-trained with Ingenious retain up to ≈99% of the performance of the models pre-trained using the full dataset. Figure 1 summarizes the cost-savings vs. performance trade-off achieved by Ingenious for BERT pre-training. We also present thorough ablation studies revealing the impact of the various design choices and parameters involved. We further evaluate the models trained by Ingenious in terms of their knowledge retention capabilities and show how Ingenious can be used to accelerate pre-training of domain-specific language models such as BioBERT (Section 4.4). Finally, we discuss the inferences that can be drawn from our work and the limitations of our proposed framework, and lay out directions for further improvement (Section 5).


Figure 1: Cost-savings vs. performance trade-off achieved by Ingenious for BERT pre-training: we contrast the accuracy degradation with the cost savings compared to vanilla BERT pre-training on the entire dataset. We observe 4.35× cost savings with a 2.1% accuracy drop, and 2.33× cost savings with a 0.67% accuracy drop.

2 Related Work

Knowledge distillation and pruning based methods (Sanh et al., 2019; Jiao et al., 2020; Muhamed et al., 2021) pre-train a smaller variant of a PTLM (such as BERT) with reduced capacity, using the full model as the teacher network. Even though lighter versions such as DistilBERT (Sanh et al., 2019) retain ≈97% of the performance with up to 60% faster inference, the full PTLM still needs to be completely pre-trained before the lighter version can be distilled. Thus, the efficiency gains are restricted to fine-tuning and inference. Other methods prune the architecture by forcing low-magnitude weights to zero during pre-training (Chen et al., 2020; Gordon et al., 2020) as well as during fine-tuning (Zafrir et al., 2021).

Model architecture and training task optimizations: Schick and Schütze (2021) have shown that smaller PTLMs can achieve better performance by formulating the task input in cloze style. Izsak et al. (2021) proposed to optimize BERT pre-training through multiple optimizations related to data, model size, and optimizer choice. Shen et al. (2022) proposed a staged training mechanism in which a relatively smaller model is trained first and then used to initialize the full-capacity model at a later stage. Yao et al. (2022) identify relevant samples from the pre-training corpus based on their similarity with the task-specific dataset to train a task-specific PTLM followed by fine-tuning, thus inherently suffering from the limitation of pre-training a separate model for every downstream task.

Curriculum learning based methods employ the sequence length of training samples as a proxy for hardness. Typically, shorter (easier) sequences are presented in the initial stages of pre-training, followed by longer (harder) sequences at later stages (Nagatsuka et al., 2021; Li et al., 2022). However, such methods have been shown to perform well only in limited configurations with respect to the choice of language model, stage of pre-training, etc.

Hardware optimizations for PTLM training: The suite of Open Pre-trained Transformers (OPT) (Zhang et al., 2022) requires 1/7th of the carbon footprint for pre-training compared to popular PTLMs such as GPT-3 (Brown et al., 2020), while achieving comparable few-shot generalization. OPT models leverage extensive data and tensor parallelism with high-memory GPUs (supporting large batch sizes), which are usually not easily accessible and can incur exorbitant costs.

Noticeably different from the aforementioned works, we explore making PTLM training more efficient by utilizing highly informative subsets of the training data. Consequently, our proposal effectively complements other optimization methods that target aspects such as model architecture and hardware enhancements.


Figure 2: Ingenious framework for informative data subset selection to pre-train language models. We warm-start pre-training for W steps to enable the model to learn useful representations (step A). Owing to the size of the pre-training data, we divide the total number of samples (n) into P partitions (step B₁), followed by selecting instances according to submodular gains (step B₂) through probabilistic sampling (step C). We obtain a subset (of total size k) of representative samples from each partition, and the subset is updated periodically (step D) after every R steps of training on the selected subset.

3 The Ingenious Framework

We now present Ingenious, an informative data subset selection framework for pre-training language models. We summarize the training pipeline in Figure 2. We first describe the notation used to formulate the problem, followed by details of the different steps involved in the framework.

3.1 Notation

We denote the unlabeled dataset for pre-training by $\mathcal{U} = \{x_j\}_{j=1}^{n}$, consisting of $n$ data points, each corresponding to a variable-length sequence of symbols $\{s_i\}_{i=1}^{m}$ (these symbols could be words or character sequences such as sub-words). Let $\mathcal{S} \subseteq \mathcal{U}$ be the subset of the unlabeled dataset on which the language model is trained, and let the language model be parameterized by $\bm{\theta}$. We subscript changing variables, such as the model parameters $\bm{\theta}$ and the subset $\mathcal{S}$, with the timestep $t$ to denote their values at that timestep.

3.2 Problem Formulation

In its most general form, subset selection is defined as

$$\mathcal{S}_t = \operatorname*{arg\,max}_{\mathcal{S} \subseteq \mathcal{U}} f(\mathcal{S}) \tag{1}$$

where the subset $\mathcal{S}_t \subseteq \mathcal{U}$ at step $t$ is selected such that it maximizes the function $f$.

While the above general subset selection problem is NP-hard, it becomes approximable when the function $f$ is submodular (Fujishige, 2005). A set function $f: 2^{\mathcal{U}} \to \mathbb{R}$ is submodular if, for every $x \in \mathcal{U}$, $f(\mathcal{A} \cup \{x\}) - f(\mathcal{A}) \geq f(\mathcal{B} \cup \{x\}) - f(\mathcal{B})$ for all $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{U}$ with $x \notin \mathcal{B}$. We pose the data subset selection problem as a submodular maximization problem since it admits efficient approximation algorithms (Nemhauser et al., 1978; Iyer and Bilmes, 2019). To choose a suitable submodular function, one must understand the characteristics of the subsets that are crucial for the end goal, which in our case is efficient learning. Previous works in computer vision have demonstrated that commonly used vision datasets contain many redundancies, and eliminating such redundant data samples does not affect the model's performance (Birodkar et al., 2019; Toneva et al., 2019; Paul et al., 2021; Sorscher et al., 2022). Further, one can achieve faster model training by using highly informative and representative data subsets (Kaushal et al., 2019; Mirzasoleiman et al., 2020; Sorscher et al., 2022). Please refer to Appendix B for more related work on submodularity-based subset selection. Building upon these learnings from computer vision research, our primary requirement for the selected subset is that it should faithfully represent the training data and have minimal redundancy within itself.
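The diminishing-returns property that makes greedy subset selection effective can be illustrated with a simple coverage function, a standard example of a submodular function. The sketch below is ours and purely illustrative (the sets and names are not from the framework):

```python
def coverage(subsets, selected):
    """f(A) = number of distinct elements covered by the chosen subsets."""
    covered = set()
    for i in selected:
        covered |= subsets[i]
    return len(covered)

# Each "sentence" covers some set of topics; adding a sentence similar to
# ones already selected contributes little new coverage.
subsets = [{1, 2, 3}, {2, 3}, {4, 5}]

# The marginal gain of subset 1 shrinks as the selected set grows:
gain_at_empty = coverage(subsets, [1]) - coverage(subsets, [])        # adds {2, 3}
gain_at_larger = coverage(subsets, [0, 1]) - coverage(subsets, [0])   # adds nothing new
assert gain_at_empty >= gain_at_larger  # diminishing returns
```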

3.3 Overview of Approach

In order to select a representative subset as discussed above, we use Facility Location (Salhi, 1991; Krause and Golovin, 2014), a commonly used submodular function closely related to $k$-medoid clustering, which is defined as

$$f_{FL}(\mathcal{A}) = \sum_{i \in \mathcal{U}} \max_{j \in \mathcal{A}} \mathcal{K}_{ij} \tag{2}$$

where $\mathcal{A}$ is the subset being evaluated, $\mathcal{K}$ is a pairwise similarity matrix, and $\mathcal{K}_{ij}$ is the similarity between the $i$-th and $j$-th samples. Thus, our subset selection problem can be represented as:

$$\mathcal{S}_t = \operatorname*{arg\,max}_{\mathcal{S} \subseteq \mathcal{U} : |\mathcal{S}| = k} f_{FL}(\mathcal{S}) \tag{3}$$

Here, $k$ represents the size of the subset $\mathcal{S}$. We would like to clarify that Equation (3) selects diverse samples such that each represents other samples in the corpus, rather than selecting similar samples. The optimization problem in Equation (3) is an instance of cardinality-constrained monotone submodular maximization, for which an approximate solution can be obtained by incrementally building the subset from scratch using algorithms such as Naive Greedy (Nemhauser et al., 1978), Lazy Greedy (Minoux, 1978), Stochastic Greedy (Mirzasoleiman et al., 2015), and Lazier-than-Lazy Greedy (Mirzasoleiman et al., 2015). We use the Lazier-than-Lazy Greedy optimizer, along with memoization (Iyer and Bilmes, 2019), as it is the most computationally efficient.
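As an illustration, a stochastic ("lazier than lazy") greedy maximizer of the facility location objective over a precomputed similarity kernel might be sketched as follows. This is our simplified sketch; it omits the memoization and lazy evaluations used in practice:

```python
import math
import random

import numpy as np

def stochastic_greedy_fl(K, k, eps=0.1, seed=0):
    """Select k indices (approximately) maximizing facility location over K.

    At each step, the best candidate is chosen from a random subsample of
    size (n/k) * log(1/eps) instead of scanning the full ground set.
    """
    rng = random.Random(seed)
    n = K.shape[0]
    sample_size = max(1, int((n / k) * math.log(1.0 / eps)))
    selected, gains = [], []
    best_sim = np.zeros(n)  # max similarity of each point to the current subset
    for _ in range(k):
        remaining = [i for i in range(n) if i not in selected]
        candidates = rng.sample(remaining, min(sample_size, len(remaining)))
        best_e, best_gain = None, -1.0
        for e in candidates:
            # Marginal gain of adding e: sum_i max(0, K[i, e] - best_sim[i])
            gain = np.maximum(K[:, e] - best_sim, 0.0).sum()
            if gain > best_gain:
                best_e, best_gain = e, gain
        selected.append(best_e)
        gains.append(best_gain)
        best_sim = np.maximum(best_sim, K[:, best_e])
    return selected, gains

# Toy usage on a random cosine-similarity kernel.
rng_np = np.random.default_rng(1)
X = rng_np.normal(size=(20, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)
selected, gains = stochastic_greedy_fl(X @ X.T, k=5)
assert len(set(selected)) == 5 and all(g >= 0 for g in gains)
```

Storing the per-element gains at selection time, as the `gains` list does here, is what later allows the gains to be reused as importance scores.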

The facility location function utilizes a pairwise similarity kernel $\mathcal{K}$ (of size $|\mathcal{U}| \times |\mathcal{U}|$) between the data samples in $\mathcal{U}$ to select representative subsets. To estimate the kernel values, we compute the cosine similarity between the feature representations of data samples obtained using the LM itself. To ensure that the extracted representations are meaningful during the initial phase, we warm-start the model for W training steps, as suggested by Killamsetty et al. (2021a, c) (step A in Figure 2). Further, to ensure that the LM sees diverse data samples, we probabilistically sample data points based on the submodular ordering obtained from running the greedy algorithm (steps B and C in Figure 2) and update the subset after every $R$-th iteration (step D in Figure 2).

This re-sampling procedure is repeated until the pre-determined number of steps is reached. Algorithm 1 summarizes the steps involved, and in the following section we describe the details of each step.

Input: Training dataset 𝒰; initial model parameters θ₀; total number of training steps T; training-step interval for subset selection R; number of steps for the warm-start phase W; subset size k; learning rates {αₜ} for t = 0, …, T−1

    Set t = 0; optimizer = AdamW()

    *** Warm-start Phase ***
    repeat
        Compute batches 𝒰_b = ((x_b, y_b); b ∈ 1⋯B) from 𝒰
        for b = 1 to B do
            if t ≥ W then break
            Compute mask m_t on 𝒰_b
            θ_{t+1} = optimizer.step()
            t = t + 1
    until t ≥ W

    *** Subset Selection ***
    greedyIdxs, gains = argmax_{|S| ≤ |𝒰|} f_FL(S, 𝒰, θ_t)
    probabilities = TaylorSoftmax(gains)
    𝒮_t ∼ sample(greedyIdxs, probabilities, k)

    repeat
        Compute batches 𝒮_{t,b} = ((x_b, y_b); b ∈ 1⋯B) from 𝒮_t
        for b = 1 to B do
            if t ≥ T then break
            Compute mask m_t on 𝒮_{t,b}
            θ_{t+1} = optimizer.step()
            t = t + 1
            if t % R == 0 then
                𝒮_{t+1} ∼ sample(greedyIdxs, probabilities, k)
                break
    until t ≥ T

    *** Evaluate trained model on validation set ***
    eval = evaluate(θ_T, 𝒱)
    return eval, θ_T

Algorithm 1: Pre-Training using Ingenious

3.4 Methodology Details

Feature Encoders for Similarity Computation: The selection of optimal representative subsets requires a similarity kernel that captures the intrinsic relationships between data samples. We explore dense and sparse feature encoders for obtaining the feature representations of text samples in $\mathcal{U}$. As a dense feature encoder for text samples, we use the intermediate representations obtained from the LM that is currently being trained. We compute the representation of an input sequence by averaging the output embeddings of its constituent tokens. A question then arises as to which layer of the underlying model should be used for obtaining this representation, since different layers encode different types of information (Rogers et al., 2020). Another possibility is to use sparse representations such as TF-IDF (Aizawa, 2003), owing to its success at capturing statistically important lexical features (Robertson et al., 2009). We study the effect of using sparse feature representations (i.e., TF-IDF) and dense feature representations obtained from different layers of the LM in Section 4.3. Our experiments reveal that dense feature encoders yield the best results.
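A minimal sketch of how such dense features and the resulting cosine-similarity kernel could be computed, using toy arrays in place of actual LM outputs (the function names are ours, not from the framework):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over the sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len).
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

def cosine_kernel(features):
    """Pairwise cosine-similarity matrix from (n, dim) features."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    return normed @ normed.T

# Toy batch: 3 "sentences", 5 tokens each, 8-dim embeddings (stand-ins for
# an LM layer's output embeddings and the tokenizer's attention mask).
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 5, 8))
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]])
K = cosine_kernel(mean_pool(emb, mask))
assert K.shape == (3, 3)
assert np.allclose(np.diag(K), 1.0)  # every sample is maximally similar to itself
```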

Submodular Greedy Ordering based Data Selection: Having decided on the choice of similarity kernel, we now describe how to select the subsets (steps B and C in Figure 2) as defined by Equation (3). Approximate submodular maximization algorithms such as Lazier-than-Lazy Greedy start with an empty subset and incrementally add data points one by one until the size of the subset equals the budget $k$. If $\mathcal{S}$ represents the subset selected so far, and $e$ represents the next locally optimal data sample to be added, the submodular gain of $e$ is defined as $f(\mathcal{S} \cup \{e\}) - f(\mathcal{S})$. While running the algorithm, we initially set the budget to the size of the entire dataset (say $M$) in order to obtain and store the submodular gain (step B₂ in Figure 2) of each data sample at the time of its addition.

The key idea here is to use the submodular gain associated with each data sample as an importance score, convert the gains to a probability distribution using the second-order Taylor-softmax operation (de Brébisson and Vincent, 2016) (step C in Figure 2), and then sample a subset of the desired size (say $k$) from this distribution. Given a gains vector $\{g_1, g_2, \cdots, g_M\}$, the Taylor-softmax operation converting it into a probability distribution $P$ can be specified as

$$P \overset{\mathrm{def}}{=} \left\{ \frac{1 + g_i + 0.5\,g_i^2}{\sum_{j=1}^{M} \left(1 + g_j + 0.5\,g_j^2\right)} \right\}_{i=1}^{M}$$

Using the probability distribution $P$ for sampling ensures that samples with high importance scores are selected with greater probability. However, it also allows the LM to explore samples with low importance scores during training, which helps prevent overfitting. We reuse this probability distribution to sample a new subset of size $k$ every $R$ steps by sampling $k$ points without replacement (step D in Figure 2).
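The gains-to-probabilities conversion and the periodic re-sampling step can be sketched as follows (an illustrative sketch of ours; the function names are not from the codebase):

```python
import numpy as np

def taylor_softmax(gains):
    """Second-order Taylor-softmax: p_i ∝ 1 + g_i + 0.5 * g_i**2."""
    g = np.asarray(gains, dtype=float)
    scores = 1.0 + g + 0.5 * g**2
    return scores / scores.sum()

def resample_subset(indices, probs, k, rng):
    """Draw k points without replacement from the fixed distribution."""
    return rng.choice(indices, size=k, replace=False, p=probs)

rng = np.random.default_rng(0)
gains = [3.2, 1.1, 0.4, 0.1, 0.05]   # submodular gains, largest first
probs = taylor_softmax(gains)
assert np.isclose(probs.sum(), 1.0)
assert probs[0] == probs.max()       # highest gain gets highest probability

# Re-sampling every R steps reuses the same distribution:
subset = resample_subset(np.arange(5), probs, k=3, rng=rng)
assert len(set(subset.tolist())) == 3  # no repeats within a subset
```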

Recall that we require a similarity kernel of size $|\mathcal{U}| \times |\mathcal{U}|$; for pre-training corpora of typical scale, the memory required to store such a kernel makes direct computation practically infeasible. We now describe how we scale Ingenious to handle the sizes of the pre-training datasets used for LMs.

Partitioning based Efficient Subset Selection: To minimize memory consumption, instead of constructing a probability distribution over the entire unlabeled set directly, we first partition (step B₁ in Figure 2) the unlabeled set into $N_P$ random blocks of equal size (i.e., each partition has size $\frac{|\mathcal{U}|}{N_P}$) and construct a probability distribution $P_i$ over each data block $\mathcal{U}_i^p$, where $|\mathcal{U}_i^p| = \frac{|\mathcal{U}|}{N_P}$. We then use the constructed distribution $P_i$ over each data block $\mathcal{U}_i^p$ to sample a subset of size $k / N_P$ from that block without replacement. We compute the final subset as the union of the per-partition subsets:

$$\mathcal{S}_t = \bigcup_{i=1}^{N_P} \operatorname{sample}\left(\mathcal{U}_i^p, P_i, \frac{k}{N_P}\right) \tag{4}$$

Partitioning the unlabeled set allows us to construct similarity kernels of size $\frac{|\mathcal{U}|}{N_P} \times \frac{|\mathcal{U}|}{N_P}$ instead, thereby reducing the similarity-kernel memory usage by a factor of about $N_P^2$. We discuss the effect of the partition size in Appendix K. To maximize the utilization of available resources, we can construct the probability distributions over the blocks of the data partition in parallel. As in recent work (Mittal et al., 2022), the partitioned facility location objective can be shown to be a lower bound on the original facility location objective being maximized. It should be noted that memory utilization also increases with the number of parallel processes: when subsets are selected from $N_{PP}$ partitions in parallel, the memory usage due to similarity kernels is of the order $\mathcal{O}\left(N_{PP} \frac{|\mathcal{U}|^2}{N_P^2}\right)$. In our experiments, we use $N_{PP} = 100$ processes.
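Putting the partitioning and per-partition sampling together, the selection in Equation (4) might be sketched as below. The `gains_fn` here is a stand-in for running the greedy algorithm within each block; everything else in the sketch is ours:

```python
import numpy as np

def partitioned_select(n, num_partitions, k, gains_fn, rng):
    """Union of per-partition samples, as in Equation (4).

    gains_fn(block) returns a submodular-gain score per index in the block;
    it stands in for running greedy facility location within the partition.
    """
    perm = rng.permutation(n)
    blocks = np.array_split(perm, num_partitions)  # random equal-size blocks
    per_block = k // num_partitions
    selected = []
    for block in blocks:
        g = gains_fn(block)
        scores = 1.0 + g + 0.5 * g**2              # Taylor-softmax numerator
        probs = scores / scores.sum()
        selected.append(rng.choice(block, size=per_block,
                                   replace=False, p=probs))
    return np.concatenate(selected)

rng = np.random.default_rng(0)
fake_gains = lambda block: rng.random(len(block))  # placeholder gain scores
subset = partitioned_select(n=1000, num_partitions=10, k=200,
                            gains_fn=fake_gains, rng=rng)
assert subset.shape == (200,)
assert len(np.unique(subset)) == 200   # disjoint blocks, so no duplicates
```

Because the blocks are disjoint, the per-partition draws can run as independent parallel processes, which is the parallelism described above.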

4 Experiments and Results

We use BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and a domain-specific version of BERT, BioBERT (Lee et al., 2020), as the underlying LMs. Specifically, we use BERT-Base (110M) and GPT2-Small (124M). For BERT, we use English Wikipedia in conjunction with BooksCorpus as the pre-training corpora and employ the MLM and NSP tasks for pre-training, following the details in the work of Devlin et al. (2019). For vanilla BERT, we pre-train with a batch size of 1024 for 1,000,000 steps. We perform ablations over data subset sizes and the number of pre-training steps for Ingenious-enabled pre-training and find a subset size of 25% (Appendix J) with 250,000 pre-training steps (25%) to be an optimal choice. We set the value of R to 25,000 steps. We refer the reader to Appendix G for further implementation details. For Ingenious-enabled pre-training of BioBERT and GPT-2, we discuss the implementation details and experimental results in Sections 4.4 and 4.5, respectively.

4.1 Ingenious for BERT Pre-training

Table 1: Comparison of Ingenious with vanilla pre-training (full 1M steps) and other baselines for BERT. We report fine-tuning performance on the GLUE benchmark and on the CoLA task in GLUE, averaged over 20 runs. Statistically significant improvements (as measured by a one-tailed t-test at the 99% significance level) over baselines B1, B2, and B3 are indicated by superscripts 1, 2, and 3, respectively. Numbers in brackets denote the difference relative to the vanilla variant. We report metrics for Ingenious and the baselines after 250K pre-training steps. Please refer to Appendix F for task-wise scores and Appendix E for validation-set losses during the course of pre-training.

We consider two leagues of pre-trained models, viz., (i) BERT pre-trained on subsets selected through Ingenious and (ii) vanilla BERT pre-trained fully up to 1 million steps. We contrast these by fine-tuning each on the commonly used GLUE benchmark (Wang et al., 2019) and reporting the performance of each. Further, we compare Ingenious against three baselines: B1) Random Selection, obtained by pre-training BERT on a randomly sampled subset of the same size as that selected by Ingenious; B2) Early Stopping, BERT pre-training stopped at 250K steps, with the resulting checkpoint used for evaluation; and B3) Loss-Based Sampling (Loshchilov and Hutter, 2016), obtained by pre-training BERT on a subset of the same size as those selected by Ingenious, sampled from a probability distribution constructed by ranking the losses in descending order and allocating high-rank (high-loss) samples greater probability than low-rank (low-loss) samples. We would like to emphasise that we choose baselines B1 and B3 owing to their relevance to making LM pre-training efficient through data optimization. Table 1 reports the GLUE score on the dev sets, averaged over 20 runs, after 250K pre-training steps for the pre-trained models obtained by the different methods.
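Baseline B3's rank-based distribution can be sketched as follows. This is a hypothetical illustration: the geometric `decay` constant is our own choice, not a value taken from Loshchilov and Hutter (2016) or from the paper.

```python
def loss_rank_probabilities(losses, decay=0.99):
    """Rank samples by loss (descending) and assign geometrically larger
    sampling probability to higher-loss (higher-rank) samples.
    The decay constant is illustrative, not the paper's setting."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    weights = [decay ** rank for rank in range(len(losses))]
    total = sum(weights)
    probs = [0.0] * len(losses)
    for rank, idx in enumerate(order):
        probs[idx] = weights[rank] / total
    return probs
```

Any monotone rank-to-weight mapping would realize the same idea: high-loss samples are drawn more often, low-loss samples less often.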

Image 3: Refer to caption

Image 4: Refer to caption

Figure 3: Comparison of Ingenious with vanilla BERT on GLUE performance vs. pre-training steps (top) and cost (bottom) using checkpoints obtained at intermediate pre-training stages.

We observe that despite using only a subset of the training data and being trained for only 250K steps, Ingenious achieves 98.6% of the performance of the vanilla fully pre-trained BERT. Further, Ingenious achieves statistically significant improvements over the three baselines (B1, B2, and B3). Ingenious also outperforms baseline B3, which prioritizes training the BERT model on samples with high loss. Prioritizing high-loss samples may result in overfitting, which may explain the poor fine-tuning performance of baseline B3 on GLUE tasks compared to baseline B2. Therefore, Ingenious selects informative subsets that not only improve BERT pre-training convergence but also help retain its generalization capabilities. Further, we observe that extending Ingenious training to 400K steps yields 99.1% of the performance of vanilla BERT. We would like to highlight that most of the downstream-task performance achieved by a PTLM is due to the initial stages of pre-training, with most of the later pre-training resulting in up to ~1% improvement (Smith et al., 2022). In this context, Ingenious helps achieve these later-stage performance gains relatively early. Finally, we would like to highlight that Ingenious performs significantly better than the baselines on the CoLA task (Table 1), which is deemed the most difficult task in the GLUE benchmark (Geiping and Goldstein, 2022). This implies that the subsets selected by Ingenious capture the important and highly informative signals from the underlying data, resulting in robust performance on challenging tasks as well.

Further, to compare the different methods at different stages of pre-training, we obtain the corresponding checkpoints and fine-tune them on the GLUE tasks. For this setting, we present a comparison of vanilla BERT pre-training against Ingenious in Figure 3. We plot the downstream performance for all the methods: Ingenious shows better performance than all the baselines at 250K pre-training steps, and beyond 250K steps the trend continues consistently (Figure 3, top). Also, pre-training on informative subsets enables BERT to reach, at 250K steps, a performance level that vanilla pre-training achieves only after over 350K iterations. Similarly, for any given pre-training cost, Ingenious yields a better GLUE score than the baselines (Figure 3, bottom). Further, we observe that Ingenious consistently outperforms the baselines even when training is extended to 1 million steps (the maximum number of training steps prescribed by Devlin et al. (2019) for vanilla BERT pre-training), as shown in Figure 4.

Image 5: Refer to caption

Figure 4: Ingenious is found to outperform the baselines even with extended training up to 1M steps

Effectiveness of Importance Sampling: We also evaluated a variant where samples are selected greedily based on submodular ranking instead of importance sampling over submodular gains. In contrast to the 81.57 achieved by Ingenious, this variant achieved an average GLUE score of 80.5 after 250K pre-training steps, highlighting the effectiveness of importance sampling.
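To make the contrast between the two pick rules concrete, the sketch below implements naive (non-lazy) facility-location selection with both: deterministic greedy versus importance sampling proportional to the marginal gains. Ingenious itself runs far more scalable greedy variants over partitioned kernels, so treat this purely as a toy illustration.

```python
import random

def facility_location_select(sim, k, sample=False, seed=0):
    """Pick k items to maximize f(S) = sum_j max_{i in S} sim[i][j].
    sample=False: classic greedy (always take the largest marginal gain).
    sample=True: importance-sample the next item in proportion to its gain."""
    rng = random.Random(seed)
    n = len(sim)
    covered = [0.0] * n                  # current best similarity per column
    selected, remaining = [], set(range(n))
    for _ in range(k):
        # marginal gain of adding i: how much coverage each column improves
        gains = {i: sum(max(sim[i][j] - covered[j], 0.0) for j in range(n))
                 for i in sorted(remaining)}
        if sample and sum(gains.values()) > 0:
            ids = list(gains)
            pick = rng.choices(ids, weights=[gains[i] for i in ids])[0]
        else:
            pick = max(gains, key=gains.get)
        selected.append(pick)
        remaining.discard(pick)
        covered = [max(covered[j], sim[pick][j]) for j in range(n)]
    return selected
```

With `sample=True`, high-gain items remain likely picks, but the stochasticity spreads selection over more of the corpus across rounds, which is the property the ablation above credits for the 81.57 vs. 80.5 gap.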

4.2 Knowledge Retention with Ingenious

Large PTLMs, when trained on a sufficiently large corpus, store various types of knowledge implicitly in their parameters (AlKhamissi et al., 2022). Since Ingenious uses only a subset of the whole data for pre-training, it is natural for the resulting model to contain less knowledge in its parameters, but how does it compare with vanilla BERT pre-training and the other baselines when it comes to knowledge retention? To answer this question, we use the LAMA benchmark (Petroni et al., 2019), a probe designed to analyze the factual knowledge present in PTLMs. LAMA is derived from four distinct knowledge sources (Google-RE, T-REx, ConceptNet, and SQuAD), from which cloze sentences are created using facts contained in the respective knowledge sources. The PTLM has to predict the fact tokens in place of the mask tokens in the cloze sentences. Table 2 summarizes the results. We note that Ingenious suffers minimal loss in knowledge retention with respect to fully pre-trained vanilla BERT on all tasks. Further, its decrease in performance is smaller than that of the baselines, which suffer a more severe drop for most tasks. Intuitively, we attribute this to the ability of Ingenious to select highly informative subsets from the corpus while excluding redundant information.
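For reference, the P@1 metric reported below simply asks whether the model's single top-ranked prediction for the masked slot equals the gold fact token. A minimal scorer (hypothetical function name, not the LAMA codebase) might look like:

```python
def precision_at_1(top_predictions, gold_tokens):
    """LAMA-style P@1: fraction of cloze queries whose top-ranked
    predicted token exactly matches the gold fact token."""
    assert len(top_predictions) == len(gold_tokens)
    hits = sum(pred == gold for pred, gold in zip(top_predictions, gold_tokens))
    return hits / len(gold_tokens)
```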

Table 2: Knowledge retention of different models as measured by the LAMA probe. We report P@1 scores for all four subtasks in LAMA.

4.3 Effect of Embedding Representations

Table 3: Ablation study by varying embedding representation for selecting subsets. We report mean GLUE score to compare Ingenious variants.

Different BERT layers have been shown to capture different information: lower layers capture word order (Rogers et al., 2020), middle layers capture syntactic information (Hewitt and Manning, 2019; Jawahar et al., 2019), and later layers capture task-specific information (Kovaleva et al., 2019; Hao et al., 2019). We vary the layer (3, 6, 9, and 12) used to obtain features for subset selection and report the performance on GLUE in Table 3. We observe that layer-9 features yield the best results. Further, in Table 3, we compare the effect of using TF-IDF sample representations and contrast them against dense features (BERT layer 9). We observe that dense embeddings perform better than shallow TF-IDF features. We also report the effect of subset size and number of partitions in Appendices J and K.
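To illustrate the shallow baseline, a self-contained TF-IDF featurizer plus the cosine similarity used to populate a similarity kernel could look like the sketch below. We use a simple tf·idf weighting for illustration; the exact variant behind the paper's TF-IDF ablation is an assumption here.

```python
import math
from collections import Counter

def tfidf_features(docs):
    """Sparse TF-IDF vectors: weight(t, d) = tf(t, d) * log(N / df(t)).
    Whitespace tokenization; one dict of term weights per document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    feats = []
    for d in docs:
        tf = Counter(d.split())
        feats.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return feats

def cosine(u, v):
    # cosine similarity between two sparse term-weight dicts
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

The dense variant replaces `tfidf_features` with per-sample embeddings taken from a chosen BERT layer (layer 9 above); the kernel construction is otherwise unchanged.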

Image 6: Refer to caption

(a) BC2GM (Smith et al., 2008) Dataset

Image 7: Refer to caption

(b) ChemProt (Taboureau et al., 2011) Dataset

Figure 5: Plots (a) and (b) show convergence results comparing average F1 score (over three runs) against wall-clock time for vanilla BioBERT and BioBERT pre-trained using Ingenious with a 25% subset. We observe that Ingenious achieves much faster convergence than vanilla BioBERT (i.e., full training).

4.4 Ingenious for Domain-Specific PTLM - BioBERT

We evaluate the performance of BioBERT (Lee et al., 2020) pre-trained on subsets selected through Ingenious and compare it with vanilla BioBERT by fine-tuning on biomedical datasets for the Named Entity Recognition (NER) and Relation Extraction (RE) tasks. For vanilla BioBERT, we start with a pre-trained BERT model and further pre-train it on the PubMed abstracts dataset for 200,000 steps (as recommended by the original study). Please refer to Appendix I for further implementation details. We present the convergence plots of vanilla BioBERT vs. Ingenious with a subset size of 25%, against training time, in Figure 5. During the initial stages of pre-training, Ingenious performs similarly to vanilla pre-training, since the LM is still learning representations; however, once better representations for subset selection are learned, Ingenious converges faster with respect to pre-training time and reaches the best accuracy around 1.4x sooner.

Image 8: Refer to caption

(a) GLUE (Wang et al., 2019) Benchmark

Image 9: Refer to caption

(b) BBQ Lite (Srivastava et al., 2022) Dataset

Figure 6: Comparison of Ingenious with vanilla GPT-2 pre-training at different pre-training stages. Pre-training on Ingenious subsets enables GPT-2 to consistently achieve a better GLUE score.

4.5 Ingenious for GPT-2 Pre-training

We also pre-train GPT-2 (Radford et al., 2019) using Ingenious. We estimate the mean accuracy for GLUE fine-tuning (averaged over 20 runs) and the zero-shot accuracy on the BBQ Lite generative task. Please refer to Appendix H for implementation details. We plot the performance on these benchmarks (see Figure 6) against checkpoints from different pre-training stages. Figure 6 (left and right) shows that Ingenious consistently outperforms vanilla GPT-2 pre-training on GLUE and BBQ Lite, respectively, at different stages of pre-training, indicating better convergence.

5 Conclusions

We presented INGENIOUS, a framework for efficient pre-training of language models on highly informative data subsets, built around a submodular-optimization-based selection algorithm. We described how it can be scaled to language models and showed its effectiveness through rigorous empirical evaluation. Our future work will explore exploiting external knowledge bases to identify and reduce redundancies in the corpus, and will study multi-modal training, where redundant information can be spread across different modalities.

6 Limitations

In terms of limitations, the submodular maximization based on estimating pairwise sample similarities can be constrained by memory limits and might require high CPU RAM capacity. Further, we acknowledge that, owing to resource limitations, our experiments are performed on relatively smaller PTLMs compared to GPT-3, OPT, or PaLM. Within our resource constraints, we have tried our best to perform extensive experiments and ablation studies to inform our design choices.

7 Ethical Considerations

We believe that Ingenious has a significant positive impact on society since it makes pre-training of LMs compute efficient, thereby reducing CO2 emissions and energy costs. Nonetheless, the Ingenious framework is susceptible to biases and toxic words within the pre-training corpora as it relies on standard pre-training datasets. An exciting future direction of this research is to investigate whether we could use targeted subset selection to filter out toxic words, as well as phrases that promote cultural stereotypes and biases from the pre-training corpora before LM pre-training.

8 Acknowledgments and Disclosure of Funding

This work is supported by an Adobe Data Science Award, and by the National Science Foundation under Grant No. IIS-2106937 awarded to Rishabh Iyer. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Adobe or the National Science Foundation. Ganesh Ramakrishnan is grateful to the Adobe Award and Institute Chair Professorship Award at IIT Bombay for supporting this work.

References

  • Aizawa (2003) Akiko Aizawa. 2003. An information-theoretic perspective of tf–idf measures. Information Processing & Management, 39(1):45–65.
  • AlKhamissi et al. (2022) Badr AlKhamissi, Millicent Li, Asli Celikyilmaz, Mona Diab, and Marjan Ghazvininejad. 2022. A review on language models as knowledge bases.
  • Bach (2013) Francis Bach. 2013. Learning with submodular functions: A convex optimization perspective. Foundations and Trends® in Machine Learning, 6(2-3):145–373.
  • Bach (2019) Francis Bach. 2019. Submodular functions: from discrete to continuous domains. Mathematical Programming, 175(1):419–459.
  • Bilmes (2022) Jeff Bilmes. 2022. Submodularity in machine learning and artificial intelligence.
  • Birodkar et al. (2019) Vighnesh Birodkar, Hossein Mobahi, and Samy Bengio. 2019. Semantic redundancies in image-classification datasets: The 10% you don’t need. ArXiv, abs/1901.11409.
  • Bitton et al. (2021) Yonatan Bitton, Michael Elhadad, Gabriel Stanovsky, and Roy Schwartz. 2021. Data efficient masked language modeling for vision and language. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3013–3028, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.
  • Campbell and Broderick (2018) Trevor Campbell and Tamara Broderick. 2018. Bayesian coreset construction via greedy iterative geodesic ascent. In International Conference on Machine Learning, pages 698–706.
  • Chen et al. (2020) Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. 2020. The lottery ticket hypothesis for pre-trained bert networks. In Advances in Neural Information Processing Systems, volume 33, pages 15834–15846. Curran Associates, Inc.
  • de Brébisson and Vincent (2016) Alexandre de Brébisson and Pascal Vincent. 2016. An exploration of softmax alternatives belonging to the spherical loss family. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Feldman (2020) Dan Feldman. 2020. Core-Sets: Updated Survey, pages 23–44. Springer International Publishing, Cham.
  • Fujishige (2005) Satoru Fujishige. 2005. Submodular functions and optimization. Elsevier.
  • Geiping and Goldstein (2022) Jonas Geiping and Tom Goldstein. 2022. Cramming: Training a language model on a single gpu in one day. arXiv preprint arXiv:2212.14034.
  • Gordon et al. (2020) Mitchell Gordon, Kevin Duh, and Nicholas Andrews. 2020. Compressing BERT: Studying the effects of weight pruning on transfer learning. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 143–155, Online. Association for Computational Linguistics.
  • Hao et al. (2019) Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2019. Visualizing and understanding the effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4143–4152, Hong Kong, China. Association for Computational Linguistics.
  • Har-Peled and Mazumdar (2004) Sariel Har-Peled and Soham Mazumdar. 2004. On coresets for k-means and k-median clustering. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 291–300.
  • Hewitt and Manning (2019) John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Iyer and Bilmes (2019) Rishabh Iyer and Jeffrey Bilmes. 2019. A memoization framework for scaling submodular optimization to large scale problems. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2340–2349. PMLR.
  • Iyer et al. (2021) Rishabh Iyer, Ninad Khargoankar, Jeff Bilmes, and Himanshu Asanani. 2021. Submodular combinatorial information measures with applications in machine learning. In Algorithmic Learning Theory, pages 722–754. PMLR.
  • Iyer (2015) Rishabh Krishnan Iyer. 2015. Submodular optimization and machine learning: Theoretical results, unifying and scalable algorithms, and applications. Ph.D. thesis.
  • Izsak et al. (2021) Peter Izsak, Moshe Berchansky, and Omer Levy. 2021. How to train BERT with an academic budget. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10644–10652, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Jawahar et al. (2019) Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.
  • Jiao et al. (2020) Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, Online. Association for Computational Linguistics.
  • Kahn (2021) Jeremy Kahn. 2021. A.I.’s carbon footprint is big, but easy to reduce, Google researchers say. https://fortune.com/2021/04/21/ai-carbon-footprint-reduce-environmental-impact-of-tech-google-research-study/. [Online; accessed 19-September-2022].
  • Karanam et al. (2022) Athresh Karanam, Krishnateja Killamsetty, Harsha Kokel, and Rishabh K Iyer. 2022. ORIENT: Submodular mutual information measures for data subset selection under distribution shift. In Thirty-Sixth Conference on Neural Information Processing Systems.
  • Kaur et al. (2022) Jivat Kaur, Sumit Bhatia, Milan Aggarwal, Rachit Bansal, and Balaji Krishnamurthy. 2022. LM-CORE: Language models with contextually relevant external knowledge. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 750–769, Seattle, United States. Association for Computational Linguistics.
  • Kaushal et al. (2019) Vishal Kaushal, Rishabh Iyer, Suraj Kothawade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakrishnan. 2019. Learning from less data: A unified data subset selection and active learning framework for computer vision. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1289–1299. IEEE.
  • Kaushal et al. (2022) Vishal Kaushal, Ganesh Ramakrishnan, and Rishabh Iyer. 2022. Submodlib: A submodular optimization library. arXiv preprint arXiv:2202.10680.
  • Killamsetty et al. (2022) Krishnateja Killamsetty, Guttu Sai Abhishek, Aakriti Lnu, Ganesh Ramakrishnan, Alexandre V. Evfimievski, Lucian Popa, and Rishabh K Iyer. 2022. AUTOMATA: Gradient based data subset selection for compute-efficient hyper-parameter tuning. In Thirty-Sixth Conference on Neural Information Processing Systems.
  • Killamsetty et al. (2021a) Krishnateja Killamsetty, Durga S, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. 2021a. Grad-match: Gradient matching based data subset selection for efficient deep model training. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5464–5474. PMLR.
  • Killamsetty et al. (2021b) Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer. 2021b. Glister: Generalization based data subset selection for efficient and robust learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):8110–8118.
  • Killamsetty et al. (2021c) Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, and Rishabh K Iyer. 2021c. RETRIEVE: Coreset selection for efficient and robust semi-supervised learning. In Advances in Neural Information Processing Systems.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kirchhoff and Bilmes (2014) Katrin Kirchhoff and Jeff Bilmes. 2014. Submodularity for data selection in machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 131–141.
  • Kothawade et al. (2021) Suraj Kothawade, Nathan Beck, Krishnateja Killamsetty, and Rishabh Iyer. 2021. Similar: Submodular information measures based active learning in realistic scenarios. In Advances in Neural Information Processing Systems, volume 34, pages 18685–18697. Curran Associates, Inc.
  • Kothawade et al. (2020) Suraj Kothawade, Jiten Girdhar, Chandrashekhar Lavania, and Rishabh Iyer. 2020. Deep submodular networks for extractive data summarization. arXiv preprint arXiv:2010.08593.
  • Kovaleva et al. (2019) Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.
  • Krause and Golovin (2014) Andreas Krause and Daniel Golovin. 2014. Submodular function maximization. Tractability, 3:71–104.
  • Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
  • Li et al. (2022) Conglong Li, Minjia Zhang, and Yuxiong He. 2022. Curriculum learning: A regularization method for efficient and stable billion-scale GPT model pre-training.
  • Loshchilov and Hutter (2016) I Loshchilov and F Hutter. 2016. Online batch selection for faster training of neural networks. In International Conference on Learning Representations (ICLR) 2016 Workshop Track.
  • Maheshwari et al. (2020) Ayush Maheshwari, Oishik Chatterjee, Krishnateja Killamsetty, Ganesh Ramakrishnan, and Rishabh Iyer. 2020. Semi-supervised data programming with subset selection. arXiv preprint arXiv:2008.09887.
  • Minoux (1978) Michel Minoux. 1978. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization techniques, pages 234–243. Springer.
  • Mirzasoleiman et al. (2015) Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. 2015. Lazier than lazy greedy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29.
  • Mirzasoleiman et al. (2020) Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. 2020. Coresets for data-efficient training of machine learning models. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6950–6960. PMLR.
  • Mittal et al. (2022) Ashish Mittal, Durga Sivasubramanian, Rishabh Iyer, Preethi Jyothi, and Ganesh Ramakrishnan. 2022. Partitioned gradient matching-based data subset selection for compute-efficient robust asr training. In Findings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022, Abu Dhabi.
  • Muhamed et al. (2021) Aashiq Muhamed, Iman Keivanloo, Sujan Perera, James Mracek, Yi Xu, Qingjun Cui, Santosh Rajagopalan, Belinda Zeng, and Trishul Chilimbi. 2021. Ctr-bert: Cost-effective knowledge distillation for billion-parameter teacher models. In NeurIPS Efficient Natural Language and Speech Processing Workshop.
  • Nagatsuka et al. (2021) Koichi Nagatsuka, Clifford Broni-Bediako, and Masayasu Atsumi. 2021. Pre-training a BERT with curriculum learning by increasing block-size of input text. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 989–996, Held Online. INCOMA Ltd.
  • Nemhauser et al. (1978) George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. 1978. An analysis of approximations for maximizing submodular set functions—i. Mathematical programming, 14(1):265–294.
  • Paul et al. (2021) Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. 2021. Deep learning on a data diet: Finding important examples early in training. In Advances in Neural Information Processing Systems, volume 34, pages 20596–20607. Curran Associates, Inc.
  • Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.
  • Pooladzandi et al. (2022) Omead Pooladzandi, David Davini, and Baharan Mirzasoleiman. 2022. Adaptive second order coresets for data-efficient machine learning. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 17848–17869. PMLR.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67.
  • Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389.
  • Rogers et al. (2020) Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.
  • Salhi (1991) Saïd Salhi. 1991. Discrete location theory. Journal of the Operational Research Society, 42:1124–1125.
  • Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • Schick and Schütze (2021) Timo Schick and Hinrich Schütze. 2021. It’s not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339–2352, Online. Association for Computational Linguistics.
  • Schwartz et al. (2020) Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. 2020. Green ai. Communications of the ACM, 63(12):54–63.
  • Sharir et al. (2020) Or Sharir, Barak Peleg, and Yoav Shoham. 2020. The cost of training nlp models: A concise overview. arXiv preprint arXiv:2004.08900.
  • Shen et al. (2022) Sheng Shen, Pete Walsh, Kurt Keutzer, Jesse Dodge, Matthew E. Peters, and Iz Beltagy. 2022. Staged training for transformer language models. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 19893–19908. PMLR.
  • Smith et al. (2008) Larry Smith, Lorraine K. Tanabe, Rie Johnson nee Ando, Cheng-Ju Kuo, I.-Fang Chung, Chun-Nan Hsu, Yu-Shi Lin, Roman Klinger, Christoph M. Friedrich, Kuzman Ganchev, Manabu Torii, Hongfang Liu, Barry Haddow, Craig A. Struble, Richard J. Povinelli, Andreas Vlachos, William A. Baumgartner, Lawrence Hunter, Bob Carpenter, Richard Tzong-Han Tsai, Hong-Jie Dai, Feng Liu, Yifei Chen, Chengjie Sun, Sophia Katrenko, Pieter Adriaans, Christian Blaschke, Rafael Torres, Mariana Neves, Preslav Nakov, Anna Divoli, Manuel Maña-López, Jacinto Mata, and W. John Wilbur. 2008. Overview of BioCreative II gene mention recognition. Genome Biology, 9 Suppl 2:S2.
  • Smith et al. (2022) Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. 2022. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990.
  • Sorscher et al. (2022) Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari S. Morcos. 2022. Beyond neural scaling laws: beating power law scaling via data pruning. In Thirty-Sixth Conference on Neural Information Processing Systems.
  • Srivastava et al. (2022) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
  • Taboureau et al. (2011) Olivier Taboureau, Sonny Kim Nielsen, Karine Audouze, Nils Weinhold, Daniel Edsgärd, Francisco S. Roque, Irene Kouskoumvekaki, Alina Bora, Ramona Curpan, Thomas Skøt Jensen, Søren Brunak, and Tudor I. Oprea. 2011. ChemProt: a disease chemical biology database. Nucleic Acids Research, 39(Database issue):D367–372.
  • Tohidi et al. (2020) Ehsan Tohidi, Rouhollah Amiri, Mario Coutino, David Gesbert, Geert Leus, and Amin Karbasi. 2020. Submodularity in action: From machine learning to signal processing applications. IEEE Signal Processing Magazine, 37(5):120–133.
  • Toneva et al. (2019) Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. 2019. An empirical study of example forgetting during deep neural network learning. In ICLR.
  • Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
  • Wei et al. (2015) Kai Wei, Rishabh Iyer, and Jeff Bilmes. 2015. Submodularity in data subset selection and active learning. In International Conference on Machine Learning, pages 1954–1963. PMLR.
  • Wei et al. (2014a) Kai Wei, Yuzong Liu, Katrin Kirchhoff, Chris Bartels, and Jeff Bilmes. 2014a. Submodular subset selection for large-scale speech training data. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3311–3315. IEEE.
  • Wei et al. (2014b) Kai Wei, Yuzong Liu, Katrin Kirchhoff, and Jeff Bilmes. 2014b. Unsupervised submodular subset selection for speech data. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4107–4111. IEEE.
  • Yang et al. (2020) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2020. XLNet: Generalized autoregressive pretraining for language understanding.
  • Yao et al. (2022) Xingcheng Yao, Yanan Zheng, Xiaocong Yang, and Zhilin Yang. 2022. NLP from scratch without large-scale pretraining: A simple and efficient framework. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 25438–25451. PMLR.
  • Zafrir et al. (2021) Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, and Moshe Wasserblat. 2021. Prune once for all: Sparse pre-trained language models. arXiv preprint arXiv:2111.05754.
  • Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

APPENDIX

Appendix A Code, Software, and Licenses

The data and code for Ingenious are available at https://github.com/Efficient-AI/ingenious. We release the Ingenious code repositories under the MIT license, making them freely available for everyone to use.

All the code is developed using the open-source HuggingFace Transformers library for training LMs, with PyTorch as the underlying framework. PyTorch is available under the BSD license, and HuggingFace Transformers under the Apache 2.0 license. For submodular optimization, we use the SUBMODLIB library (Kaushal et al., 2022), which is freely available under the MIT license at https://github.com/decile-team/submodlib.

Appendix B Additional Background and Related Work

Submodular Functions: Let $\mathcal{U} = \{1, 2, 3, \ldots, n\}$ denote the unlabeled set of $n$ data points and let $f: 2^{\mathcal{U}} \rightarrow \mathbb{R}$ be a set function. Formally, a function $f$ is submodular (Fujishige, 2005; Bilmes, 2022) if $f(\mathcal{A} \cup \{x\}) - f(\mathcal{A}) \geq f(\mathcal{B} \cup \{x\}) - f(\mathcal{B})$ for all $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{U}$ and $x \in \mathcal{U} \setminus \mathcal{B}$; intuitively, the marginal gain of adding an element diminishes as the set it is added to grows. For a set $\mathcal{A} \subseteq \mathcal{U}$, $f(\mathcal{A})$ provides a real-valued score for $\mathcal{A}$. A function $f$ is said to be monotone if $f(\mathcal{A}) \leq f(\mathcal{B})$ whenever $\mathcal{A} \subseteq \mathcal{B}$. Further, $f$ is supermodular if $-f$ is submodular, modular if it is both submodular and supermodular, and normalized if $f(\emptyset) = 0$.
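As a small illustration (a self-contained sketch, not part of the Ingenious framework), the diminishing-returns property can be checked numerically for a simple coverage function $f(\mathcal{A}) = |\bigcup_{i \in \mathcal{A}} S_i|$, a classic monotone submodular function:

```python
# Coverage function: f(A) = number of distinct elements covered by the sets
# indexed by A. Coverage is monotone, submodular, and normalized.
SETS = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}

def coverage(A):
    """f(A) = |union of SETS[i] for i in A|; f(empty set) = 0."""
    covered = set()
    for i in A:
        covered |= SETS[i]
    return len(covered)

def gain(A, x):
    """Marginal gain of adding element x to set A."""
    return coverage(A | {x}) - coverage(A)

# Submodularity: for A ⊆ B and x ∉ B, gain(A, x) >= gain(B, x).
A, B, x = {1}, {1, 2}, 3
print(gain(A, x), gain(B, x))  # the gain shrinks as the context set grows
```

Here adding set 3 to $\{1\}$ contributes three new elements, but only two when added to the larger set $\{1, 2\}$, exactly the diminishing-returns inequality above.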
Submodularity occurs naturally in various real-world applications (Tohidi et al., 2020; Bach, 2013, 2019; Iyer, 2015), and a number of combinatorial functions, such as facility location, set cover, log determinant, and graph cut (Iyer et al., 2021; Iyer and Bilmes, 2019; Kothawade et al., 2020, 2021; Karanam et al., 2022), are inherently submodular. Submodularity is particularly attractive because cardinality-constrained submodular maximization admits a constant-factor $1 - \frac{1}{e}$ approximation (Nemhauser et al., 1978), allowing us to tackle various combinatorial optimization problems that are NP-hard in general. Several recent works (Wei et al., 2014a, 2015; Mirzasoleiman et al., 2020; Killamsetty et al., 2021b, a, c, 2022; Kothawade et al., 2021; Karanam et al., 2022) have formulated the subset selection objective as a submodular maximization problem. Furthermore, variants of the greedy algorithm (Mirzasoleiman et al., 2015; Iyer and Bilmes, 2019) have been proposed that maximize a submodular function in near-linear time.
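The classic greedy algorithm behind the $1 - \frac{1}{e}$ guarantee can be sketched in a few lines (function names here are illustrative, not taken from the Ingenious codebase):

```python
# Greedy algorithm for cardinality-constrained monotone submodular
# maximization (Nemhauser et al., 1978).
def greedy_maximize(f, ground_set, budget):
    """Greedily grow a subset of size `budget`; for monotone submodular f,
    the result is within a (1 - 1/e) factor of the optimal subset value."""
    selected = set()
    for _ in range(budget):
        remaining = ground_set - selected
        if not remaining:
            break
        # Choose the element with the largest marginal gain f(S ∪ {x}) − f(S).
        best = max(remaining, key=lambda x: f(selected | {x}) - f(selected))
        selected.add(best)
    return selected

# Example with a coverage objective: f(A) = number of distinct items covered.
SETS = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"e"}}

def f(A):
    return len(set().union(*(SETS[i] for i in A))) if A else 0

chosen = greedy_maximize(f, set(SETS), budget=2)
print(chosen, f(chosen))  # a 2-element subset covering all 5 items
```

The near-linear-time variants cited above (e.g., lazy greedy) keep the same selection logic but avoid recomputing marginal gains for every candidate at every step.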

Submodular Data Subset Selection: Submodular optimization has been successfully employed for data subset selection in various applications such as speech recognition (Wei et al., 2014b, a; Mittal et al., 2022), machine translation (Kirchhoff and Bilmes, 2014), active learning (Wei et al., 2015; Kothawade et al., 2021), and efficient deep learning (Kaushal et al., 2019; Killamsetty et al., 2022; Pooladzandi et al., 2022). Another active area of research is selecting representative subsets of data, also known as coresets (Feldman, 2020). A coreset is a weighted subset of data that closely approximates certain desirable properties of the entire dataset, e.g., the loss function (Feldman, 2020). Coreset selection has been shown to benefit a host of geometric problems such as $k$-means and $k$-median clustering (Har-Peled and Mazumdar, 2004) and, more recently, has been used successfully for efficient Bayesian inference (Campbell and Broderick, 2018) and for improving training efficiency (Mirzasoleiman et al., 2020; Killamsetty et al., 2021a). Such informative data subset selection has shown remarkable promise for efficient and robust training of deep models (Killamsetty et al., 2021b, c). We direct the reader to the survey by Bilmes (2022) for a detailed review of submodularity and subset selection for ML.
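To make the representative-subset idea concrete, here is a toy sketch of greedy selection under a facility-location objective, $f(\mathcal{A}) = \sum_{i \in \mathcal{U}} \max_{j \in \mathcal{A}} \mathrm{sim}(i, j)$ (the points, similarity kernel, and function names are illustrative assumptions, not the actual Ingenious pipeline):

```python
import math

# Toy "embeddings": two well-separated clusters of points.
POINTS = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

def sim(i, j):
    """Similarity kernel between points i and j (higher = more alike)."""
    return 1.0 / (1.0 + math.dist(POINTS[i], POINTS[j]))

def facility_location(A):
    """f(A) = sum over all points of their similarity to the closest
    selected 'facility' in A; monotone submodular."""
    if not A:
        return 0.0
    return sum(max(sim(i, j) for j in A) for i in range(len(POINTS)))

def select(budget):
    """Greedy maximization of the facility-location objective."""
    selected = set()
    for _ in range(budget):
        rest = set(range(len(POINTS))) - selected
        best = max(rest, key=lambda x: facility_location(selected | {x}))
        selected.add(best)
    return selected

reps = select(2)  # picks one representative per cluster
```

Because every point must be "served" by some selected facility, maximizing this objective naturally spreads the budget across distinct regions of the data, which is why facility location is a popular choice for representative subset selection.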

Table 4: Comparison of pre-training cost and fine-tuning performance on GLUE tasks (averaged over 20 runs) for BERT. We report the difference relative to full pre-training of vanilla BERT in brackets for cost and avg. GLUE score. Ingenious achieves 98.6% of fully pre-trained BERT performance while reducing the pre-training cost to ~28%.

Table 5: Ablation study by varying subset size of selected subsets. We report mean GLUE score to compare Ingenious variants.

Table 6: Ablation study by varying partition size used during pre-training. We report mean GLUE score to compare Ingenious variants.

Appendix C Datasets

For pre-training BERT, we use the English Wikipedia and BooksCorpus datasets. English Wikipedia is ~20 GiB of text containing 6,458,670 articles, and BooksCorpus is ~5 GiB of text containing 74,004,228 lines of text. For GPT-2, we use OpenWebText, an open-source replication of OpenAI's WebText dataset; it is ~40 GiB of text comprising 8,013,769 articles.

Appendix D Compute Infrastructure

All our pre-training runs were done on Google Cloud Platform (GCP) instances comprising 8 NVIDIA A100-SXM4-40GB GPUs for BERT (bert-base-uncased: 110M parameters) and 16 NVIDIA A100-SXM4-40GB GPUs for GPT-2 (gpt2-small: 124M parameters). Each instance has 96 CPU cores and a total of 680 GiB of RAM. Costs are estimated using https://cloud.google.com/products/calculator based on the time taken for training.

Appendix E Pre-Training performance of Ingenious for BERT

In Table 7, we show how validation-set losses evolve for vanilla BERT and Ingenious BERT over the course of pre-training.

Appendix F GLUE Task wise performance of Ingenious for BERT

We show task-wise performance on GLUE for BERT trained through Ingenious in Table 4, comparing against vanilla LM pre-training and the baselines. We also report the standard deviation for each task along with the mean GLUE score.

Appendix G Further implementation details of pre-training BERT through Ingenious

We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-4, $\beta_1 = 0.9$, $\beta_2 = 0.99$, and L2 weight decay of 0.01. We warm-start the model for the first 80K training steps and subsequently train only on selected subsets. Training of all the models is performed on 8 NVIDIA A100-SXM4-40GB GPUs.

Appendix H Implementation details of pre-training GPT-2 through Ingenious

For GPT-2, we use OpenWebText (an open-source replication of OpenAI's WebText dataset) as the pre-training corpus and employ the CLM task for pre-training, following the details in the work of Radford et al. (2019). We pre-train vanilla GPT-2 with a batch size of 256 (achieved using gradient accumulation over 2 steps) for 1,000,000 steps. With Ingenious, we pre-train GPT-2 for 250,000 steps and set the value of $R$ to 25K steps. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-4, $\beta_1 = 0.9$, $\beta_2 = 0.99$, and L2 weight decay of 0.01. We warm-start the model for the first 65K training steps and subsequently train only on selected subsets. Training of all the models is performed on 16 NVIDIA A100-SXM4-40GB GPUs.

Appendix I Implementation details of pre-training BioBERT through Ingenious

We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-4, $\beta_1 = 0.9$, $\beta_2 = 0.99$, and L2 weight decay of 0.01. For BioBERT training using Ingenious, we use a subset size of 25%, an $R$ value of 5000, and no model warm-start (i.e., $W = 0$), and train the BioBERT model for 200,000 steps.

Table 7: Comparison of validation set losses during pre-training. Ingenious achieves a validation set loss comparable to that of vanilla BERT.

Appendix J Subset size for efficiency gains

We study the effect of the size of the subset selected through Ingenious for pre-training BERT. In Table 5, we analyse subset sizes of 10%, 15%, 20%, 25%, and 30% and evaluate the fine-tuning performance on GLUE. While smaller subset sizes (10-20%) result in inferior performance, since the LM is exposed to less information, optimal performance is observed when 25% of the pre-training corpus is used; hence, we report the corresponding results in Table 1.

Appendix K Partitions for efficient subset selection

As discussed in the approach, we divide the pre-training dataset into partitions. In Table 6, we analyse the impact on GLUE performance as the number of partitions is varied. Using the fewest partitions (1500) yields optimal performance. This aligns with the intuition that fewer partitions enable better subset selection: each partition contains more samples, allowing more representative samples to be selected overall.
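The partition-wise scheme can be sketched as follows (function names and the toy modular score are illustrative assumptions, not the released Ingenious implementation): the corpus indices are split into partitions, a greedy selection runs within each partition, and the per-partition picks are pooled.

```python
def partition_and_select(n_samples, partition_size, fraction, score):
    """Split [0, n_samples) into contiguous partitions and greedily keep
    the top `fraction` of each partition. `score(selected, x)` returns the
    marginal utility of adding sample x to the already-selected set."""
    chosen = []
    for start in range(0, n_samples, partition_size):
        part = list(range(start, min(start + partition_size, n_samples)))
        budget = max(1, int(fraction * len(part)))
        picked = set()
        for _ in range(budget):
            rest = [i for i in part if i not in picked]
            best = max(rest, key=lambda x: score(picked, x))
            picked.add(best)
        chosen.extend(sorted(picked))
    return chosen

# Toy example: a modular score where each sample's utility is independent
# of what has already been picked.
utility = {i: (i * 7) % 11 for i in range(10)}
subset = partition_and_select(10, partition_size=5, fraction=0.4,
                              score=lambda S, x: utility[x])
```

Larger partitions give the greedy step a bigger pool of candidates to compare, which matches the observation above that fewer (larger) partitions yield more representative overall subsets, at the cost of more expensive per-partition optimization.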

Appendix L Few Examples of informative texts sampled by Ingenious

We summarize the three types of redundancies that we found in our analysis of selected subsets. More examples can be found at https://github.com/Efficient-AI/ingenious.

Type 1: Same information conveyed by multiple sentences in different documents.

*   Sentence 1: "separate sovereign countries but acted as a single bloc in foreign policy and security issues. the proposed union was being discussed by a joint scandinavian committee during the winter of 1948 – 1949, but the cold war tension between the united states and the soviet union, and preparations for a western alliance that would result in the north atlantic treaty overshadowed the effort. when it became"

*   Sentence 2: "they would remain separate sovereign countries but act as a single block in foreign policy and security issues. the proposed union was discussed by a joint scandinavian committee during the winter of 1948 – 1949, but in the end the cold war tension between the united states and the soviet union and preparations for a western alliance that would result in"

Type 2: Duplicates in the corpus.

*   Sentence 1: "after we’d been handed our menus. i always get the frozen hot chocolate.” frozen hot chocolate? it’s really a thing? i thought they just made that up.” no,” i said, pointing to the spot on her menu. see? it’s right there.” so, do you order anything else?” cake.” she looked at my deadpanned face and laughed. so, we’re"

*   Sentence 2: "what’s good here?” mia asked me after we’d been handed our menus. i always get the frozen hot chocolate.” frozen hot chocolate? it’s really a thing? i thought they just made that up.” no,” i said, pointing to the spot on her menu. see? it’s right there."

Type 3: Recurring patterns of text.

*   Sentence 1: "according to the united states census bureau, the village has a total area of, all land. demographics 2010 census as of the census of 2010, there were 377 people, 159 households, and 101 families residing in the village. the population density was. there were 176 housing units at"

*   Sentence 2: "according to the united states census bureau, the village has a total area of, all land. demographics 2010 census as of the census of 2010, there were 801 people, 323 households, and 225 families living in the village. the population density was. there were 358 housing units at"
