Title: Text Quality-Based Pruning for Efficient Training of Language Models
URL Source: https://arxiv.org/html/2405.01582
Published Time: Tue, 14 May 2024 00:09:20 GMT
Vasu Sharma∗, Karthik Padthe∗, Newsha Ardalani, Kushal Tirumala, Russell Howes
Hu Xu, Po-Yao Huang, Shang-Wen Li, Armen Aghajanyan, Gargi Ghosh, Luke Zettlemoyer
FAIR, Meta
Abstract
In recent times, training Language Models (LMs) has relied on computationally heavy training over massive datasets, making the training process extremely laborious. In this paper we propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets in a model-agnostic manner, assigning each text instance a "quality score". By proposing this text quality metric, the paper establishes a framework to identify and eliminate low-quality text instances, leading to improved training efficiency for LMs. Experimental results over multiple models and datasets demonstrate the efficacy of this approach, showcasing substantial gains in training effectiveness and highlighting the potential for resource-efficient LM training. For example, we observe an absolute accuracy improvement of 0.9% averaged over 14 downstream evaluation tasks for multiple LMs while using 40% less data and training 42% faster on the OpenWebText dataset, and a 0.8% average absolute accuracy improvement while using 20% less data and training 21% faster on the Wikipedia dataset.
1 Introduction
Language Models (LMs) have gained significant attention in recent years due to their impressive performance on various natural language processing (NLP) tasks Zhang et al. (2022); Penedo et al. (2023); Touvron et al. (2023); Zhou et al. (2023); Liu et al. (2019). However, their training often relies on computationally intensive procedures involving massive datasets and heavy compute requirements, which hinders training large-scale LMs on noisy real-world or domain-specific datasets. Worse, several of these datasets are uncurated and may contain harmful content which the LM can potentially pick up during training Deshpande et al. (2023); Schramowski et al. (2022); Kuchnik et al. (2023).
Text quality evaluation plays a crucial role in assessing the suitability and reliability of textual data for training LMs. Previous research has explored various approaches to text quality assessment, primarily focusing on human annotation and subjective judgments. For instance, Clark et al. (2021) introduce a crowdsourcing-based method for ranking text quality, where human evaluators provide subjective ratings. While such approaches provide valuable insights, they suffer from scalability limitations and subjectivity biases. To overcome these limitations, more recent works have explored automated approaches to quality evaluation, such as using ChatGPT or GPT-4 to evaluate text quality, where text is designated high quality if ChatGPT/GPT-4 deems it similar to human text Gilardi et al. (2023); Liu et al. (2023). However, these methods are model dependent and require massive LLMs, which defeats the purpose of efficient LM training.
We address this issue by proposing a novel method for numerically evaluating text quality in large unlabelled NLP datasets, with the aim of improving LM training performance and efficiency. We also ensure that our text quality metric is model agnostic, avoiding the need to recompute quality metrics for each model. By leveraging this numerical text quality score, we demonstrate how the original dataset can be pruned, enabling the training of LMs on only a fraction of the data. Our approach aims to identify and eliminate low-quality text instances, thereby streamlining the training process and mitigating the burden of handling large-scale datasets. We also remove potentially harmful content from the data by ensuring that harmful content is rated poorly by our text quality score, so that it can be pruned. We observe an absolute improvement of 0.9% averaged over 14 downstream evaluation tasks for multiple LMs while using 40% less data and training 42% faster on the OpenWebText dataset Gokaslan et al. (2019), and a 0.8% absolute improvement averaged over 3 models and 14 downstream tasks on the Wikipedia dataset Tunstall et al. while using 20% less data and training time.
The key contribution of this paper lies in establishing a framework that quantitatively evaluates text quality in a model agnostic manner and subsequently guides the pruning of NLP datasets for LM training. By leveraging this quality score metric, we enable a more efficient allocation of computational resources and reduce the data requirements for training LMs. This approach not only expedites the training process but also enhances the overall effectiveness of the models. To the best of our knowledge, there doesn’t exist an objective way to evaluate the quality of large scale textual datasets and we hope this work will pave the way for further work in this space.
2 Methodology
2.1 Computing Text Quality
The notion of text "quality" is a fairly ambiguous one. Presently, no concrete and objective method exists for quantitatively evaluating data quality. In this section, we combine commonly used heuristics from the literature to formulate a comprehensive definition of text quality. We demonstrate the effectiveness of our approach on English text only, but the filters and method can be easily extended to other languages. Our proposed method has two steps:
- •Weight calculation: In this step we use 14 heuristic filters covering a wide range of linguistic characteristics such as text complexity (measured using parse-tree depth and structure), word repetition ratio, syntax (based on the presence of and relations between objects, nouns, and determiners), and text length. The full list of heuristic filters, each identifying text with the attributes of a well-formed sentence, is given in Table 2. We apply each filter individually to a dataset to obtain, for each filter, a subset of the text instances qualifying under that filter. These subsets, along with the original unfiltered dataset, serve as evaluation sets for a pre-trained LM, on which we compute validation perplexity ($PPL$). We implement our filters using spaCy Honnibal et al. (2020), and for validation perplexity we use a HuggingFace pre-trained language model Wolf et al. (2020). We then calculate the weight for each heuristic as:
$$w_i = \max\left(0, \frac{PPL_{all} - PPL_i}{PPL_{all}}\right) \tag{1}$$
Here $w_i$ is the weight for the $i^{th}$ filter with $i = 1, 2, \ldots, 14$, $PPL_i$ is the perplexity of the subset created by applying filter $i$, and $PPL_{all}$ is the perplexity of the unfiltered dataset. We lower-bound the weights at 0 for filters where $PPL$ goes up, to avoid negative weights. The 14 chosen filters were selected from a diverse pool of over 50 filters based on consistent perplexity improvements, which led to them consistently being assigned higher weights. The final weight assigned to each filter is presented in Figure 2. The simplicity of the chosen filters makes the quality scores extremely fast to compute while increasing their ability to generalize across datasets. For example, it takes 26.41s to compute scores for 10k lines of text on a single CPU core, and the computation can be easily parallelized across multiple cores with linear speedup in throughput.
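Equation 1 can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: the `filter_weights` helper and the filter names are hypothetical, and the perplexity values would in practice come from evaluating a pre-trained LM on each filtered subset.

```python
def filter_weights(ppl_all, ppl_per_filter):
    """Compute w_i = max(0, (PPL_all - PPL_i) / PPL_all) for each filter (Eq. 1).

    ppl_all: perplexity of the LM on the unfiltered dataset.
    ppl_per_filter: mapping of filter name -> perplexity on that filter's subset.
    """
    return {
        name: max(0.0, (ppl_all - ppl_i) / ppl_all)
        for name, ppl_i in ppl_per_filter.items()
    }

# Hypothetical numbers: the unfiltered set has PPL 30; one filtered subset
# lowers perplexity (good filter), the other raises it (weight clipped to 0).
weights = filter_weights(30.0, {"parse_depth": 24.0, "repetition": 33.0})
print(weights)  # → {'parse_depth': 0.2, 'repetition': 0.0}
```

Clipping at zero means a filter whose subset is *harder* for the LM than the full dataset simply contributes nothing to the score, rather than penalizing lines that pass it.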
- •Quality scoring: In this step each document in the dataset is split into lines based on common sentence-end markers such as periods or HTML end tags. All heuristic filters are applied to each line, producing an indicator matrix $I$ where $I_i(line) = 1$ indicates that $line$ satisfies the $i^{th}$ filter's criteria. We then use the weights calculated in the previous step to obtain a per-line quality score:
$$score_{line} = \frac{\sum_{i=1}^{F} w_i \, I_i(line)}{\sum_{i=1}^{F} w_i} \tag{2}$$
Here $score_{line}$ is the quality score assigned to $line$, $w_i$ is the weight for filter $i$, $F$ is the number of filters used, and $I_i$ is the indicator function for filter $i$. We then aggregate the per-line scores into a document-level score by taking a weighted average over the document's lines, with each line weighted in proportion to its token length:
$$score_{doc} = \frac{\sum_{line=1}^{n} tc_{line} \, score_{line}}{\sum_{line=1}^{n} tc_{line}} \tag{3}$$
Here $score_{doc}$ is the aggregated quality score for the document, $tc_{line}$ is the token count of a line, $score_{line}$ is the line's score per Equation 2, and $n$ is the total number of lines in the document.
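The scoring step (Equations 2 and 3) can be sketched as follows. This is a minimal sketch: the three stand-in filters and their weights are invented for illustration, not the paper's 14 spaCy-based heuristics, and tokens are approximated by whitespace splitting.

```python
# Illustrative (filter, weight) pairs; the real method uses 14 spaCy-based
# heuristics with weights learned from perplexity improvements (Eq. 1).
FILTERS = {
    "nonempty":         (1.00, lambda line: bool(line.strip())),
    "ends_with_period": (0.50, lambda line: line.rstrip().endswith(".")),
    "min_length":       (0.25, lambda line: len(line.split()) >= 3),
}

def line_score(line):
    """Eq. 2: weight-normalized sum of the filters the line passes."""
    total_w = sum(w for w, _ in FILTERS.values())
    passed_w = sum(w for w, check in FILTERS.values() if check(line))
    return passed_w / total_w

def doc_score(lines):
    """Eq. 3: token-count-weighted average of the per-line scores."""
    token_counts = [len(line.split()) for line in lines]
    weighted_sum = sum(tc * line_score(l) for tc, l in zip(token_counts, lines))
    return weighted_sum / sum(token_counts)

doc = ["This is a well formed sentence.", "bad line"]
print(round(doc_score(doc), 3))  # → 0.893
```

The token-length weighting in `doc_score` means one long, clean paragraph outweighs several short noisy fragments, which matches the intent of scoring whole documents rather than counting lines.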
Our method is completely model agnostic and relies solely on the underlying data and hence can be generalized to the training of any downstream LM model.
2.2 Quality guided data pruning
With the computed text quality scores, we prune the dataset to a desired fraction by retaining the highest-quality samples. The threshold can be determined based on the specific requirements of the LM training task and the available computational resources. Instances with quality scores below the threshold are considered low quality and removed from the dataset; the remaining high-quality instances form the pruned dataset for subsequent LM training. By training the LM on the pruned dataset, we demonstrate that the model can achieve comparable or even improved performance with significantly fewer training instances. In this work we use percentile-based pruning, selecting the data subsets with quality scores in the top 20%, 40%, 60%, and 80%, and comparing their performance against models trained on the unpruned datasets as baselines.
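Percentile-based pruning reduces to a sort-and-slice over document scores. The sketch below uses made-up document identifiers and scores; in practice the scores would come from the document-level scoring of Section 2.1.

```python
def prune_top_fraction(docs_with_scores, keep_fraction):
    """Keep the `keep_fraction` highest-scoring documents (percentile pruning).

    docs_with_scores: list of (document, quality_score) pairs.
    keep_fraction: fraction of the corpus to retain, e.g. 0.6 keeps the top 60%.
    """
    ranked = sorted(docs_with_scores, key=lambda ds: ds[1], reverse=True)
    n_keep = int(len(ranked) * keep_fraction)
    return [doc for doc, _ in ranked[:n_keep]]

# Hypothetical corpus of five documents with precomputed quality scores.
corpus = [("doc_a", 0.91), ("doc_b", 0.42), ("doc_c", 0.77),
          ("doc_d", 0.15), ("doc_e", 0.63)]
print(prune_top_fraction(corpus, 0.6))  # → ['doc_a', 'doc_c', 'doc_e']
```

Because the scores are model agnostic, this ranking is computed once per dataset and the same pruned subsets (top 20%, 40%, 60%, 80%) can be reused across all the models trained in the experiments.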
Figure 1: Change in accuracy for pruned datasets compared to no pruning for OpenWebText and Wikipedia data
3 Experimental Details
3.1 Datasets
We experiment with English-only versions of the following datasets for our study:
- •Wikipedia Tunstall et al.: This dataset is built from the Wikipedia dump, where each sample contains a whole Wikipedia article. It contains 4.67 billion tokens before pruning or splitting into train and validation sets.
- •OpenWebText Gokaslan et al. (2019): This dataset is the open-source version of the WebText dataset used for GPT-2 training. It was built from URLs with HTML content drawn from the Reddit submissions dataset. The base version contains 9.03 billion tokens.
Table 1: Samples of lines with assigned quality scores.
3.2 Models
To ensure the consistency and generalizability of our study, we experiment with a diverse set of popular models including GPT2 Radford et al. (2019), GPT-Neo-125M Black et al. (2022), Pythia-160M Biderman et al. (2023), and OPT-125M Zhang et al. (2022). All models are trained from scratch for 15 epochs with a batch size of 128; we use the HuggingFace Trainer to train our models.
3.3 Evaluation
We follow an evaluation setup consistent with OPT Zhang et al. (2022). We calculate validation perplexity for each dataset, where the validation set is 20% of the whole dataset, sampled before pruning and removed from the training data. We also evaluate 0-shot accuracy of all trained models on 14 downstream NLP tasks: ARC Challenge and ARC Easy Clark et al. (2018), HellaSwag Zellers et al. (2019), OpenBookQA Mihaylov et al. (2018), PIQA Bisk et al. (2019), StoryCloze Schwartz et al. (2017), Winograd Levesque et al. (2012), Winogrande Sakaguchi et al. (2019), and tasks from SuperGLUE Wang et al. (2020). We use lm-evaluation-harness Gao et al. (2021) for downstream-task evaluation.
4 Results and Analysis
We compute the text quality score for the OpenWebText and Wikipedia datasets. Table 1 shows some sample texts from these datasets and the quality scores they are assigned by our method. As can clearly be seen, sentences that are higher quality in terms of content, grammar, and linguistic structure are consistently rated higher by our approach.
Next we analyze the results of our pruning experiments, which use the quality score to eliminate lower-quality samples. Figure 1 presents the average change in accuracy (%) relative to models trained on the unpruned datasets as the baseline. Accuracy is averaged over the 14 downstream tasks described in the previous section; variations in individual task accuracies are presented in the Appendix. For most models, performance improves with pruning up to a threshold and then declines sharply. On OpenWebText, most models achieve peak performance at around the 40% pruning level, while on Wikipedia the peak occurs at around the 20% pruning level. This points to the presence of a subset of low-quality data in these datasets which can be removed from training without hurting downstream performance, while significantly improving data efficiency and reducing training time. Note that the trends in downstream performance are consistent yet somewhat noisy, as has often been observed in prior literature Zhang et al. (2022); Wang et al. (2020).
We further analyze the variation in validation-set perplexity for GPT2 Radford et al. (2019), Pythia-160M Biderman et al. (2023), and OPT-125M Zhang et al. (2022) trained at different pruning levels on both OpenWebText and Wikipedia. The results reveal a consistent trend: the perplexity of the trained LMs increases as more data is pruned, as can be seen in Figure 1. The increase is fairly gradual up to a certain level (20% for Wikipedia and 40% for OpenWebText) and then accelerates significantly beyond that pruning level. This sudden increase beyond a threshold suggests that the data pruned past that point is potentially high-quality data.
The contributions of our work extend beyond the immediate scope of LM training. The introduced text quality evaluation framework provides a foundation for further advancements in the field, enabling researchers to objectively assess the quality of large-scale textual datasets. This paves the way for future research on improving data curation, dataset selection, and the development of automated methods for text quality assessment.
Limitations
While our research provides promising results and demonstrates the effectiveness of text quality evaluation and dataset pruning for improving the training efficiency of Language Models (LMs), there are several limitations that should be considered. These limitations highlight the potential areas for further investigation and exploration in future research.
4.1 Generalizability to Larger Models
One limitation of our work is that we primarily focus on LM models with a relatively smaller number of parameters. The effectiveness of our approach needs to be further tested and validated on much larger models, such as models with hundreds of billions of parameters like Falcon40B Almazrouei et al. (2023), LLaMa Touvron et al. (2023), OPT-175B Zhang et al. (2022) among others. Larger models often exhibit different training dynamics and may require different considerations when it comes to dataset pruning. Therefore, future research should investigate the scalability and applicability of our methodology to such larger models.
4.2 Scalability to Larger Datasets
Another limitation is the scale of the datasets used in our experiments. While we have conducted experiments on large-scale datasets, future research should explore the effectiveness of our approach on even larger datasets, involving billions of samples like the Pile dataset Gao et al. (2020). Training LLM models on such massive datasets poses unique challenges in terms of computational resources, data storage, and training time. Evaluating the scalability and practicality of our approach on such datasets will provide a more comprehensive understanding of its potential benefits and limitations.
4.3 Evaluation Metrics and Robustness
We have primarily evaluated the effectiveness of our approach based on standard evaluation metrics such as perplexity on the validation set and accuracy on 14 downstream evaluation tasks. However, the evaluation of LM models goes beyond these metrics, and future research should explore additional evaluation criteria such as robustness, fairness, and interpretability. Understanding the impact of dataset pruning on these aspects will provide a more comprehensive assessment of our approach’s efficacy.
Ethics Statement
While our work addresses the issue of harmful content in datasets through the application of text quality evaluation, ethical considerations surrounding bias, fairness, and inclusivity in LM training remain significant challenges. Further research is needed to develop methodologies that effectively address these ethical concerns and ensure the responsible deployment of LM models in real-world applications.
References
- Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. Falcon-40B: an open large language model with state-of-the-art performance.
- Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373.
- Bisk et al. (2019) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2019. Piqa: Reasoning about physical commonsense in natural language.
- Black et al. (2022) Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. 2022. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745.
- Clark et al. (2021) Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. 2021. All that’s ’human’ is not gold: Evaluating human evaluation of generated text.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge.
- Deshpande et al. (2023) Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in chatgpt: Analyzing persona-assigned language models.
- Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The pile: An 800gb dataset of diverse text for language modeling.
- Gao et al. (2021) Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. A framework for few-shot language model evaluation.
- Gilardi et al. (2023) Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. Chatgpt outperforms crowd-workers for text-annotation tasks.
- Gokaslan et al. (2019) Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. 2019. Openwebtext corpus.
- Honnibal et al. (2020) Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python.
- Kuchnik et al. (2023) Michael Kuchnik, Virginia Smith, and George Amvrosiadis. 2023. Validating large language models with relm.
- Levesque et al. (2012) Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2012. The winograd schema challenge. AAAI Press.
- Liu et al. (2023) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: Nlg evaluation using gpt-4 with better human alignment.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering.
- Penedo et al. (2023) Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only.
- Radenovic et al. (2023) Filip Radenovic, Abhimanyu Dubey, Abhishek Kadian, Todor Mihaylov, Simon Vandenhende, Yash Patel, Yi Wen, Vignesh Ramanathan, and Dhruv Mahajan. 2023. Filtering, distillation, and hard negatives for vision-language pre-training. arXiv preprint arXiv:2301.02280.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Sakaguchi et al. (2019) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Winogrande: An adversarial winograd schema challenge at scale.
- Schramowski et al. (2022) Patrick Schramowski, Cigdem Turan, Nico Andersen, Constantin A. Rothkopf, and Kristian Kersting. 2022. Large pre-trained language models contain human-like biases of what is right and wrong to do.
- Schwartz et al. (2017) Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. 2017. Story cloze task: UW NLP system. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics. Association for Computational Linguistics.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models.
- Tunstall et al. Lewis Tunstall, Mariama Barham, Thomas Wolf, Quentin Lhoest, and Patrick von Platen. Wikimedia downloads.
- Wang et al. (2020) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2020. Superglue: A stickier benchmark for general-purpose language understanding systems.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Perric Cistac, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. pages 38–45. Association for Computational Linguistics.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence?
- Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
- Zhou et al. (2023) Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. 2023. Lima: Less is more for alignment.
5 Appendix
Figure 2: Assigned weights for all the filters.
Table 2: Set of heuristics used for quality score calculation.
Figure 3: Change in accuracy of models trained on pruned data compared to unpruned data for all the 14 tasks on OpenWebText
Figure 4: Change in accuracy of models trained on pruned data compared to unpruned data for all the 14 tasks on Wikipedia