diff --git "a/data/documents/content/e3591269-f449-4f88-8185-7d98ef4c4845.txt" "b/data/documents/content/e3591269-f449-4f88-8185-7d98ef4c4845.txt" deleted file mode 100644--- "a/data/documents/content/e3591269-f449-4f88-8185-7d98ef4c4845.txt" +++ /dev/null @@ -1,4802 +0,0 @@ ---- Page 1 --- -A Comprehensive Overview of Large Language Models -Humza Naveeda, Asad Ullah Khanb,∗, Shi Qiuc,∗, Muhammad Saqibd,e,∗, Saeed Anwarf,g, Muhammad Usmanf,g, Naveed Akhtarh,j, -Nick Barnesi, Ajmal Mianj -aThe University of Sydney, Sydney, Australia -bUniversity of Engineering and Technology (UET), Lahore, Pakistan -cThe Chinese University of Hong Kong (CUHK), HKSAR, China -dUniversity of Technology Sydney (UTS), Sydney, Australia -eCommonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia -fKing Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia -gSDAIA-KFUPM Joint Research Center for Artificial Intelligence (JRCAI), Dhahran, Saudi Arabia -hThe University of Melbourne (UoM), Melbourne, Australia -iAustralian National University (ANU), Canberra, Australia -jThe University of Western Australia (UWA), Perth, Australia -Abstract -Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and -beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse -topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs, -robotics, datasets, benchmarking, e fficiency, and more. With the rapid development of techniques and regular breakthroughs in -LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering -the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise -yet comprehensive overview of the recent developments in this field. This article provides an overview of the literature on a broad -range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background concepts -along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to provide not only a -systematic survey but also a quick, comprehensive reference for the researchers and practitioners to draw insights from extensive, -informative summaries of the existing works to advance the LLM research. -Keywords: -Large Language Models, LLMs, chatGPT, Augmented LLMs, Multimodal LLMs, LLM training, LLM Benchmarking -1. Introduction -Language plays a fundamental role in facilitating commu- -nication and self-expression for humans and their interaction -with machines. The need for generalized models stems from -the growing demand for machines to handle complex language -tasks, including translation, summarization, information re- -trieval, conversational interactions, etc. Recently, significant -breakthroughs have been witnessed in language models, pri- -marily attributed to transformers [1], increased computational -capabilities, and the availability of large-scale training data. -These developments have brought about a revolutionary trans- -formation by enabling the creation of LLMs that can approxi- -mate human-level performance on various tasks [2, 3]. 
* Equal contribution.
Email addresses: humza_naveed@yahoo.com (Humza Naveed), aukhanee@gmail.com (Asad Ullah Khan), shiqiu@cse.cuhk.edu.hk (Shi Qiu), muhammad.saqib@data61.csiro.au (Muhammad Saqib), saeed.anwar@kfupm.edu.sa (Saeed Anwar), muhammad.usman@kfupm.edu.sa (Muhammad Usman), naveed.akhtar1@unimelb.edu.au (Naveed Akhtar), nick.barnes@anu.edu.au (Nick Barnes), ajmal.mian@uwa.edu.au (Ajmal Mian)

Figure 1: The trend of papers released over the years containing the keywords "Large Language Model", "Large Language Model + Fine-Tuning", and "Large Language Model + Alignment".

Figure 2: Chronological display of LLM releases (2019-2024): blue cards represent 'pre-trained' models, while orange cards correspond to 'instruction-tuned' models. Models on the upper half signify open-source availability, whereas those on the bottom are closed-source. The chart illustrates the increasing trend towards instruction-tuned and open-source models, highlighting the evolving landscape and trends in natural language processing research.

Large Language Models (LLMs) have emerged as cutting-edge artificial intelligence systems that can process and generate text with coherent communication [4] and generalize to multiple tasks [5, 6].
The historical progress in natural language processing (NLP) evolved from statistical to neural language modeling and then from pre-trained language models (PLMs) to LLMs. While conventional language modeling (LM) trains task-specific models in supervised settings, PLMs are trained in a self-supervised setting on a large corpus of text [7, 8, 9] with the aim of learning a generic representation that is shareable among various NLP tasks. After fine-tuning for downstream tasks, PLMs surpass the performance of traditional language modeling. Larger PLMs bring larger performance gains, which has led to the transition from PLMs to LLMs by significantly increasing model parameters (tens to hundreds of billions) [10] and training data (many GBs and TBs) [10, 11]. Following this development, numerous LLMs have been proposed in the literature [10, 11, 12, 6, 13, 14, 15]. The increasing trend in the number of released LLMs and the names of a few significant LLMs proposed over the years are shown in Fig. 1 and Fig. 2, respectively.
The early work on LLMs, such as T5 [10] and mT5 [11], employed transfer learning until GPT-3 [6] showed that LLMs are zero-shot transferable to downstream tasks without fine-tuning.
-LLMs accurately respond to task queries when prompted with -task descriptions and examples. However, pre-trained LLMs -fail to follow user intent and perform worse in zero-shot set- -tings than in few-shot. Fine-tuning them with task instruc- -tions data [16, 17, 18, 19] and aligning with human prefer- -ences [20, 21] enhances generalization to unseen tasks, im- -proving zero-shot performance significantly and reducing mis- -aligned behavior. -In addition to better generalization and domain adaptation, -LLMs appear to have emergent abilities, such as reasoning, -planning, decision-making, in-context learning, answering in -zero-shot settings, etc. These abilities are known to be ac- -quired by them due to their gigantic scale even when the pre- -trained LLMs are not trained specifically to possess these at- -tributes [22, 23, 24]. Such abilities have led LLMs to be widely -adopted in diverse settings, including multi-modal, robotics,tool manipulation, question answering, autonomous agents, etc. -Various improvements have also been suggested in these areas -either by task-specific training [25, 26, 27, 28, 29, 30, 31] or -better prompting [32]. -The LLMs abilities to solve diverse tasks with human-level -performance come at the cost of slow training and inference, -extensive hardware requirements, and higher running costs. -Such requirements have limited their adoption and opened up -opportunities to devise better architectures [15, 33, 34, 35] -and training strategies [36, 37, 21, 38, 39, 40, 41]. Param- -eter e fficient tuning [38, 41, 40], pruning [42, 43], quantiza- -tion [44, 45], knowledge distillation, and context length inter- -polation [46, 47, 48, 49] among others are some of the methods -widely studied for e fficient LLM utilization. -Due to the success of LLMs on a wide variety of tasks, the -research literature has recently experienced a large influx of -LLM-related contributions. Researchers have organized the -LLMs literature in surveys [50, 51, 52, 53], and topic-specific -surveys in [54, 55, 56, 57, 58]. In contrast to these surveys, our -contribution focuses on providing a comprehensive yet concise -overview of the general direction of LLM research. This arti- -cle summarizes architectural and training details of pre-trained -LLMs and delves deeper into the details of concepts like fine- -tuning, multi-modal LLMs, augmented LLMs, datasets, eval- -uation, applications, challenges, and others to provide a self- -contained comprehensive overview. Our key contributions are -summarized as follows. -•We present a survey on the developments in LLM research, -providing a concise, comprehensive overview of the direc- -tion. -•We present extensive summaries of pre-trained models that -include fine-grained details of architecture and training de- -tails. -•We summarize major findings of the popular contributions -and provide a detailed discussion on the key design and -development aspects of LLMs to help practitioners e ffec- -tively leverage this technology. -•In this self-contained article, we cover a range of con- -cepts to present the general direction of LLMs compre- -hensively, including background, pre-training, fine-tuning, -2 - ---- Page 3 --- -Figure 3: A broader overview of LLMs, dividing LLMs into seven branches: 1. Pre-Training 2. Fine-Tuning 3. E fficient 4. Inference 5. Evaluation 6. Applications -7. Challenges -multi-modal LLMs, augmented LLMs, LLMs-powered -agents, datasets, evaluation, etc. 
-We loosely follow the existing terminology to ensure a stan- -dardized outlook of this research direction. For instance, fol- -lowing [50], our survey discusses pre-trained LLMs with 10B -parameters or more. We refer the readers interested in smaller -pre-trained models to [51, 52, 53]. -The organization of this paper is as follows. Section 2 discusses -the background of LLMs. Section 3 focuses on LLMs overview, -architectures, training pipelines and strategies, fine-tuning, andutilization in di fferent domains. Section 4 highlights the config- -uration and parameters that play a crucial role in the function- -ing of these models. Summary and discussions are presented -in section 3.8. The LLM training and evaluation, datasets, and -benchmarks are discussed in section 5, followed by challenges -and future directions, and conclusion in sections 7 and 8, re- -spectively. -3 - ---- Page 4 --- -2. Background -We provide the relevant background to understand the fun- -damentals related to LLMs in this section. We briefly discuss -necessary components in LLMs and refer the readers interested -in details to the original works. -2.1. Tokenization -Tokenization [59] is an essential pre-processing step in -LLM training that parses the text into non-decomposing units -called tokens. Tokens can be characters, subwords [60], sym- -bols [61], or words, depending on the tokenization process. -Some of the commonly used tokenization schemes in LLMs -include wordpiece [62], byte pair encoding (BPE) [61], and un- -igramLM [60]. Readers are encouraged to refer to [63] for a -detailed survey. -2.2. Encoding Positions -The transformer processes input sequences in parallel and -independently of each other. Moreover, the attention mod- -ule in the transformer does not capture positional information. -As a result, positional encodings were introduced in trans- -former [64], where a positional embedding vector is added to -the token embedding. Variants of positional embedding include -absolute, relative, or learned positional encodings. Within rel- -ative encoding, Alibi and RoPE are two widely used positional -embeddings in LLMs. -Alibi [65]: It subtracts a scalar bias from the attention score -that increases with the distance between token positions. This -favors using recent tokens for attention. -RoPE [66]: It rotates query and key representations at an an- -gle proportional to the token absolute position in the input -sequence, resulting in a relative positional encoding scheme -which decays with the distance between the tokens. -2.3. Attention in LLMs -Attention assigns weights to input tokens based on impor- -tance so that the model gives more emphasis to relevant tokens. -Attention in transformers [64] calculates query, key, and value -mappings for input sequences, where the attention score is -obtained by multiplying the query and key, and later used to -weight values. We discuss di fferent attention strategies used in -LLMs below. -Self-Attention [64]: Calculates attention using queries, keys, -and values from the same block (encoder or decoder). -Cross Attention: It is used in encoder-decoder architectures, -where encoder outputs are the queries, and key-value pairs -come from the decoder. -Sparse Attention [67]: Self-attention has O(n2) time complex- -ity which becomes infeasible for large sequences. To speed -up the computation, sparse attention [67] iteratively calculates -attention in sliding windows for speed gains. 
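To make the sparse (sliding-window) idea concrete, the sketch below masks attention scores so that each token attends only to itself and a fixed number of recent tokens. This is our illustrative PyTorch code, not taken from the surveyed papers; the window size and function names are arbitrary.

import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # mask[i, j] is True when token i may attend to token j:
    # j must not lie in the future (j <= i) and must fall within the window.
    idx = torch.arange(seq_len)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)  # rel[i, j] = j - i
    return (rel <= 0) & (rel > -window)

def local_attention(q, k, v, window: int):
    # q, k, v: (batch, seq_len, d_model). Scores outside the window are set to
    # -inf before the softmax, so each token attends to at most `window` tokens.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    mask = sliding_window_mask(q.size(-2), window).to(scores.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example: 8 tokens, each attending to itself and the 3 preceding tokens.
q = k = v = torch.randn(1, 8, 16)
out = local_attention(q, k, v, window=4)

In a practical sparse-attention implementation, only the in-window scores are computed, which reduces the quadratic cost to roughly linear in sequence length for a fixed window; the sketch above keeps the dense score matrix only for clarity.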
Flash Attention [68]: Memory access is the major bottleneck in calculating attention on GPUs. To speed it up, FlashAttention employs input tiling to minimize the memory reads and writes between the GPU high-bandwidth memory (HBM) and the on-chip SRAM.

2.4. Activation Functions
Activation functions serve a crucial role in the curve-fitting abilities of neural networks [69]. We discuss the activation functions used in LLMs in this section.
ReLU [70]: The Rectified Linear Unit (ReLU) is defined as:
ReLU(x) = max(0, x)    (1)
GeLU [71]: The Gaussian Error Linear Unit (GeLU) is the combination of ReLU, dropout [72], and zoneout [73].
GLU variants [74]: The Gated Linear Unit [75] is a neural network layer that computes an element-wise product (⊗) of a linear transformation and a sigmoid-transformed (σ) linear projection of the input, given as:
GLU(x, W, V, b, c) = (xW + b) ⊗ σ(xV + c),    (2)
where x is the input of the layer and W, V, b, and c are learned parameters. Other GLU variants [74] used in LLMs are:
ReGLU(x, W, V, b, c) = max(0, xW + b) ⊗ (xV + c),
GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c),
SwiGLU(x, W, V, b, c, β) = Swish_β(xW + b) ⊗ (xV + c).

2.5. Layer Normalization
Layer normalization leads to faster convergence and is an integrated component of transformers [64]. In addition to LayerNorm [76] and RMSNorm [77], LLMs use pre-layer normalization [78], applying it before multi-head attention (MHA). Pre-norm is shown to provide training stability in LLMs. Another normalization variant, DeepNorm [79], fixes the issue of larger gradients in pre-norm.

2.6. Distributed LLM Training
This section briefly describes distributed LLM training approaches. More details are available in [13, 37, 80, 81].
Data Parallelism: Data parallelism replicates the model on multiple devices, where data in a batch is divided across devices. At the end of each training iteration, weights are synchronized across all devices.
Tensor Parallelism: Tensor parallelism shards a tensor computation across devices. It is also known as horizontal parallelism or intra-layer model parallelism.
Pipeline Parallelism: Pipeline parallelism shards model layers across different devices. This is also known as vertical parallelism.
Model Parallelism: A combination of tensor and pipeline parallelism is known as model parallelism.
3D Parallelism: A combination of data, tensor, and model parallelism is known as 3D parallelism.
Optimizer Parallelism: Optimizer parallelism, also known as the zero redundancy optimizer [37], implements optimizer state partitioning, gradient partitioning, and parameter partitioning across devices to reduce memory consumption while keeping communication costs as low as possible.

2.7. Libraries
Some commonly used libraries for LLM training are:
Transformers [82]: The library provides access to various pre-trained transformer models with APIs to train, fine-tune, infer, and develop custom models.
DeepSpeed [36]: A library for scalable distributed training and inference of deep learning models.
Megatron-LM [80]: It provides GPU-optimized techniques for large-scale training of LLMs.
JAX [83]: A Python library for high-performance numerical computing and scalable machine learning. It can differentiate native Python and NumPy functions and execute them on GPUs.
Colossal-AI [84]: A collection of components to write distributed deep learning models.
BMTrain [81]: A library to write efficient stand-alone LLM training code.
-FastMoE [85]: Provides API to build mixture-of-experts -(MoE) model in PyTorch. -MindSpore [86]: A deep learning training and inference frame- -work extendable to mobile, edge, and cloud computing. -PyTorch [87]: A framework developed by Facebook AI Re- -search lab (FAIR) to build deep learning models. The main -features of PyTorch include a dynamic computation graph and -a pythonic coding style. -Tensorflow [88]: A deep learning framework written by -Google. The key features of TensorFlow are graph-based com- -putation, eager execution, scalability, etc. -MXNet [89]: Apache MXNet is a deep learning framework -with support to write programs in multiple languages, includ- -ing, Python, C ++, Scala, R, etc. It also provides support for -dynamic and static computation graphs. -2.8. Data PreProcessing -This section briefly summarizes data preprocessing tech- -niques used in LLMs training. -Quality Filtering: For better results, training data quality is -essential. Some approaches to filtering data are: 1) classifier- -based and 2) heuristics-based. Classifier-based approaches -train a classifier on high-quality data and predict the quality of -text for filtering, whereas heuristics-based employ some rules -for filtering like language, metrics, statistics, and keywords. -Data Deduplication: Duplicated data can a ffect model per- -formance and increase data memorization; therefore, to train -LLMs, data deduplication is one of the preprocessing steps. -This can be performed at multiple levels, like sentences, -documents, and datasets. -Privacy Reduction: Most of the training data for LLMs is -collected through web sources. This data contains private -information; therefore, many LLMs employ heuristics-based -methods to filter information such as names, addresses, and -phone numbers to avoid learning personal information. -2.9. Architectures -Here we discuss the variants of the transformer architectures -used in LLMs. The di fference arises due to the application of -Figure 4: An example of attention patterns in language models, image is taken -from [93]. -Figure 5: An example of language model training objectives, image from [93]. -the attention and the connection of transformer blocks. An il- -lustration of attention patterns of these architectures is shown -in Figure 4. -Encoder Decoder: This architecture processes inputs through -the encoder and passes the intermediate representation to the -decoder to generate the output. Here, the encoder sees the -complete sequence utilizing self-attention whereas the decoder -processes the sequence one after the other with implementing -cross-attention. -Causal Decoder: A type of architecture that does not have an -encoder and processes and generates output using a decoder, -where the predicted token depends only on the previous time -steps. -Prefix Decoder: It is also known as a non-causal decoder, -where the attention calculation is not strictly dependent on the -past information and the attention is bidirectional. An example -of a non-causal attention mask is shown in Figure 4. -Mixture-of-Experts: It is a variant of transformer architecture -with parallel independent experts and a router to route tokens -to experts. These experts are feed-forward layers after the at- -tention block [90]. Mixture-of-Experts (MoE) is an e fficient -sparse architecture that o ffers comparable performance to dense -models and allows increasing the model size without increas- -ing the computational cost by activating only a few experts at a -time [91, 92]. -2.10. 
Pre-Training Objectives -This section describes LLMs pre-training objectives. For -more details see the paper [93]. -Full Language Modeling: An autoregressive language model- -ing objective where the model is asked to predict future tokens -given the previous tokens, an example is shown in Figure 5. -Prefix Language Modeling: A non-causal training objective, -where a prefix is chosen randomly and only remaining target -tokens are used to calculate the loss. An example is shown in -Figure 5. -5 - ---- Page 6 --- -Figure 6: A basic flow diagram depicting various stages of LLMs from pre-training to prompting /utilization. Prompting LLMs to generate responses is possible at -different training stages like pre-training, instruction-tuning, or alignment tuning. “RL” stands for reinforcement learning, “RM” represents reward-modeling, and -“RLHF” represents reinforcement learning with human feedback. -Masked Language Modeling: In this training objective, tokens -or spans (a sequence of tokens) are masked randomly and the -model is asked to predict masked tokens given the past and -future context. An example is shown in Figure 5. -Unified Language Modeling: Unified language modeling [94] -is a combination of causal, non-causal, and masked language -training objectives. Here in masked language modeling, the -attention is not bidirectional but unidirectional, attending either -left-to-right or right-to-left context. -2.11. LLMs Scaling Laws -Scaling laws study the optimal combination of model param- -eters, dataset size, and computational resources that predict the -improvement in the model performance. It has been shown -that the loss scales according to the power-law with model size, -dataset size, and compute resources [95]. This study suggests -larger models are more important than big data for better perfor- -mance. Another variant of scaling law [96] suggests the model -size and the number of training tokens should be scaled equally.2.12. LLMs Adaptation Stages -This section discusses the fundamentals of LLMs adaptation -stages, from pre-training to fine-tuning for downstream tasks -and utilization. An example of di fferent training stages and in- -ference in LLMs is shown in Figure 6. In this paper, we refer -to alignment-tuning as aligning with human preferences, while -occasionally the literature uses the term alignment for di fferent -purposes. -2.12.1. Pre-Training -In the very first stage, the model is trained in a self- -supervised manner on a large corpus to predict the next to- -kens given the input. The design choices of LLMs vary from -encoder-decoder to decoder-only architectures with di fferent -building blocks and loss functions in sections 2.5, 2.4, 2.10. -2.12.2. Fine-Tuning -There are di fferent styles to fine-tune an LLM. This section -briefly discusses fine-tuning approaches. -Transfer Learning: The pre-trained LLMs perform well for -various tasks [6, 15]. However, to improve the performance for -6 - ---- Page 7 --- -a downstream task, pre-trained models are fine-tuned with the -task-specific data [10, 11], known as transfer learning. -Instruction-tuning: To enable a model to respond to user -queries e ffectively, the pre-trained model is fine-tuned on in- -struction formatted data i.e., instruction and an input-output -pair. Instructions generally comprise multi-task data in plain -natural language, guiding the model to respond according to the -prompt and the input. This type of fine-tuning improves zero- -shot generalization and downstream task performance. 
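To make this concrete, a minimal, hypothetical instruction-formatted sample is sketched below in Python. The field names and prompt template are illustrative only; actual instruction datasets use a variety of layouts.

# A hypothetical instruction-formatted training sample. The keys and the
# prompt template are illustrative; instruction datasets use many different
# layouts (e.g., Alpaca-style "instruction/input/output" records).
sample = {
    "instruction": "Summarize the following passage in one sentence.",
    "input": "Large language models have recently shown strong performance ...",
    "output": "The passage reports recent progress in large language models.",
}

# During instruction fine-tuning, the fields are typically concatenated into a
# single prompt, and the loss is computed only on the response tokens.
prompt = (
    "### Instruction:\n" + sample["instruction"] + "\n\n"
    "### Input:\n" + sample["input"] + "\n\n"
    "### Response:\n"
)
target = sample["output"]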
Details -on formatting instruction data and its various styles are avail- -able in [16, 50, 97]. -Alignment-tuning: LLMs are prone to generating false, biased, -and harmful text. To make them helpful, honest, and harmless, -models are aligned using human feedback. Alignment involves -asking LLMs to generate unexpected responses and then updat- -ing their parameters to avoid such responses [20, 21, 98]. -It ensures LLMs operate according to human intentions and -values. A model is defined to be an “aligned” model if the -model fulfills three criteria of helpful, honest, and harmless or -“HHH” [99]. -Researchers employ reinforcement learning with human feed- -back (RLHF) [100] for model alignment. In RLHF, a fine-tuned -model on demonstrations is further trained with reward model- -ing (RM) and reinforcement learning (RL), shown in Figure 6. -Below we briefly discuss RM and RL pipelines in RLHF. -Reward modeling: trains a model to rank generated responses -according to human preferences using a classification objec- -tive. To train the classifier humans annotate LLMs generated -responses based on the HHH criteria. -Reinforcement learning: in combination with the reward model -is used for alignment in the next stage. The previously trained -reward model ranks LLM-generated responses into preferred -vs. non-preferred, which is used to align the model with proxi- -mal policy optimization (PPO). This process repeats iteratively -until convergence. -2.12.3. Prompting /Utilization -Prompting is a method to query trained LLMs for generating -responses, as illustrated in Figure 6. LLMs can be prompted in -various prompt setups, where they can be adapted to the instruc- -tions without fine-tuning and in other cases with fine-tuning on -data containing di fferent prompt styles [16, 101, 102]. A good -guide on prompt engineering is available at [32]. Below, we -will discuss various widely used prompt setups. -Zero-Shot Prompting: LLMs are zero-shot learners and ca- -pable of answering queries never seen before. This style of -prompting requires LLMs to answer user questions without see- -ing any examples in the prompt. -In-context Learning: Also known as few-shot learning, here, -multiple input-output demonstration pairs are shown to the -model to generate the desired response. This adaptation style -is also called few-shot learning. A discussion on formatting in- -context learning (ICL) templates is available in [54, 50, 18, 16]. -Reasoning in LLMs: LLMs are zero-shot reasoners and can -be provoked to generate answers to logical problems, task -planning, critical thinking, etc. with reasoning. Generating -reasons is possible only by using di fferent prompting styles,whereas to improve LLMs further on reasoning tasks many -methods [16, 97] train them on reasoning datasets. We discuss -various prompting techniques for reasoning below. -Chain-of-Thought (CoT): A special case of prompting where -demonstrations contain reasoning information aggregated with -inputs and outputs so that the model generates outcomes with -step-by-step reasoning. More details on CoT prompts are avail- -able in [55, 103, 101]. -Self-Consistency: Improves CoT performance by generat- -ing multiple responses and selecting the most frequent an- -swer [104]. -Tree-of-Thought (ToT): Explores multiple reasoning paths -with possibilities to look ahead and backtrack for problem- -solving [105]. -Single-Turn Instructions: In this prompting setup, LLMs are -queried only once with all the relevant information in the -prompt. 
LLMs generate responses by understanding the context either in a zero-shot or few-shot setting.
Multi-Turn Instructions: Solving a complex task requires multiple interactions with LLMs, where feedback and responses from other tools are given as input to the LLM for the next rounds. This style of using LLMs in the loop is common in autonomous agents.

3. Large Language Models
This section reviews LLMs, briefly describing their architectures, training objectives, pipelines, datasets, and fine-tuning details.

3.1. Pre-Trained LLMs
Here, we provide summaries of various well-known pre-trained LLMs with significant discoveries that changed the course of research and development in NLP. These LLMs have considerably improved the performance in NLU and NLG domains and are widely fine-tuned for downstream tasks. Moreover, we also identify key findings and insights of pre-trained LLMs in Tables 1 and 2 that improve their performance.

3.1.1. General Purpose
T5 [10]: An encoder-decoder model employing unified text-to-text training for all NLP problems, as shown in Figure 7. T5 places layer normalization outside the residual path in a conventional transformer model [64]. It uses masked language modeling as a pre-training objective, where spans (consecutive tokens) are replaced with a single mask instead of separate masks for each token. This type of masking speeds up training as it produces shorter sequences. After pre-training, the model is fine-tuned using adapter layers [106] for downstream tasks.
GPT-3 [6]: The GPT-3 architecture is the same as GPT-2 [5] but with dense and sparse attention in transformer layers, similar to the Sparse Transformer [67]. It shows that large models can be trained on larger batch sizes with a lower learning rate; to decide the batch size during training, GPT-3 uses the gradient noise scale as in [107]. Overall, GPT-3 increases the model parameters to 175B, showing that the performance of large language models improves with scale and is competitive with fine-tuned models.

Figure 7: Unified text-to-text training example, source image from [10].
Figure 8: An example of the PanGu-α architecture; image adapted from [108].

mT5 [11]: A multilingual T5 model [10] trained on the mC4 dataset covering 101 languages. The dataset is extracted from the public Common Crawl scrape. The model uses a larger vocabulary size of 250,000 to cover multiple languages. To avoid over-fitting or under-fitting for a language, mT5 employs a data sampling procedure to select samples from all languages. The paper suggests using a small amount of pre-training data, including all languages, when fine-tuning for a task using English language data. This allows the model to generate correct non-English outputs.
PanGu-α [108]: An autoregressive model that has a query layer at the end of the standard transformer layers (an example is shown in Figure 8) to predict the next token. Its structure is similar to the transformer layer but with an additional embedding for the next position in the attention mechanism, given in Eq. 3:
a = p_n W_h^q (W_h^k)^T H_L^T    (3)
CPM-2 [12]: Cost-efficient Pre-trained language Models (CPM-2) pre-trains bilingual (English and Chinese) 11B and 198B mixture-of-experts (MoE) models on the WuDaoCorpus [109] dataset. The tokenization process removes the "_" whitespace tokens of the SentencePiece tokenizer.
The models are -trained with knowledge inheritance, starting with only the Chi- -nese language in the first stage and then adding English and -Chinese data. This trained model gets duplicated multiple times -to initialize the 198B MoE model. Moreover, to use the model -for downstream tasks, CPM-2 experimented with both com-plete fine-tuning and prompt fine-tuning as in [40] where only -prompt-related parameters are updated by inserting prompts at -various positions, front, middle, and back. CPM-2 also pro- -poses the INFMOE, a memory-e fficient framework with a strat- -egy to dynamically o ffload parameters to the CPU for inference -at a 100B scale. It overlaps data movement with inference com- -putation for lower inference time. -ERNIE 3.0 [110]: ERNIE 3.0 takes inspiration from multi- -task learning to build a modular architecture using Transformer- -XL [111] as the backbone. The universal representation mod- -ule is shared by all the tasks, which serve as the basic block -for task-specific representation modules, which are all trained -jointly for natural language understanding, natural language -generation, and knowledge extraction. This LLM is primar- -ily focused on the Chinese language. It claims to train on the -largest Chinese text corpora for LLM training, and achieved -state-of-the-art in 54 Chinese NLP tasks. -Jurassic-1 [112]: A pair of auto-regressive language mod- -els, including a 7B-parameter J1-Large model and a 178B- -parameter J1-Jumbo model. The training vocabulary of -Jurassic-1 comprise word pieces, complete words, and multi- -word expressions without any word boundaries, where possible -out-of-vocabulary instances are interpreted as Unicode bytes. -Compared to the GPT-3 counterparts, the Jurassic-1 models -apply a more balanced depth-to-width self-attention architec- -ture [113] and an improved tokenizer for a faster prediction -based on broader resources, achieving a comparable perfor- -mance in zero-shot learning tasks and a superior performance in -few-shot learning tasks given the ability to feed more examples -as a prompt. -HyperCLOVA [114]: A Korean language model with GPT-3 -architecture. -Yuan 1.0 [115]: Trained on a Chinese corpus with 5TB of -high-quality text collected from the Internet. A Massive Data -Filtering System (MDFS) built on Spark is developed to pro- -cess the raw data via coarse and fine filtering techniques. To -speed up the training of Yuan 1.0 to save energy expenses and -carbon emissions, various factors that improve the performance -of distributed training are incorporated in architecture and train- -ing: like increasing the hidden state size improves pipeline and -tensor parallelism performance, larger micro batches improve -pipeline parallelism performance, and larger global batch size -improve data parallelism performance. In practice, the Yuan 1.0 -model performs well on text classification, Winograd Schema, -natural language inference, and reading comprehension tasks. -Gopher [116]: The Gopher family of models ranges from -44M to 280B parameters in size to study the e ffect of scale -on the LLMs performance. The 280B model beats GPT-3 [6], -Jurrasic-1 [112], MT-NLG [117], and others on 81% of the -evaluated tasks. -ERNIE 3.0 TITAN [35]: ERNIE 3.0 Titan extends ERNIE 3.0 -by training a larger model with 26x the number of parameters -of the latter. This bigger model outperformed other state-of-the- -art models in 68 NLP tasks. LLMs produce text with incorrect -facts. 
In order to have control of the generated text with fac- -tual consistency, ERNIE 3.0 Titan adds another task, Credible -and Controllable Generations , to its multi-task learning setup. -8 - ---- Page 9 --- -It introduces additional self-supervised adversarial and control- -lable language modeling losses to the pre-training step, which -enables ERNIE 3.0 Titan to beat other LLMs in their manually -selected Factual QA task set evaluations. -GPT-NeoX-20B [118]: An auto-regressive model that largely -follows GPT-3 with a few deviations in architecture design, -trained on the Pile dataset without any data deduplication. GPT- -NeoX has parallel attention and feed-forward layers in a trans- -former block, given in Eq. 4, that increases throughput by 15%. -It uses rotary positional embedding [66], applying it to only -25% of embedding vector dimension as in [119]. This reduces -the computation without performance degradation. As opposed -to GPT-3, which uses dense and sparse layers, GPT-NeoX-20B -uses only dense layers. The hyperparameter tuning at this scale -is difficult; therefore, the model chooses hyperparameters from -the method [6] and interpolates values between 13B and 175B -models for the 20B model. The model training is distributed -among GPUs using both tensor and pipeline parallelism. -x+Attn(LN1(x))+FF(LN2(x)) (4) -OPT [14]: It is a clone of GPT-3, developed to open-source -a model that replicates GPT-3 performance. Training of OPT -employs dynamic loss scaling [120] and restarts from an earlier -checkpoint with a lower learning rate whenever loss divergence -is observed. Overall, the performance of OPT-175B models is -comparable to the GPT3-175B model. -BLOOM [13]: A causal decoder model trained on the ROOTS -corpus to open-source an LLM. The architecture of BLOOM is -shown in Figure 9, with di fferences like ALiBi positional em- -bedding, an additional normalization layer after the embedding -layer as suggested by the bitsandbytes1library. These changes -stabilize training with improved downstream performance. -GLaM [91]: Generalist Language Model (GLaM) represents a -family of language models using a sparsely activated decoder- -only mixture-of-experts (MoE) structure [121, 90]. To gain -more model capacity while reducing computation, the experts -are sparsely activated where only the best two experts are used -to process each input token. The largest GLaM model, GLaM -(64B/64E), is about 7×larger than GPT-3 [6], while only part of -the parameters are activated per input token. The largest GLaM -(64B/64E) model achieves better overall results as compared -to GPT-3 while consuming only one-third of GPT-3’s training -energy. -MT-NLG [117]: A 530B causal decoder based on the GPT- -2 architecture that has roughly 3 ×GPT-3 model parameters. -MT-NLG is trained on filtered high-quality data collected from -various public datasets and blends various types of datasets in a -single batch, which beats GPT-3 on several evaluations. -Chinchilla [96]: A causal decoder trained on the same dataset -as the Gopher [116] but with a little di fferent data sampling -distribution (sampled from MassiveText). The model architec- -ture is similar to the one used for Gopher, with the exception of -AdamW optimizer instead of Adam. Chinchilla identifies the -1https: //github.com /TimDettmers /bitsandbytes -Figure 9: The BLOOM architecture example sourced from [13]. -relationship that model size should be doubled for every dou- -bling of training tokens. 
Over 400 language models ranging -from 70 million to over 16 billion parameters on 5 to 500 bil- -lion tokens are trained to get the estimates for compute-optimal -training under a given budget. The authors train a 70B model -with the same compute budget as Gopher (280B) but with 4 -times more data. It outperforms Gopher [116], GPT-3 [6], and -others on various downstream tasks, after fine-tuning. -AlexaTM [122]: An encoder-decoder model, where encoder -weights and decoder embeddings are initialized with a pre- -trained encoder to speed up training. The encoder stays frozen -for the initial 100k steps and is later unfrozen for end-to-end -training. The model is trained on a combination of denoising -and causal language modeling (CLM) objectives, concatenat- -ing a [ CLM ] token at the beginning for mode switching. Dur- -ing training, the CLM task is applied for 20% of the time, which -improves the in-context learning performance. -PaLM [15]: A causal decoder with parallel attention and -feed-forward layers similar to Eq. 4, speeding up training by -a factor of 15. Additional changes to the conventional trans- -former model include SwiGLU activation, RoPE embeddings, -multi-query attention that saves computation cost during decod- -ing, and shared input-output embeddings. During training, loss -spiking was observed, and to fix it, model training was restarted -from a 100-step earlier checkpoint by skipping 200-500 batches -around the spike. Moreover, the model was found to memo- -rize around 2.4% of the training data at the 540B model scale, -whereas this number was lower for smaller models. -PaLM-2 [123]: A smaller multi-lingual variant of PaLM, -trained for larger iterations on a better quality dataset. PaLM- -2 shows significant improvements over PaLM, while reducing -training and inference costs due to its smaller size. To lessen -toxicity and memorization, it appends special tokens with a -fraction of pre-training data, which shows a reduction in gener- -ating harmful responses. -U-PaLM [124]: This method trains PaLM for 0.1% addi- -tional compute with the UL2 (also named as UL2Restore) ob- -jective [125], using the same dataset it outperforms the baseline -significantly on various NLP tasks, including zero-shot, few- -shot, commonsense reasoning, CoT, etc. Training with UL2R -involves converting a causal decoder PaLM to a non-causal de- -coder PaLM and employing 50% sequential denoising, 25% -regular denoising, and 25% extreme denoising loss functions. -9 - ---- Page 10 --- -UL2 [125]: An encoder-decoder architecture trained using a -mixture of denoisers (MoD) objective. Denoisers include 1) -R-Denoiser: a regular span masking, 2) S-Denoiser: which cor- -rupts consecutive tokens of a large sequence and 3) X-Denoiser: -which corrupts a large number of tokens randomly. During pre- -training, UL2 includes a denoiser token from R,S,Xto rep- -resent a denoising setup. It helps improve fine-tuning perfor- -mance for downstream tasks that bind the task to one of the up- -stream training modes. This MoD style of training outperforms -the T5 model on many benchmarks. -GLM-130B [33]: GLM-130B is a bilingual (English and Chi- -nese) model trained using an auto-regressive mask infilling pre- -training objective similar to the GLM [126]. This training style -makes the model bidirectional as compared to GPT-3, which is -unidirectional. 
As opposed to GLM, the training of GLM-130B -includes a small amount of multi-task instruction pre-training -data (5% of the total data) along with self-supervised mask in- -filling. To stabilize the training, it applies embedding layer gra- -dient shrink. -LLaMA [127, 21]: A set of decoder-only language models -varying from 7B to 70B parameters. LLaMA models series is -the most famous among the community for parameter e fficiency -and instruction tuning. -LLaMA-1 [127]: Implements e fficient causal attention [128] -by not storing and computing masked attention weights and -key/query scores. Another optimization is reducing the number -of activations recomputed in the backward pass, as in [129]. -LLaMA-2 [21]: This work is more focused on fine-tuning a -safer and better LLaMA-2-Chat model for dialogue generation. -The pre-trained model has 40% more training data with a larger -context length and grouped-query attention. -LLaMA-3 /3.1[130]: A collection of models trained on a -seven times larger dataset as compared to LLaMA-2 with dou- -ble the context length, outperforming its previous variants and -other models. -PanGu- Σ[92]: An autoregressive model with parameters -copied from PanGu- αand extended to a trillion scale with Ran- -dom Routed Experts (RRE), the architectural diagram is shown -in Figure 10. RRE is similar to the MoE architecture, with -distinctions at the second level, where tokens are randomly -routed to experts in a domain instead of using a learnable gat- -ing method. The model has bottom layers densely activated and -shared across all domains, whereas top layers are sparsely ac- -tivated according to the domain. This training style allows for -extracting task-specific models and reduces catastrophic forget- -ting e ffects in the case of continual learning. -Mixtral8x22b [131]: A mixture-of-experts (MoE) model with -eight distinct experts routes each token to two experts at each -layer and combines the outputs additively. -Snowflake Arctic [132]: Arctic LLM is a hybrid of dense and -mixture-of-experts (MoE) architecture. The MoE (128 ×3.66B -MLP experts) is parallel to the dense transformer (10B) with -only two experts activated. The model has many experts, com- -pared to other MoE LLMs [131, 133], to increase the model -capacity and provide an opportunity to choose among many ex- -perts for a diverse configuration. The model has 480B param- -eters, and only 17B are active during a forward pass, reducingthe computation significantly. -Grok [133, 134]: Grok is a family of LLMs including Grok-1 -and Grok-1.5, released by XAI. -Grok-1 [133]: Grok-1 is a 314B parameters language MoE -model (eight experts), where two experts are activated per to- -ken. -Grok-1.5 [134]: Grok-1.5 is a multi-modal LLM with a larger -context length and improved performance. -Gemini [135, 136]: Gemini replaces Bard (based on PaLM) -with multi-modal capabilities and significant language model- -ing performance improvements. -Gemini-1 [135]: The first-ever auto-regressive model to -achieve human-level capabilities on the MMLU benchmark. -Gemini-1.5 [136]: A multi-modal LLM with MoE architec- -ture builds on the findings of Gemini-1. The model has a 2M -context window and can reason over information up to 10M -tokens. Such large context windows were never achieved pre- -viously and shown to have a huge impact on performance gain. -Nemotron-4 340B [137]: A decoder-only model that has been -aligned on 98% synthetic data and only 2% manually annotated -data. 
Utilizing synthetic data in a large proportion improves the model performance significantly. The paper suggests introducing alignment data together with a smaller subset of previously seen data during the late stage of model pre-training, enabling a smooth transition from the pre-training stage to the final training stage. To train better instruction-following models, weaker models are trained into stronger models iteratively: the synthetic data generated by a weaker instruction-tuned model is used to train a base model, which is then supervised fine-tuned and outperforms the weaker model.
DeepSeek [138]: DeepSeek studies the LLM scaling laws in detail to determine the optimal non-embedding model size and training data. The experiments were performed for 8 compute budgets ranging from 1e17 to 3e20 training FLOPs. Each compute budget was tested against ten different model/data scales. The batch size and learning rate were also fitted for the given compute budget, finding that the batch size should increase and the learning rate should decrease as the compute budget grows. The following are the equations for the optimal batch size (B), learning rate (η), model size (M), and data (D) as functions of the compute budget C:
B_opt = 0.2920 · C^0.3271
η_opt = 0.3118 · C^(−0.1250)
M_opt = M_base · C^a
D_opt = D_base · C^b
M_base = 0.1715, D_base = 5.8316, a = 0.5243, b = 0.4757    (5)
DeepSeek-v2 [139]: An MoE model that introduces multi-head latent attention (MLA) to reduce inference costs by compressing the Key-Value (KV) cache into a latent vector. MLA achieves better performance than multi-head attention (MHA) and other efficient attention mechanisms such as grouped-query attention (GQA), multi-query attention (MQA), etc. Because of MLA, DeepSeek-v2 achieves 5.76 times faster inference throughput as compared to DeepSeek [138].

3.1.2. Coding
CodeGen [140]: CodeGen has a similar architecture to PaLM [15], i.e., parallel attention, MLP layers, and RoPE embeddings. The model is trained sequentially on both natural language and programming language data (trained on the first dataset, then the second, and so on) using the following datasets: 1) PILE, 2) BIGQUERY, and 3) BIGPYTHON. CodeGen proposes a multi-step approach to synthesizing code. The purpose is to simplify the generation of long sequences, where the previous prompt and generated code are given as input with the next prompt to generate the next code sequence. CodeGen open-sources a Multi-Turn Programming Benchmark (MTPB) to evaluate multi-step program synthesis.
Codex [141]: This LLM is trained on a subset of public Python GitHub repositories to generate code from docstrings. Computer programming is an iterative process where programs are often debugged and updated before fulfilling the requirements. Similarly, Codex generates 100 versions of a program by repetitive sampling for a given description, which produces a working solution that passes the unit tests for 77.5% of the problems. Its powerful version powers GitHub Copilot.
AlphaCode [142]: A set of large language models, ranging from 300M to 41B parameters, designed for competition-level code generation tasks. It uses multi-query attention [143] to reduce memory and cache costs.
Since competitive program- -ming problems highly require deep reasoning and an under- -standing of complex natural language algorithms, the Alpha- -Code models are pre-trained on filtered GitHub code in popular -languages and then fine-tuned on a new competitive program- -ming dataset named CodeContests. The CodeContests dataset -mainly contains problems, solutions, and test cases collected -from the Codeforces platform3. The pre-training employs stan- -dard language modeling objectives, while GOLD [144] with -tempering [145] serves as the training objective for the fine- -tuning on CodeContests data. To evaluate the performance of -AlphaCode, simulated programming competitions are hosted -on the Codeforces platform: overall, AlphaCode ranks at the -top 54.3% among over 5000 competitors, where its Codeforces -rating is within the top 28% of recently participated users. -CodeT5 +[34]: CodeT5 +is based on CodeT5 [146], with -shallow encoder and deep decoder, trained in multiple stages -initially unimodal data (code) and later bimodal data (text-code -pairs). Each training stage has di fferent training objectives and -activates di fferent model blocks encoder, decoder, or both ac- -cording to the task. The unimodal pre-training includes span -denoising and CLM objectives, whereas bimodal pre-training -objectives contain contrastive learning, matching, and CLM for -text-code pairs. CodeT5 +adds special tokens with the text to -enable task modes, for example, [ CLS ] for contrastive loss, -[Match ] for text-code matching, etc. -StarCoder [147]: A decoder-only model with the SantaCoder -architecture, employing Flash attention to scale up the context -length to 8k. The StarCoder trains an encoder to filter names, -2https: //github.com /features /copilot -3https: //codeforces.com /emails, and other personal data from the training data. Its fine- -tuned variant outperforms PaLM, LLaMA, and LAMDA on -HumanEval and MBPP benchmarks. -3.1.3. Scientific Knowledge -Galactica [148]: A large curated corpus of human scientific -knowledge with 48 million papers, textbooks, lecture notes, -millions of compounds and proteins, scientific websites, en- -cyclopedias, and more are trained using the metaseq library3, -which is built on PyTorch and fairscale [149]. The model wraps -reasoning datasets with the token to provide step-by- -step reasoning context to the model, which has been shown to -improve the performance on reasoning tasks. -3.1.4. Dialog -LaMDA [150]: A decoder-only model pre-trained on pub- -lic dialog data, public dialog utterances, and public web doc- -uments, where more than 90% of the pre-training data is in -English. LaMDA is trained with the objective of producing re- -sponses that exhibit high levels of quality, safety, and grounded- -ness. To achieve this, discriminative and generative fine-tuning -techniques are incorporated to enhance the model’s safety and -quality aspects. As a result, the LaMDA models can be utilized -as a general language model performing various tasks. -3.1.5. Finance -BloombergGPT [151]: A non-causal decoder model trained -using both financial (“FINPILE” from the Bloomberg archive) -and general-purpose datasets. The model’s architecture is sim- -ilar to the BLOOM [13] and OPT [14]. It allocates 50B param- -eters to di fferent blocks of the model using the approach [113]. 
-For e ffective training, BloombergGPT packs documents to- -gether with <|endo f text|>to use the maximum sequence -length, uses warmup batch size starting from 1024 to 2048, and -manually reduces the learning rate multiple times during the -training. -Xuan Yuan 2.0 [152]: A Chinese financial chat model with -BLOOM’s [13] architecture trained on a combination of general -purpose, financial, general purpose instructions, and financial -institutions datasets. Xuan Yuan 2.0 combined the pre-training -and fine-tuning stages to avoid catastrophic forgetting. -3.2. Fine-Tuned LLMs -Pre-trained LLMs have excellent generalization abilities to -unseen tasks. However, because they are generally trained with -the objective of next token prediction, LLMs have limited ca- -pacity to follow user intent and are prone to generate unethical, -toxic or inaccurate responses [20]. For their e ffective utiliza- -tion, LLMs are fine-tuned to follow instructions [16, 17, 97] and -generate safe responses [20], which also results in increasing -zero-shot, few-shot, and cross-task generalization [97, 16, 18], -with minimal compute increment, e.g., 0.2% of the total pre- -training for PaLM 540B [16]. -We review various fine-tuned LLMs and strategies for e ffective -fine-tuning in this section. -11 - ---- Page 12 --- -Table 1: Noteworthy findings and insights of pre-trained Large Language Models. -Models Findings & Insights -T5•Encoder and decoder with shared parameters perform equivalently when parameters are not shared -•Fine-tuning model layers (adapter layers) work better than the conventional way of training on only -classification layers -GPT-3•Few-shot performance of LLMs is better than the zero-shot, suggesting that LLMs are meta- -learners -mT5•Large multi-lingual models perform equivalently to single language models on downstream tasks. -However, smaller multi-lingual models perform worse -PanGu-α •LLMs have good few shot capabilities -CPM-2•Prompt fine-tuning requires updating very few parameters while achieving performance compara- -ble to full model fine-tuning -•Prompt fine-tuning takes more time to converge as compared to full model fine-tuning -•Inserting prompt tokens in-between sentences can allow the model to understand relations between -sentences and long sequences -•In an analysis, CPM-2 finds that prompts work as a provider (additional context) and aggregator -(aggregate information with the input text) for the model -ERNIE 3.0•A modular LLM architecture with a universal representation module and task-specific representa- -tion module helps in the finetuning phase -•Optimizing the parameters of a task-specific representation network during the fine-tuning phase is -an efficient way to take advantage of the powerful pre-trained model -Jurassic-1•The performance of LLM is highly related to the network size -•To improve runtime performance, more operations can be performed in parallel (width) rather than -sequential (depth) -•To efficiently represent and fit more text in the same context length, the model uses a larger vo- -cabulary to train a SentencePiece tokenizer without restricting it to word boundaries. 
This further -benefits in few-shot learning tasks -HyperCLOV A•By employing prompt-based tuning, the performances of models can be improved, often surpassing -those of state-of-the-art models when the backward gradients of inputs are accessible -Yuan 1.0•The model architecture that excels in pre-training and fine-tuning cases may exhibit contrasting -behavior in zero-shot and few-shot learning -Gopher •Relative encodings enable the model to evaluate for longer sequences than training. -ERNIE 3.0 Titan•Additional self-supervised adversarial loss to distinguish between real and generated text improves -the model performance as compared to ERNIE 3.0 -GPT-NeoX-20B•Parallel attention +FF layers speed-up training 15% with the same performance as with cascaded -layers -•Initializing feed-forward output layers before residuals with scheme in [153] avoids activations -from growing with increasing depth and width -•Training on Pile outperforms GPT-3 on five-shot -Table Continued on Next Page -12 - ---- Page 13 --- -Models Findings & Insights -OPT•Restart training from an earlier checkpoint with a lower learning rate if loss diverges -•Model is prone to generate repetitive text and stuck in a loop -Galactica•Galactica’s performance has continued to improve across validation set, in-domain, and out-of- -domain benchmarks, even with multiple repetitions of the corpus, which is superior to existing -research on LLMs -•A working memory token approach can achieve strong performance over existing methods on -mathematical MMLU and MATH benchmarks. It sets a new state-of-the-art on several downstream -tasks such as PubMedQA (77.6%) and MedMCQA dev (52.9%) -GLaM•The model capacity can be maintained at reduced computation by replacing the feed-forward layer -in each transformer layer with a mixture-of-experts (MoE) -•The model trained on filtered data shows consistently better performances on both NLG and NLU -tasks, where the e ffect of filtering is more significant on the former tasks -•Filtered pretraining corpora play a crucial role in the generation capability of LLMs, especially for -the downstream tasks -•The scaling of GLaM MoE models can be achieved by increasing the size or number of experts in -the MoE layer. 
GLaM
• Model capacity can be maintained at reduced computation by replacing the feed-forward layer in each transformer layer with a mixture-of-experts (MoE)
• The model trained on filtered data shows consistently better performance on both NLG and NLU tasks, where the effect of filtering is more significant on the former
• Filtered pre-training corpora play a crucial role in the generation capability of LLMs, especially for downstream tasks
• The scaling of GLaM MoE models can be achieved by increasing the size or number of experts in the MoE layer. Given a fixed budget of computation, more experts contribute to better performance

LaMDA
• The model can be fine-tuned to learn to call different external information resources and tools

AlphaCode
• For higher effectiveness and efficiency, a transformer model can be asymmetrically constructed with a shallower encoder and a deeper decoder
• To achieve better performance, it is necessary to employ strategies such as massively scaled-up sampling, followed by filtering and clustering of samples into a compact set
• The utilization of novel sampling-efficient transformer architectures designed to facilitate large-scale sampling is crucial
• Simplifying problem descriptions can effectively improve the model's performance

Chinchilla
• The model size and the number of training tokens should be scaled proportionately: for each doubling of the model size, the number of training tokens should be doubled as well

PaLM
• English-centric models produce better translations when translating to English as compared to non-English
• Generalized models can have performance equivalent to specialized small models for language translation
• Larger models have a higher percentage of training data memorization
• Performance has not yet saturated even at the 540B scale, which means larger models are likely to perform better

AlexaTM
• The encoder-decoder architecture is more suitable for training LLMs than decoder-only, given its bidirectional attention to the context
• A Causal Language Modeling (CLM) task can be added to benefit the model with efficient in-context learning
• Placing layer norm at the beginning of each transformer layer improves training stability

U-PaLM
• Training with a mixture of denoisers outperforms PaLM when trained further for a few more FLOPs
• Training with a mixture of denoisers improves the infilling ability and open-ended text generation diversity

UL2
• Mode-switching training enables better performance on downstream tasks
• CoT prompting outperforms standard prompting for UL2

GLM-130B
• Pre-training data with a small proportion of multi-task instruction data improves the overall model performance

CodeGen
• Multi-step prompting for code synthesis leads to better user intent understanding and code generation

LLaMA
• A constant performance improvement is observed when scaling the model
• Smaller models can achieve good performance with more training data and computing time

PanGu-Σ
• Sparse models provide the benefits of large models at a lower computation cost
• Randomly Routed Experts reduce catastrophic forgetting effects, which in turn is essential for continual learning
• Randomly Routed Experts allow extracting a domain-specific sub-model in deployment, which is cost-efficient while maintaining performance similar to the original

BloombergGPT
• Pre-training with general-purpose and task-specific data improves task performance without hurting other model capabilities

XuanYuan 2.0
• Combining the pre-training and fine-tuning stages in a single training run avoids catastrophic forgetting

CodeT5+
• Causal LM is crucial for a model's generation capability in encoder-decoder architectures
• Multiple training objectives like span corruption, causal LM, matching, etc., complement each other for better performance

StarCoder
• The HHH prompt by Anthropic allows the model to follow instructions without fine-tuning

LLaMA-2
• A model trained on unfiltered data is more toxic but may perform better on downstream tasks after fine-tuning
• A model trained on unfiltered data requires fewer samples for safety alignment
PaLM-2
• Data quality is important for training better models
• Model and data size should be scaled in 1:1 proportion
• Smaller models trained for more iterations outperform larger models

LLaMA-3/3.1
• Increasing the batch size gradually stabilizes training without loss spikes
• High-quality data at the final stages of training improves model performance
• Increasing the model context length window step-wise allows it to better adapt to various sequence lengths

Nemotron-40B
• A model aligned iteratively on synthetic data, with data generated from the previously aligned model, achieves competitive performance

DeepSeek
• Batch size should increase with the increase in compute budget while decreasing the learning rate

DeepSeek-V2
• Multi-head latent attention (MLA) performs better than multi-head attention (MHA) while requiring a significantly smaller KV cache, therefore achieving faster data generation

Table 2: Key insights and findings from the study of instruction-tuned Large Language Models.

T0
• Multi-task prompting enables zero-shot generalization and outperforms baselines
• Even a single prompt per dataset task is enough to improve performance

WebGPT
• To aid the model in effectively filtering and utilizing relevant information, human labelers play a crucial role in answering questions regarding the usefulness of the retrieved documents
• Interfacing a fine-tuned language model with a text-based web-browsing environment can improve end-to-end retrieval and synthesis via imitation learning and reinforcement learning
• Generating answers with references can make it easy for labelers to judge the factual accuracy of answers

Tk-INSTRUCT
• Instruction tuning leads to stronger generalization to unseen tasks
• More tasks improve generalization, whereas only increasing task instances does not help
• Supervised trained models are better than generalized models
• Models pre-trained with instructions and examples perform well for different types of inputs

mT0 and BLOOMZ
• Instruction tuning enables zero-shot generalization to tasks never seen before
• Multi-lingual training leads to even better zero-shot generalization for both English and non-English
• Training on machine-translated prompts improves performance for held-out tasks with non-English prompts
• English-only fine-tuning on a multilingual pre-trained language model is enough to generalize to other pre-trained language tasks

OPT-IML
• Creating a batch with multiple task examples is important for better performance
• Example-proportional sampling alone is not enough; training datasets should also be proportional for better generalization/performance
• Performance on fully held-out and partially supervised tasks improves by scaling the number of tasks or categories, whereas fully supervised tasks see no effect
• Including small amounts, i.e., 5%, of pre-training data during fine-tuning is effective
• Only 1% reasoning data improves performance; adding more deteriorates performance
• Adding dialogue data makes the performance worse
Sparrow
• Labelers' judgment and well-defined alignment rules help the model generate better responses
• Good dialogue goals can be broken down into detailed natural language rules for the agent and the raters
• The combination of reinforcement learning (RL) with reranking yields optimal performance in terms of preference win rates and resilience against adversarial probing

Flan
• Fine-tuning with CoT improves performance on held-out tasks
• Fine-tuning along with CoT data improves reasoning abilities
• CoT tuning improves zero-shot reasoning
• Performance improves with more tasks
• Instruction fine-tuning improves usability, which is otherwise challenging for pre-trained models
• Improving the model's performance with instruction tuning is compute-efficient
• Multi-task prompting enables zero-shot generalization abilities in LLMs

WizardCoder
• Fine-tuning with instruction-tuning data re-written into a more complex set improves performance

LLaMA-2-Chat
• The model learns to write safe responses with fine-tuning on safe demonstrations, while an additional RLHF step further improves model safety and makes it less prone to jailbreak attacks

LIMA
• A small amount of high-quality data is enough for fine-tuned model generalization

Figure 10: This example illustrates the PanGu-Σ architecture, as depicted in the image sourced from [92].

3.2.1. Instruction-Tuning with Manually Created Datasets
Numerous hand-crafted instruction-tuning datasets with different design choices have been proposed in the literature to instruction-tune LLMs. The performance of fine-tuned LLMs depends on multiple factors, such as the dataset, instruction diversity, prompting templates, model size, and training objectives. Keeping this in view, diverse fine-tuned models have emerged in the literature using manually created datasets.
The models T0 [17] and mT0 (multi-lingual) [154] employ templates to convert existing datasets into prompt datasets. They have shown improvements in generalization to zero-shot and held-out tasks. Tk-Instruct [18] fine-tuned the T5 model with in-context instructions to study generalization to unseen tasks when given in-context instructions at test time. The model outperformed Instruct-GPT despite being smaller in size, i.e., 11B parameters as compared to the 175B of GPT-3.
Increasing Tasks and Prompt Setups: Zero-shot and few-shot performance improves significantly by expanding the task collection and prompt styles. OPT-IML [97] and Flan [16] curated larger datasets of 2k and 1.8k tasks, respectively. While increasing the task count alone is not enough, OPT-IML and Flan add more prompting setups to their datasets: zero-shot, few-shot, and CoT. In continuation, CoT Collection [101] fine-tunes Flan-T5 further on 1.88M CoT samples. Another method [102] uses symbolic tasks together with the tasks in T0, Flan, etc.
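As an illustration of how such templates turn a labeled dataset into prompt-target pairs, here is a minimal T0-style sketch; the two NLI templates and the example record below are our own illustrations, not taken from the actual prompt collections:

```python
# A minimal sketch of template-based prompt construction in the spirit of
# T0/mT0; templates and the example record are illustrative.
TEMPLATES = [
    ("Premise: {premise}\nHypothesis: {hypothesis}\n"
     "Does the premise entail the hypothesis? Answer yes, no, or maybe.",
     {0: "yes", 1: "maybe", 2: "no"}),
    ("{premise}\nQuestion: is it true that \"{hypothesis}\"? yes, no, or maybe?",
     {0: "yes", 1: "maybe", 2: "no"}),
]

def to_prompt_dataset(example):
    """Turn one labeled NLI record into several (input, target) pairs."""
    pairs = []
    for template, verbalizer in TEMPLATES:
        prompt = template.format(**example)
        pairs.append((prompt, verbalizer[example["label"]]))
    return pairs

pairs = to_prompt_dataset(
    {"premise": "A dog is running.", "hypothesis": "An animal is moving.",
     "label": 0})
```

Each dataset thus contributes several differently worded prompts per example, which is what drives the zero-shot generalization reported by these models.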
3.2.2. Instruction-Tuning with LLMs Generated Datasets
Generating an instruction-tuning dataset requires carefully writing instructions and input-output pairs, which are often written by humans, smaller in size, and less diverse. To overcome this, self-instruct [19] proposed an approach to prompt available LLMs to generate instruction-tuning datasets. Self-instruct outperformed models trained on the manually created dataset SUPER-NATURALINSTRUCTIONS (a dataset with 1600+ tasks) [18] by 33%. It starts with a seed of 175 tasks, 1 instruction, and 1 sample per task and iteratively generates new instructions (52k) and instances (82k input-output pairs) using GPT-3 [6]. Contrary to this, Dynosaur [155] uses the metadata of datasets on Hugging Face to prompt LLMs to generate multiple task instruction-tuning datasets.

Figure 11: An example image shows an instance of the Flan training paradigm, taken from [16].

LLaMA Tuned: Various models in the literature instruction-tune LLaMA [156] with GPT-3 [6] or GPT-4 [157] generated datasets. Among these, Alpaca [158], Vicuna [159], and LLaMA-GPT-4 [160] are a few general-purpose fine-tuned models, where Alpaca is trained on 52k samples from text-davinci-003, Vicuna on 70k samples from ShareGPT.com, and LLaMA-GPT-4 by re-creating Alpaca instructions with GPT-4. Goat [161] fine-tunes LLaMA for arithmetic tasks (1 million samples) by generating data from ChatGPT and outperforms GPT-4, PaLM, BLOOM, OPT, etc., attributing its success to LLaMA's consistent tokenization of numbers. HuaTuo [162] is a medical knowledge model fine-tuned with a generated QA dataset of 8k instructions.
Complex Instructions: Evol-Instruct [163, 164] prompts LLMs to convert given instructions into a more complex set. The instructions are iteratively evolved by re-writing them in complex wording and creating new instructions. With this style of automated instruction generation, WizardLM [163] (LLaMA fine-tuned on 250k instructions) outperforms Vicuna and Alpaca, and WizardCoder [164] (fine-tuned StarCoder) beats Claude-Plus, Bard, and others.
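To make the bootstrapping idea behind these generated datasets concrete, the following is a simplified self-instruct-style sketch; llm stands for any hypothetical text-completion function, and the real pipeline additionally applies ROUGE-based deduplication and quality filtering that we omit here:

```python
# A high-level, simplified sketch of a self-instruct-style bootstrapping
# loop; `llm` is a hypothetical text-completion function.
import random

def self_instruct(seed_tasks, llm, target_size=52_000):
    # each task: {"instruction": str, "instances": [str, ...]}
    pool = list(seed_tasks)
    while len(pool) < target_size:
        demos = random.sample(pool, k=min(8, len(pool)))
        prompt = ("Come up with a new task.\n"
                  + "\n".join(f"Task: {t['instruction']}" for t in demos)
                  + "\nTask:")
        new_instruction = llm(prompt).strip()
        instance = llm(f"Task: {new_instruction}\n"
                       "Generate an input-output pair for this task:")
        # keep only instructions not already in the pool (real pipeline
        # also filters near-duplicates with ROUGE)
        if new_instruction and new_instruction not in {
                t["instruction"] for t in pool}:
            pool.append({"instruction": new_instruction,
                         "instances": [instance]})
    return pool
```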
3.2.3. Aligning with Human Preferences
Incorporating human preferences into LLMs presents a significant advantage in mitigating undesirable behaviors and ensuring accurate outputs. The initial work on alignment, such as InstructGPT [20], aligns GPT-3 using a 3-step approach: instruction-tuning, reward modeling, and fine-tuning with reinforcement learning (RL). GPT-3, supervised fine-tuned on demonstrations, is queried to generate responses, which human labelers rank according to human values, and a reward model is trained on the ranked data. Lastly, GPT-3 is trained with proximal policy optimization (PPO) using the rewards produced by the reward model on the generated data. LLaMA 2-Chat [21] improves alignment by dividing reward modeling into helpfulness and safety rewards and using rejection sampling in addition to PPO. The initial four versions of LLaMA 2-Chat are fine-tuned with rejection sampling and then with PPO on top of rejection sampling.
Aligning with Supported Evidence: This style of alignment allows the model to generate responses with proofs and facts, reduces hallucination, and assists humans more effectively, which increases trust in the model's output. Similar to the RLHF training style, a reward model is trained to rank generated responses containing web citations in answers to questions, which is later used to train the model, as in GopherCite [165], WebGPT [166], and Sparrow [167]. The ranking model in Sparrow [167] is divided into two branches, preference reward and rule reward, where human annotators adversarially probe the model to break a rule. These two rewards together rank a response for training with RL.
Aligning Directly with SFT: The PPO in the RLHF pipeline is complex, memory-intensive, and unstable, requiring multiple models: reward, value, policy, and reference models. Avoiding this sophisticated alignment pipeline is possible by incorporating minimal changes in the supervised fine-tuning (SFT) pipeline, as in [168, 169, 170], with better or comparable performance to PPO. Direct preference optimization (DPO) [168] trains a model directly on the human-preferred responses to maximize the likelihood of preferred over unpreferred responses, with a per-sample importance weight. Reward-ranked fine-tuning (RAFT) [169] fine-tunes the model on responses ranked by the reward model. Preference ranking optimization (PRO) [171] and RRHF [170] penalize the model to rank responses with human preferences using a supervised loss. On the other hand, chain-of-hindsight (CoH) [172] provides feedback to the model in language rather than as a reward, to learn good versus bad responses.
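The DPO objective is compact enough to state directly. Below is a minimal sketch, assuming the inputs are sequence log-likelihoods of the preferred (w) and dispreferred (l) responses under the trainable policy and the frozen reference model:

```python
# A minimal sketch of the DPO objective from [168]:
# L = -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))),
# where each term is a sequence log-likelihood.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward margin between preferred and dispreferred responses.
    logits = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()
```

Because the reward is expressed implicitly through the policy and reference log-probabilities, no separate reward or value network has to be trained, which is what removes most of PPO's complexity.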
Aligning with Synthetic Feedback: Aligning LLMs with human feedback is slow and costly. The literature suggests a semi-automated process to align LLMs by prompting LLMs to generate helpful, honest, and ethical responses to queries and fine-tuning on the newly created dataset. Constitutional AI [173] replaces human feedback in RLHF with AI, calling it RL from AI feedback (RLAIF). AlpacaFarm [174] designs prompts to imitate human feedback using LLM APIs. Opposite to Constitutional AI, AlpacaFarm injects noise into the feedback to replicate human mistakes. Self-Align [98] prompts the LLM with ICL examples, instructing the LLM about what the response should contain to be considered useful and ethical. The same LLM is later fine-tuned on the new dataset.
Aligning with Prompts: LLMs can be steered with prompts to generate desirable responses without training [175, 176]. The self-correction prompting in [176] concatenates instructions and CoT with questions, guiding the model to answer its instruction following a strategy to ensure moral safety before giving the actual answer. This strategy is shown to significantly reduce the harm in generated responses.
Red-Teaming/Jailbreaking/Adversarial Attacks: LLMs exhibit harmful behaviors, hallucinations, leaking of personal information, and other shortcomings under adversarial probing. The models are susceptible to generating harmful responses even though they are aligned for safety [177, 178]. Red-teaming is a common approach to address illicit outputs, where the LLMs are prompted to generate harmful outputs [178, 179]. The dataset collected through red-teaming is used to fine-tune models for safety. While red-teaming largely relies on human annotators, another work [180] red-teams LLMs to find prompts that lead to harmful outputs for other LLMs.

3.2.4. Continue Pre-Training
Although fine-tuning boosts a model's performance, it leads to catastrophic forgetting of previously learned information. Concatenating fine-tuning data with a few randomly selected pre-training samples in every iteration avoids network forgetting [181, 152]. This is also effective in adapting LLMs to cases where the fine-tuning data is small and the original capacity is to be maintained. Prompt-based continued pre-training (PCP) [182] trains the model with text and instructions related to tasks and then finally instruction-tunes the model for downstream tasks.
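A minimal sketch of this replay-style mixing is shown below; the batch size and the replay fraction are illustrative values, not those used in [181, 152]:

```python
# A minimal sketch of mixing replayed pre-training samples into each
# fine-tuning batch to reduce catastrophic forgetting; the 1% replay
# ratio is illustrative.
import random

def mixed_batch(finetune_data, pretrain_data, batch_size=32,
                replay_frac=0.01):
    n_replay = max(1, int(batch_size * replay_frac))
    batch = random.sample(finetune_data, batch_size - n_replay)
    batch += random.sample(pretrain_data, n_replay)  # replayed samples
    random.shuffle(batch)
    return batch
```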
3.2.5. Sample Efficiency
While fine-tuning data is generally many-fold smaller than the pre-training data, it still has to be large enough for acceptable performance [16, 97, 18] and requires proportional computing resources. Studying the effect of less data on performance, the existing literature [183, 184] finds that models trained on less data can outperform models trained with more data. In [183], 25% of the total downstream data is found to be enough for state-of-the-art performance. Selecting a coreset-based 0.5% of the total instruction-tuning data improves model performance by 2% in [184], as compared to tuning on the complete data. Less is more for alignment (LIMA) [185] uses only 1000 carefully created demonstrations to fine-tune the model and achieves performance comparable to GPT-4.

3.3. Increasing Context Window
LLMs are trained with limited context windows due to expensive attention and high memory requirements. A model trained on limited sequence lengths fails to generalize to unseen lengths at inference time [186, 49]. Alternatively, LLMs with ALiBi [65] positional encodings can perform zero-shot length extrapolation. However, ALiBi has less expressive power [66] and inferior performance on multiple benchmarks [46], and many LLMs use RoPE positional embeddings, which are unable to perform zero-shot extrapolation. A larger context length has benefits such as a better understanding of longer documents, more samples for in-context learning, execution of bigger reasoning processes, etc. Expanding the context length during fine-tuning is slow, inefficient, and computationally expensive [49]. Therefore, researchers employ various context window extrapolation techniques, discussed below.
Position Interpolation: Rather than extrapolating, [49] shows that interpolating position encodings within the pre-trained context window is more effective. The work demonstrates that only 1000 steps of fine-tuning are enough to achieve better results on larger windows without degrading performance on the original context size. Giraffe [46] uses power scaling in RoPE, and YaRN [47] proposed NTK-aware interpolation.
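The core idea of position interpolation is essentially a one-line change to how RoPE positions are computed. Below is a minimal sketch assuming a pre-trained window of 2048 tokens; the scaling rule follows [49], while the function layout and defaults are ours:

```python
# A minimal sketch of position interpolation for RoPE: positions of a
# long input are rescaled into the pre-trained window before computing
# the rotary angles, instead of extrapolating beyond it.
import torch

def rope_angles(seq_len, dim, trained_len=2048, base=10000.0):
    positions = torch.arange(seq_len, dtype=torch.float32)
    if seq_len > trained_len:
        # Interpolate: squeeze positions back into [0, trained_len).
        positions = positions * (trained_len / seq_len)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions, inv_freq)  # (seq_len, dim/2) angles
```

Keeping every position inside the range seen during pre-training is why only a short fine-tuning phase is needed to adapt to the larger window.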
Efficient Attention Mechanism: Dense global attention is one of the major constraints in training larger-context-window LLMs. Using efficient attention variants, such as local, sparse, and dilated attention, reduces the computation cost significantly. LongT5 [48] proposes transient global attention (TGlobal), applying attention to local and global tokens (windowed token averaging). The model replaces the attention in T5 [10] with TGlobal attention, pre-trains the model on a 4096 sequence length, fine-tunes on larger window sizes, as large as 16k, and improves task performance on longer inputs. This shows the extrapolation ability of TGlobal attention with only fine-tuning. COLT5 [187] uses two branches, one with lightweight and the other with heavyweight attention and feed-forward layers. All tokens are processed by the lightweight branch, and only important tokens are routed to the heavyweight branch. LongNet [188] replaces standard attention with dilated attention, expanding the sequence length to 1 billion tokens. LongLoRA [189] proposes shift-short attention, used during fine-tuning to reduce dense attention costs. However, the model uses dense attention during inference and achieves performance similar to full-attention fine-tuning.
Extrapolation without Training: LM-Infinite [186] and parallel context windows (PCW) [190] show that length extrapolation is possible using pre-trained LLMs. LM-Infinite suggested Λ-shaped attention applied within the original context window limits. Likewise, PCW chunks larger inputs into the pre-trained context lengths and applies the same positional encodings to each chunk.

3.4. Augmented LLMs
LLMs are capable of learning from the examples concatenated with the input, known as context augmentation, in-context learning (ICL), or few-shot prompting. They show excellent generalization to unseen tasks with few-shot prompting, enabling LLMs to answer queries beyond the capacity acquired during training [6, 55]. These emergent abilities allow for adapting the model without fine-tuning, a costly process. Aside from this, hallucination, i.e., producing inaccurate, unsafe, or factually incorrect responses, is common for LLMs and is avoided by augmenting contextual data. While the user can provide in-context samples in the query [54, 32], here we specifically refer to methods that access external storage programmatically, calling them augmented LLMs.
The literature suggests various external memory designs to augment LLMs: long-term [191, 192, 193, 194], short-term [195], symbolic [196], and non-symbolic [197, 198]. The memory can be maintained in different formats such as documents, vectors, or databases. A few systems maintain intermediate memory representations to retain information across multiple iterations [194, 192], while others extract important information from the datasets and save it in memory for recall [199]. The memory read and write operations are performed either with or without the LLM's cooperation [192, 200, 194, 201], acting as a feedback signal in [195]. We discuss different types of augmented LLMs below.

Figure 12: A flow diagram of Retrieval Augmented LLMs. The retriever extracts a similar context to the input and forwards it to the LLM either in simple language or encoded through Fusion-in-Decoder (FiD). Depending on the task, retrieval and generation may repeat multiple times.

3.4.1. Retrieval Augmented LLMs
LLMs may have limited memory and outdated information, leading to inaccurate responses. Retrieving relevant information from external, up-to-date storage enables LLMs to answer accurately with references and utilize more information. With retrieval augmentation, smaller models have been shown to perform on par with larger models. For instance, an 11B model can become competitive with the 540B PaLM in [25], and a 7.5B model with the 280B Gopher in [193]. Retrieval-augmented language modeling (RALM) has two major components, shown in Figure 12, namely: 1) the retriever and 2) the language model. In RALM, the retriever plays a crucial role in driving the LLM's response, where incorrect information can steer the LLM to false behavior. This has led to the development of various methods to retrieve accurate information and fuse it with the query for better performance.
Zero-Shot Retrieval Augmentation: This kind of augmentation keeps the original LLM architecture and weights unchanged and uses BM25 [202], nearest neighbors, or frozen pre-trained models like BERT [7] as the retriever. The retrieved information is provided as input to the model for response generation, shown to improve performance over LLMs without retrieval [198, 203]. In some scenarios, multiple retrieval iterations are required to complete the task. The output generated in the first iteration is forwarded to the retriever to fetch similar documents. Forward-looking active retrieval (FLARE) [197] initially generates the response and corrects the output by retrieving relevant documents if the response contains low-confidence tokens. Similarly, RepoCoder [204] fetches code snippets recursively for code completion.
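A minimal sketch of this zero-shot setup is shown below; bm25 is assumed to be a ready-made scorer (e.g., from the rank_bm25 package) and llm any text-generation function, both of which are our assumptions rather than components prescribed by the cited works:

```python
# A minimal sketch of zero-shot retrieval augmentation with BM25;
# `bm25` and `llm` are assumed to be provided by the caller.
def answer_with_retrieval(query, corpus, bm25, llm, k=3):
    scores = bm25.get_scores(query.split())
    top = sorted(range(len(corpus)), key=lambda i: scores[i],
                 reverse=True)[:k]
    context = "\n".join(corpus[i] for i in top)  # retrieved passages
    return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```

Because neither the retriever nor the LLM is trained, this pattern can be bolted onto any frozen model, which is exactly what makes it "zero-shot".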
Training with Retrieval Augmentation: To reduce failures in retrieval-augmented generation (RAG), researchers train or fine-tune retrievers and LLMs with a retrieval augmentation pipeline. We discuss the literature below based on its focus on the respective training processes of the pipeline.
Training LLM: Retrieval-enhanced transformer (RETRO) [193] shows that pre-training smaller LLMs with a RAG pipeline outperforms larger LLMs trained without RAG, such as GPT-3. RETRO uses a 2-trillion-token subset of MassiveText as a database. The retrieval pipeline divides the input query into subsets and retrieves relevant chunks from the database for each subset, which are encoded together with the input's intermediate representations for generating tokens. It uses cross-chunked attention to attend to previous chunks auto-regressively. A study on RETRO [205] shows that models pre-trained without RAG but fine-tuned using RAG lack the performance gains obtained by pre-training with RAG.
Training Retriever: The quality of responses generated by LLMs is highly dependent on the in-context examples. Therefore, [206, 207, 208, 209] train retrievers to retrieve accurate few-shot samples while keeping the LLM frozen for generation. Retrieved samples are ranked to build ground-truth data for training retrievers with contrastive learning in [206, 208]. RoBERTa is trained for downstream tasks in [207] for ICL sample retrieval. REPLUG [209] trains the retriever with supervised signals from the frozen LLM's generated outputs.
Training Retriever and LLM: Further benefits are achieved by training both the retriever and the model in [25, 210, 211]. In this case, the error propagates back to the retriever, updating both the language model and the retriever. While masked language modeling (MLM) is a common pre-training objective [25, 211], the retrieval pre-trained transformer (RPT) [210] used document chunk prediction as a pre-training objective for long text modeling.
Encoded Context Augmentation: Concatenating retrieved documents with the query becomes infeasible as the sequence length and sample size grow. Encoding the context and fusing it in the decoder (Fusion-in-Decoder) using cross-attention makes it possible to augment more samples without significantly increasing computation costs [212, 193, 210, 25].
Web Augmented: Locally stored memory, external to the LLM, has limited information. However, a large amount of information is available on the internet, which is updated regularly. Rather than storing information locally, various methods retrieve query-related context through a web search and forward it to the LLM [213, 214, 166].

3.4.2. Tool Augmented LLMs
While RAG relies on the retriever to provide context to the LLM to answer queries, tool-augmented LLMs capitalize on the reasoning abilities of LLMs to iteratively plan by dividing tasks into sub-tasks, selecting necessary tools, and taking actions to complete the task [215, 216, 217, 27]. A generic pipeline of tool-augmented LLMs is shown in Figure 13, where the different modules are selected in a loop until task completion.

Figure 13: A basic flow diagram of tool augmented LLMs. Given an input and a set of available tools, the model generates a plan to complete the task. The tool augmented LLMs utilize different modules iteratively, such as retriever, tool execution, read-write to memory, feedback, etc., depending on the task.

Zero-Shot Tool Augmentation: LLMs' in-context learning and reasoning abilities enable them to interact with tools without training. Automatic reasoning and tool-use (ART) [217] builds a task library with demonstrations of reasoning steps and calls to external tools. It retrieves similar task examples and provides the context to the LLM for inference. Aside from this, [218] shows that tool documentation is enough to teach LLMs to use tools without demonstrations. RestGPT [219] integrates LLMs with RESTful APIs by decomposing tasks into planning and API selection steps. The API selector understands the API documentation to select a suitable API for the task and plan the execution. ToolkenGPT [220] uses tools as tokens by concatenating tool embeddings with other token embeddings. During inference, the LLM generates the tool tokens representing a tool call, stops text generation, and restarts using the tool execution output.
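A minimal sketch of such a plan-act-observe loop is given below; the tool registry, the FINISH(...) stop convention, and the llm function are all hypothetical simplifications of the pipelines cited above, not any single system's interface:

```python
# A high-level sketch of the plan -> tool call -> observe loop of
# tool-augmented LLMs; tool names, the stop marker, and `llm` are
# hypothetical simplifications.
def run_agent(task, tools, llm, max_steps=8):
    transcript = f"Task: {task}\nAvailable tools: {', '.join(tools)}\n"
    for _ in range(max_steps):
        step = llm(transcript
                   + "Next action as 'tool: input', or FINISH(answer):")
        if step.startswith("FINISH"):
            return step[len("FINISH("):-1]        # final answer
        name, _, arg = step.partition(":")
        observation = tools[name.strip()](arg.strip())  # execute the tool
        transcript += f"{step}\nObservation: {observation}\n"
    return None  # give up after the step budget is exhausted
```

The transcript grows with each observation, so the model re-plans with full knowledge of what the tools actually returned rather than what it assumed they would.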
Training with Tool Augmentation: LLMs are trained to interact with diverse tools, enhancing planning abilities to overcome the limitations of zero-shot tool augmentation [221, 27, 222, 223]. Gorilla [221] instruction-tunes LLaMA with information retrieval from API documentation. It uses the self-instruct [19] data generation pipeline with GPT-4, providing in-context examples retrieved from API documentation. Tool-augmented language model (TALM) [27] fine-tunes T5 [10] for tool use with a self-play approach, where it iteratively completes tool manipulation tasks and includes them back in the training set. ToolLLM [223] collects 16k APIs from RapidAPI. It samples APIs from the list to generate an instruction-tuning dataset using ChatGPT in single-tool and multi-tool scenarios. For high-quality datasets, ToolLLM suggested a depth-first search-based decision tree (DFSDT) method to generate ground truths with diverse reasoning and planning.
Multimodal Tool Augmentation: The compositional reasoning capacity of LLMs allows them to manipulate tools in multimodal settings [215, 216, 224]. Following the pipeline shown in Figure 13, the LLM outlines a plan, generally executing in a sequence: Plan → Tool selection → Execute → Inspect → Generate, to respond to the user query. Here, the database of tools is rich in modalities, including text, images, etc. Many of the multimodal tool augmentation systems employ multimodal LLMs [31, 225, 224, 216], while others utilize single-modality LLMs and generate a plan for using tools of different modalities to solve multimodal queries [226].

3.5. LLMs-Powered Agents
AI agents are autonomous entities capable of planning, decision-making, and performing actions to achieve complex goals. In the early days, AI agents were rule-based, designed for narrow tasks, and had limited capabilities, such as Clippy [227] and Deep Blue [228]. In contrast to this, LLMs' ability to respond to dynamic scenarios has made it possible to incorporate them in diverse applications, including LLMs-powered agents [224, 216], where LLMs behave as the brain of the agent. LLMs have been incorporated in web agents [166, 167], coding agents [229], tool agents [27, 223], embodied agents [26], and conversational agents [195], requiring minimal to no fine-tuning. Below we summarize the research in LLMs-based autonomous agents. For a more detailed discussion, please refer to [230, 231].
LLMs Steering Autonomous Agents: LLMs are the cognitive controllers of autonomous agents. They generate plans, reason about tasks, incorporate memory to complete tasks, and adapt the outline depending on feedback from the environment. Depending on the acquired capabilities of LLMs, many methods fine-tune, propose better prompting approaches, or utilize different modules to enhance agents' performance. The modules and strategies employed in autonomous agents are briefly discussed below.
Planning and Reasoning: Completing a complex task requires human-like logical thinking, planning the necessary steps, and reasoning about current and future directions. Prompting methods like chain-of-thoughts [103], tree-of-thoughts [105], and self-consistency [104] are central to agents, eliciting LLMs to reason about their actions and choose among different paths for task completion. When LLMs are prompted with a task description and a sequence of actions, they can accurately generate plan actions without any fine-tuning [232]. Reasoning via planning (RAP) [233] incorporates a re-purposed LLM as a world model to reason about future outcomes and explore alternative paths for task completion. Retroformer [234] uses a retrospective LLM to improve the main LLM's planning and reasoning capabilities by providing helpful task cues.
Feedback: LLMs in open-loop systems generate plans and assume that the agent will complete them successfully. However, the actual scenario is different, with failures and variable responses from the environment. To correctly complete tasks, many methods use LLMs in a closed loop, where the action response is provided as feedback to the LLM to re-assess and update the plan as required [235, 236, 237, 195]. Another direction of research exploits LLMs as reward functions to train reinforcement learning (RL) policies instead of humans [238].
Memory: LLMs can learn from the context provided in the prompt. In addition to internal memory, various systems employ external memory to save the response history. Reflexion [195] maintains an episodic memory to use previous responses as feedback to improve future decision-making. Retroformer [234] improves its responses by employing short-term and long-term memory, where short-term memory contains recent responses and long-term memory keeps summarized failed attempts to add to the prompt as reflections.
Multi-Agent Systems: LLMs can play user-defined roles and behave like specific domain experts. In multi-agent systems, each LLM is assigned a unique role, simulating human behavior and collaborating with other agents to complete a complex task [229, 239].
LLMs in Physical Environment: LLMs are good at instruction-following; however, utilizing them for physically grounded tasks requires adaptation, as they lack real-world knowledge. This can lead to generating illogical responses for a particular physical situation [240, 26]. SayCan [240] makes LLMs aware of the available low-level task operations. The LLM (Say) builds a high-level plan to complete the task, and a learned affordance function (Can) explores the possibility of executing the plan in the real world. SayCan uses RL to train the language-conditioned affordance function. PaLM-E enables the LLM to solve grounded tasks by training a multi-modal LLM fed with inputs directly from the sensors.
Manipulation: In the area of manipulation [236, 241], LLMs enhance a robot's dexterity and adaptability, excelling in tasks like object recognition, grasping, and collaboration. They analyze visual and spatial information to determine the most effective approach to interacting with objects.
Navigation: LLMs enhance a robot's ability to navigate complex environments with precision and adaptability [242, 243, 244, 245]. They generate feasible paths and trajectories for robots, accounting for intricate environmental details [246]. This ability is valuable in scenarios requiring precise and dynamically adaptable navigation in environments like warehouses, transport, healthcare facilities, and residences.
3.6. Efficient LLMs
Deploying LLMs in production is expensive. Reducing their running costs while preserving performance is an appealing area of research. This section summarizes the approaches suggested to enhance LLMs' efficiency.

3.6.1. Parameter Efficient Fine-Tuning
Fine-tuning LLMs with tens or hundreds of billions of parameters, such as GPT-3 (175B), BLOOM (176B), MT-NLG (530B), etc., is computationally intensive and time-consuming. To avoid complete model fine-tuning, numerous parameter-efficient fine-tuning (PEFT) techniques [40, 247, 41, 38, 39] try to achieve acceptable fine-tuning performance at reduced costs. As compared to full fine-tuning [248], PEFT performs better in low-resource setups, achieves comparable performance in medium-resource scenarios, and performs worse than full fine-tuning under high resource availability. An overview of different PEFT approaches is shown in Figure 14.

Figure 14: Illustration of parameter-efficient fine-tuning paradigms, where x is the input and h is the hidden state; figure courtesy of [38]. Parallel adapter and LoRA fall in the adapter tuning category.

Adapter Tuning: Adds a few trainable parameters within the transformer block. The adapter layer is a sequence of feature downscaling, non-linearity, and upscaling [106]. Variants of adapter tuning inject adapter layers sequentially [106] or in parallel [38], whereas the mixture of adapters (AdaMix) [249] employs multiple adapter modules in a single layer. AdaMix routes input instances randomly to one of the multiple downscale and upscale modules. The mixture of adapters is averaged out for inference to avoid additional latency. Low-Rank Adaptation (LoRA) [250] learns low-rank decomposed matrices while freezing the original weights. The learned weights are fused with the original weights for inference, avoiding latency.
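A minimal sketch of a LoRA-wrapped linear layer is shown below, following the general recipe of [250]; the rank and scaling values are illustrative defaults rather than prescribed settings:

```python
# A minimal sketch of a LoRA layer: the frozen weight W is augmented
# with a trainable low-rank update B @ A, which can be fused into W
# after training for latency-free inference.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze original weights
        self.A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def fuse(self):
        """Merge the low-rank update into W once training is done."""
        self.base.weight.data += self.scale * (self.B @ self.A)
```

Since B starts at zero, the wrapped layer initially behaves exactly like the frozen base layer, and only the small A and B matrices accumulate task-specific updates.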
Prompt Tuning: Prompting is an effective way to adapt a pre-trained LLM to a downstream task. However, manual prompts bring uncertainty to the model's prediction, where a change in a single word can drop the performance [247]. Prompt tuning alleviates this problem by fine-tuning only 0.001%-3% additional parameters [251]. It concatenates trainable prompt parameters with the model embeddings [247, 40, 251]. Task-specific fixed discrete prompts are concatenated with the input embeddings in [40]. As discrete prompts bring instability, prompts are encoded through a learnable mapping in P-Tuning [247], named continuous prompts, which are appended to the discrete prompts. Only the prompt encoder is trainable in the model. In an extension of P-Tuning, continuous prompts are concatenated with each layer of the network in [251]. Progressive prompts [252] avoid catastrophic forgetting and transfer previously learned knowledge by sequentially adding trainable prompt embeddings to the previously frozen task embeddings.
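A minimal sketch of the soft-prompt idea is shown below: only the prompt parameter would be updated during fine-tuning, while the backbone stays frozen; the sizes are illustrative:

```python
# A minimal sketch of prompt tuning: trainable prompt embeddings are
# prepended to the (frozen) input embeddings; only `self.prompt` is
# updated during fine-tuning.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens=20, d_model=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds):  # input_embeds: (batch, seq, d_model)
        batch = input_embeds.size(0)
        prefix = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)
```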
Prefix Tuning: A set of trainable task-specific prefix vectors is prepended to the frozen transformer layers in prefix tuning [41]. The prefix vectors are virtual tokens attended to by the context tokens on their right. In addition, adaptive prefix tuning [253] applies a gating mechanism to control the information flowing from the prefix and the actual tokens.
Bias Tuning: Fine-tuning only the bias terms has been found effective for small to medium training data in BitFit [254]. This method achieves full fine-tuning performance for tasks with less training data and comparable performance with more training data.

3.6.2. Quantization
LLMs require extensive computing and memory for inference. Deploying a 175B-parameter GPT-3 model needs at least five 80GB A100 GPUs and 350GB of memory to store it in FP16 format [44]. Such demanding requirements for deploying LLMs make it harder for smaller organizations to utilize them. Model compression is an effective solution but comes at the cost of degraded performance, especially at scales greater than 6B. These models exhibit very large magnitude outliers that do not exist in smaller models [255], making quantizing LLMs challenging and requiring specialized methods [44, 256].
Post-Training Quantization: Minimal or no training is required in this type of quantization, without significantly compromising model performance. LLM.int8() [255] uses full-precision matrix multiplication for weights associated with outlier features and 8-bit multiplication for the remaining features. The lower-precision multiplication outputs are converted to FP16 and concatenated with the others. The quantized models have homogeneous word embeddings, which may degrade their performance. To fix this, token-level knowledge distillation is employed in [45], along with independent quantization scaling factors for each module due to varying weight distributions. Feature distributions are asymmetric and appear in different channels; outlier suppression [257] shifts and scales per-channel activation distributions for effective quantization. SmoothQuant [44] quantizes activations and weights to INT8 format by smoothing activations and migrating the quantization difficulty toward the weights. It multiplies the inverse of the smoothing factor with the weights, which introduces a few outliers in the weights but makes them easier to quantize than unsmoothed activations. OPTQ [256] uses the optimal brain compression (OBC) [258] algorithm to quantize the model layer-by-layer and update weights to compensate for quantization error. To improve speed and performance, OPTQ updates weights in arbitrary order, employs lazy updates, and uses better Cholesky kernels. Outlier-aware weight quantization (OWQ) [259] uses the OPTQ algorithm for quantization but assigns higher precision to vulnerable weights causing outliers and lower precision to the others.
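For orientation, the basic round-to-nearest scheme that these methods refine looks as follows; this is a generic absmax sketch, not the exact procedure of any of the cited methods:

```python
# A minimal sketch of round-to-nearest post-training quantization with
# a per-output-channel absmax scale; real methods (SmoothQuant, OPTQ)
# add outlier handling and error compensation on top of this scheme.
import torch

def quantize_int8(weight: torch.Tensor):
    # One scale per output channel, clamped to avoid division by zero.
    scale = (weight.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale
```

The outlier problem discussed above arises exactly here: a single large entry inflates the per-channel scale, crushing the resolution available to all the other weights in that channel.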
Quantization-Aware Training: To compensate for performance degradation, a quantized model is fine-tuned in quantization-aware training (QAT) [260, 261, 262]. AlphaTuning quantizes the model using binary coding quantization (BCQ) [263] and fine-tunes only the quantization scaling factors. This approach improves performance over parameter-efficient fine-tuning of the pre-trained model. Similarly, parameter-efficient and quantization-aware adaptation (PEQA) [264] reduces the precision of fully-connected layers and fine-tunes only the quantization scaling parameters. LLM-QAT [262] generates training data from the pre-trained network and trains a quantized student model with knowledge distillation. QLoRA [261] fine-tunes a 4-bit quantized pre-trained LLM with LoRA [250] using a 4-bit normal float data type, which shows better performance over 4-bit integer and float.

3.6.3. Pruning
Pruning is an alternative to quantization for compressing model size, thereby reducing LLMs' deployment costs significantly. Compared to task-agnostic pruning, task-specific pruning is easily achievable with good performance, where a model is fine-tuned on the downstream task and pruned for faster inference. It is possible to prune LLMs for individual tasks, but the cost of pruning and deploying task-specific models is high. To overcome this, many structured and unstructured pruning methods for LLMs have been proposed to maintain reasonable performance across all tasks while shrinking the model size [265, 42, 266].
Unstructured Pruning: This kind of pruning removes less important weights without maintaining any structure. Existing LLM pruning methods take advantage of a unique characteristic of LLMs, uncommon in smaller models, where a small subset of hidden states is activated with large magnitude [255]. Pruning by weights and activations (Wanda) [265] prunes weights in every row based on importance, calculated by multiplying the weights with the norm of the input. The pruned model does not require fine-tuning, thereby saving computational costs. Outlier weighed layerwise sparsity (OWL) [267] extends Wanda with non-uniform layer pruning. It shows that the number of outliers varies across layers; therefore, the model should have variable pruning ratios for better performance in every layer. Contrastive pruning (CAP) [43] iteratively prunes the model by training the sparse model using a contrastive loss between the pre-trained model, the fine-tuned model, and snapshots of previous sparse models to learn task-specific and task-agnostic knowledge.
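The Wanda importance score is simple enough to sketch; the following implements the |W| times input-norm rule described above with a per-row comparison group, while the calibration activations and the sparsity level are illustrative:

```python
# A minimal sketch of Wanda-style pruning: weight importance is |W|
# times the L2 norm of the corresponding input feature, and the
# lowest-scoring weights are removed per output row, with no fine-tuning.
import torch

def wanda_prune(weight, activations, sparsity=0.5):
    # activations: (n_samples, in_features); weight: (out, in)
    act_norm = activations.norm(p=2, dim=0)      # per input feature
    score = weight.abs() * act_norm              # importance per weight
    k = int(weight.size(1) * sparsity)
    idx = torch.topk(score, k, dim=1, largest=False).indices
    mask = torch.ones_like(weight)
    mask.scatter_(1, idx, 0.0)                   # zero the least important
    return weight * mask
```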
Structured Pruning: Here, the parameters are removed in groups, rows, columns, or matrices, which speeds up inference because of effective hardware tensor core utilization [265]. LLM-Pruner [42] employs a 3-stage structured pruning strategy: identifying groups of hidden states that cause each other to activate during the forward pass, keeping important groups and removing less important ones, and fine-tuning the pruned model with LoRA. Sparsity-induced mask learning (SIMPLE) [268] prunes the network using learnable masks. Similarly, another method prunes LLMs by learning masks and removing unimportant rank-1 components of the factorized weight matrix [266].

3.7. Multimodal LLMs
Inspired by the success of LLMs in natural language processing applications, an increasing number of research works are now facilitating LLMs to perceive different modalities of information like image [269, 270, 271], video [272, 273, 274], audio [275, 274, 276], etc. Multimodal LLMs (MLLMs) present substantial benefits compared to standard LLMs that process only text. By incorporating information from various modalities, MLLMs can achieve a deeper understanding of context, leading to more intelligent responses infused with a variety of expressions. Importantly, MLLMs align closely with human perceptual experiences, leveraging the synergistic nature of our multisensory inputs to form a comprehensive understanding of the world [276, 26]. Coupled with a user-friendly interface, MLLMs can offer intuitive, flexible, and adaptable interactions, allowing users to engage with intelligent assistants through a spectrum of input methods. According to the way the models are constructed, current MLLMs can generally be divided into three streams: pre-training, fine-tuning, and prompting. In this section, we discuss these main streams in more detail, as well as the important application of MLLMs in visual reasoning.
Pre-training: This stream of MLLMs intends to support different modalities using unified end-to-end models. For instance, Flamingo [269] applies gated cross-attention to fuse the vision and language modalities, collected from a pre-trained and frozen visual encoder and LLM, respectively. Moreover, BLIP-2 [270] proposes a two-stage strategy to pre-train a Querying Transformer (Q-Former) for the alignment between the vision and language modalities: in the first stage, vision-language representation learning is bootstrapped from a frozen visual encoder; and in the second stage, a frozen LLM bootstraps vision-to-language generative learning for zero-shot image-to-text generation. Similarly, MiniGPT-4 [277] deploys a pre-trained and frozen ViT [278], Q-Former, and Vicuna LLM [159], training only the linear projection layer for vision and language modality alignment.
Fine-tuning: Derived from instruction tuning [16] for NLP tasks [20, 16, 97], researchers fine-tune pre-trained LLMs using multimodal instructions. Following this method, LLMs can be easily and effectively extended as multimodal chatbots [277, 271, 29] and multimodal task solvers [279, 30, 280]. The key issue for this stream of MLLMs is collecting multimodal instruction-following data for fine-tuning [58]. To address this issue, the solutions of benchmark adaptation [279, 281, 282], self-instruction [19, 31, 283], and hybrid composition [284, 280] are employed, respectively. To mitigate the gap between the original language modality and additional modalities, a learnable interface is introduced to connect the different modalities from frozen pre-trained models. Particularly, the learnable interface is expected to work in a parameter-efficient tuning manner: e.g., LLaMA-Adapter [285] applies an efficient transformer-based adapter module for training, and LaVIN [284] dynamically learns the multimodal feature weights using a mixture-of-modality adapter. Different from the learnable interface, expert models can directly convert multimodal inputs into language: e.g., VideoChat-Text [272] incorporates Whisper [286], a speech recognition expert model, to generate the captions of given videos for the understanding of the following LLMs.
Prompting: Different from the fine-tuning technique that directly updates the model parameters given task-specific datasets, the prompting technique provides certain context, examples, or instructions to the model, fulfilling specialized tasks without changing the model parameters. Since prompting can significantly reduce the need for large-scale multimodal data, this technique is widely used to construct MLLMs. Particularly, to solve multimodal Chain-of-Thought (CoT) problems [103], LLMs are prompted to generate both the reasoning process and the answer given multimodal inputs [287]. On this front, different learning paradigms are exploited in practice: for example, Multimodal-CoT [287] involves two stages of rationale generation and answer inference, where the input of the second stage is a combination of the original input and the output of the first stage; and CoT-PT [288] applies both prompt tuning and specific visual bias to generate a chain of reasoning implicitly. In addition to CoT problems, LLMs can also be prompted with multimodal descriptions and tools, effectively dividing complex tasks into sub-tasks [289, 290].
Visual Reasoning Application: Recent visual reasoning systems [291, 292, 216, 293] tend to apply LLMs for better visual information analysis and visual-language integration. Different from previous works [294, 295] that rely on limited VQA datasets and small-scale neural networks, current LLM-aided methods offer the benefits of stronger generalization ability, emergent ability, and interactivity [58]. To realize visual reasoning with the help of LLMs, prompting and fine-tuning techniques can also be utilized: for example, PointClip V2 [292] applies LLMs to generate 3D-specific prompts, which are encoded as textual features and then combined with visual features for 3D recognition; and GPT4Tools [31] employs LoRA [250] to fine-tune LLMs following tool-related instructions. Serving as a controller [293], decision maker [296], or semantics refiner [291, 297], LLMs significantly facilitate the progress of visual reasoning research.
3.8. Summary and Discussion

3.8.1. Architecture
Due to the gigantic scale of LLMs, minor changes in architecture and training strategies have a big impact on performance and stability. Here, we summarize key architectural modules used in various LLMs that lead to better performance, reduced training time and memory, and better training stability.
Layer Normalization: The performance and training stability of LLMs are affected significantly by layer normalization. Pre-norm, that is, normalizing inputs rather than outputs, is more common among LLMs, stabilizing training [6, 127, 108]. BLOOM [13] and AlexaTM [122] utilize an additional layer normalization before the embedding layer to stabilize the training of large-scale models, although this can negatively impact the model's zero-shot generalization ability [13]. However, another study [33] finds that pre-norm degrades fine-tuned model performance as compared to post-norm, and that there are no stability benefits of pre-norm beyond the 100B scale. Therefore, GLM-130B [33] used deep-norm, a variant of post-norm, for better downstream task performance after fine-tuning.
Positional Encoding: Like other building blocks of the model, positional encoding also affects the performance and training stability of LLMs. BLOOM [13] finds ALiBi outperforming learned and rotary positional encodings. Contrary to this, GLM-130B [33] identifies rotary positional encoding as better than ALiBi. So far, there is no conclusion in the literature about positional encodings.
Parallel Attention: In this type of attention, the feed-forward and attention layers are parallel to each other rather than sequential in a transformer block. It has been shown to reduce training time by 15%. There is no evidence of a performance drop due to this change in the literature, and it is used by the models PaLM [15], GPT-NeoX [118], and CodeGen [140].
Multi-Query Attention: It has shared key and value attention heads in a transformer block, while the query attention heads are projected as usual. This reduces memory usage and speeds up sampling in autoregressive decoding. No performance degradation has been observed with this change, and it makes training efficient, allowing larger batch sizes. Multi-query attention is used in [15, 142].
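A minimal sketch of the shared key/value projection is given below; the dimensions are illustrative, and details such as causal masking and KV caching are omitted:

```python
# A minimal sketch of multi-query attention: queries keep n_heads
# projections while keys and values share a single head, which shrinks
# the KV cache during autoregressive decoding.
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.h, self.d = n_heads, d_model // n_heads
        self.q = nn.Linear(d_model, d_model)
        self.kv = nn.Linear(d_model, 2 * self.d)   # one shared K/V head
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q = self.q(x).view(b, s, self.h, self.d).transpose(1, 2)
        k, v = self.kv(x).split(self.d, dim=-1)    # (b, s, d) each
        att = (q @ k.unsqueeze(1).transpose(-2, -1)) / self.d ** 0.5
        out = att.softmax(-1) @ v.unsqueeze(1)     # broadcast shared head
        return self.out(out.transpose(1, 2).reshape(b, s, -1))
```

With a single K/V head, the per-token cache shrinks by a factor of n_heads, which is the source of the decoding speedup mentioned above.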
Mixture of Experts: This type of architecture enables easily scaling models to trillions of parameters [92, 91]. Only a few experts are activated during computation, making them compute-efficient. The performance of MoE models is better than that of dense models for the same amount of data, and they require less computation during fine-tuning to achieve performance similar to dense models, as discussed in [91]. MoE architectures are less prone to catastrophic forgetting and are therefore more suited to continual learning [92]. Extracting smaller sub-models for downstream tasks is possible without losing any performance, making MoE architectures hardware-friendly [92].
Sparse vs Dense Activated: GPT-3 [6] uses sparse transformers [67], whereas GLaM [91] and PanGu-Σ [92] use MoE [121] architectures to lower computational costs and increase model size and capacity. According to the literature, sparse modules do not degrade the model's performance [67]. However, more experiments are required to verify this statement.

3.8.2. Training Strategies
Training models at a huge scale requires tricks to reduce training costs, avoid loss divergence, and achieve better performance. We summarize and discuss some of these key tricks used in different LLMs.
Mixed Precision: This is a popular method for LLMs to reduce memory usage and improve training efficiency. In mixed precision, forward and backward passes are performed in FP16 format, whereas optimizer states and master weights are kept in FP32 format [120]. A drawback associated with this format change is training instability due to the smaller value range, resulting in loss spikes [33]. An alternative to FP16 is BF16, which has a comparatively larger range and performs precision-sensitive operations like gradient accumulation and softmax in FP32 [13]. BF16 offers better performance and training stability but uses more memory and is supported only on specific hardware, for example, A100 GPUs. Therefore, its adoption in LLMs is limited.
Training Instability: Loss divergence or spiking is a common issue in LLMs that occurs multiple times during training, even in the presence of gradient clipping [15]. To mitigate this problem, many approaches suggest restarting training from an earlier checkpoint [15, 33, 91], skipping 200-500 earlier data batches at the point of divergence [15], and re-shuffling batches [91]. Embedding layer gradient shrink proves to further stabilize training, as the embedding layer's gradient norm is significantly larger than those of the other layers [33]. Another suggestion for improving training stability in larger models is not to use biases in dense and norm layers, as in [15].
Weight Initialization: It plays a significant role in model convergence and training stability. GPT-NeoX [118] initializes feed-forward layers before residuals with 2/(L√d), as in [153], and other layers with the small initialization scheme [298]. This avoids activations growing exponentially with increasing depth. MT-NLG [117] found that higher variance in weight initialization leads to unstable training, hence validating the small initialization scheme [298]. Various models perform random weight initialization, which can cause bad initialization; Galactica [148] suggests a longer warmup to negate the effect.
Learning Rate: A suitable learning rate is important for stable training. It is suggested to use a lower value [13, 15, 124] with warmup and decay (cosine or linear). Usually, the learning rate is within the range 1e-4 to 8e-4. Moreover, MT-NLG (530B) [117] and GPT-NeoX (20B) [118] suggest interpolating learning rates based on the model size using the GPT-3 [6] models, which range between 13B and 175B. This avoids tuning the learning rate hyperparameter.
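The warmup-plus-cosine-decay schedule mentioned here can be sketched in a few lines; the peak rate, minimum rate, and step counts below are illustrative, not taken from any specific model:

```python
# A minimal sketch of linear warmup followed by cosine decay, the
# learning-rate schedule commonly used for LLM pre-training.
import math

def lr_at(step, peak=3e-4, min_lr=3e-5, warmup=2000, total=100_000):
    if step < warmup:
        return peak * step / warmup                      # linear warmup
    progress = (step - warmup) / max(1, total - warmup)  # in [0, 1]
    return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * progress))
```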
3.8.5. Encoder vs Decoder vs Encoder-Decoder
Traditionally, these architectures perform well for different tasks, for example, encoder-only for NLU tasks, decoder-only for NLG, and encoder-decoder for sequence-to-sequence modeling. Encoder-only models are popular for smaller models such as BERT [7], RoBERTa [299], etc., whereas LLMs are either decoder-only [6, 118, 13] or encoder-decoder [10, 11, 122]. While decoder-only models are good at NLG tasks, various LLMs, such as PaLM [15], OPT [14], GPT-3 [6], BLOOM [13], and LLaMA [156], are decoder-only models with significant performance gains on both NLU and NLG tasks. In contradiction to this, T5 [10] and UL2 [125] identify encoder-decoder models as outperforming decoder-only models. In another study, PaLM [15] finds that increasing the size of decoder-only models can reduce the performance gap between decoder-only and encoder-decoder architectures.
Although decoder-only architectures have become the trend for LLMs, many recently proposed approaches [125, 122] use mode-switching tokens in text with encoder-decoder architectures to enable task-specific modes. Similarly, CodeT5+ [34] uses an encoder-decoder architecture with multiple training objectives for different tasks, activating the encoder, decoder, or both according to the task. These variations in architecture and training objectives allow a model to perform well in different settings. Because of this dynamic configuration, the future of LLMs may well lie with encoder-decoder architectures.
4. Model Configurations
We provide different statistics of pre-trained and instruction-tuned models in this section. This includes information such as publication venue, license type, model creators, steps trained, parallelism, etc., in Table 3 and Table 4. Architecture details of pre-trained LLMs are available in Table 5. Providing these details for instruction-tuned models is unnecessary, because instruction tuning fine-tunes pre-trained models on instruction datasets; hence, the architectural details are the same as those of the baselines. Moreover, optimization settings for various LLMs are available in Table 6 and Table 7. We do not include details on precision, warmup, and weight decay in Table 7, as these details are less important for instruction-tuned models and are often not provided in the papers.
Table 3: Summary of pre-trained LLMs (>10B). Only the LLMs discussed individually in the previous sections are summarized. “Data/Tokens” is the model's pre-training data, which is either the number of tokens or the data size. “Data Cleaning” indicates whether data cleaning is performed or not; this includes heuristics (Heur), deduplication (Dedup), quality filtering (QF), and privacy filtering (PF). “Cost” is the calculated training cost obtained by multiplying the GPU/TPU hourly rate by the number of GPUs and the training time; the actual cost may vary due to many reasons, such as using in-house GPUs, getting a discounted rate, re-training, or the number of employees working on the problem. “Training Parallelism” indicates distributed training using data parallelism (D), tensor parallelism (T), pipeline parallelism (P), context parallelism (C), model parallelism (M), optimizer parallelism (OP), and rematerialization (R). In the “Library” column, “DS” is short for DeepSpeed.
In column “Commercial Use”, we assumed a model is for non-commercial purposes if its license is unavailable. -ModelsPublication -VenueLicense -TypeModel -Creators PurposeNo. of -ParamsCommercial -UseSteps -TrainedData / -TokensData -CleaningNo. of -Processing UnitsProcessing -Unit TypeTraining -TimeCalculated -Train. CostTraining -Parallelism Library -T5 [10] JMLR'20 Apache-2.0 Google General 11B ✓ 1M 1T Heur+Dedup 1024 TPU v3 - - D+M Mesh TensorFlow -GPT-3 [6] NeurIPS'20 - OpenAI General 175B× - 300B Dedup +QF - V100 - - M - -mT5 [11] NAACL'21 Apache-2.0 Google General 13B ✓ 1M 1T - - - - - - - -PanGu-α[108] arXiv'21 Apache-2.0 Huawei General 200B ✓ 260k 1.1TB Heur+Dedup 2048 Ascend 910 - - D+OP+P+O+R MindSpore -CPM-2 [12] AI Open'21 MIT Tsinghua General 198B ✓ 1M 2.6TB Dedup - - - - D+M JAXFormer -Codex [141] arXiv'21 - OpenAI Coding 12B× - 100B Heur - - - - - - -ERNIE 3.0 [110] arXiv'21 - Baidu General 10B× 120k∗375B Heur+Dedup 384 V100 - - M∗PaddlePaddle -Jurassic-1 [112] White-Paper'21 Apache-2.0 AI21 General 178B ✓ - 300B - 800 GPU - - D+M+P Megatron +DS -HyperCLOV A [114] EMNLP'21 - Naver General 82B× - 300B Clf+Dedup +PF 1024 A100 321h 1.32 Mil M Megatron -Yuan 1.0 [115] arXiv'21 Apache-2.0 - General 245B ✓ 26k∗180B Heur+Clf+Dedup 2128 GPU - - D+T+P - -Gopher [116] arXiv'21 - Google General 280B× - 300B QF+Dedup 4096 TPU v3 920h 13.19 Mil D+M JAX+Haiku -ERNIE 3.0 Titan [35] arXiv'21 - Baidu General 260B× - 300B Heur+Dedup - Ascend 910 - - D+M+P+D* PaddlePaddle -GPT-NeoX-20B [118] BigScience'22 Apache-2.0 EleutherAI General 20B ✓ 150k 825GB None 96 40G A100 - - M Megatron +DS+PyTorch -OPT [14] arXiv'22 MIT Meta General 175B ✓ 150k 180B Dedup 992 80G A100 - - D+T Megatron -BLOOM [13] arXiv'22 RAIL-1.0 BigScience General 176B ✓ - 366B Dedup +PR 384 80G A100 2520h 3.87 Mil D+T+P Megatron +DS -Galactica [148] arXiv'22 Apache-2.0 Meta Science 120B× 225k 106B Dedup 128 80GB A100 - - - Metaseq -GLaM [91] ICML'22 - Google General 1.2T× 600k∗600B Clf 1024 TPU v4 - - M GSPMD -LaMDA [150] arXiv'22 - Google Dialog 137B× 3M 2.81T Filtered 1024 TPU v3 1384h 4.96 Mil D+M Lingvo -MT-NLG [117] arXiv'22 Apache-v2.0 MS.+Nvidia General 530B× - 270B - 4480 80G A100 - - D+T+P Megatron +DS -AlphaCode [142] Science'22 Apache-v2.0 Google Coding 41B ✓ 205k 967B Heur+Dedup - TPU v4 - - M JAX+Haiku -Chinchilla [96] arXiv'22 - Google General 70B× - 1.4T QF+Dedup - TPUv4 - - - JAX+Haiku -PaLM [15] arXiv'22 - Google General 540B× 255k 780B Heur 6144 TPU v4 - - D+M JAX+T5X -AlexaTM [122] arXiv'22 Apache v2.0 Amazon General 20B× 500k 1.1T Filtered 128 A100 2880h 1.47 Mil M DS -U-PaLM [124] arXiv'22 - Google General 540B× 20k - - 512 TPU v4 120h 0.25 Mil - - -UL2 [125] ICLR'23 Apache-2.0 Google General 20B ✓ 2M 1T - 512 TPU v4 - - M JAX+T5X -GLM [33] ICLR'23 Apache-2.0 Multiple General 130B× - 400B - 768 40G A100 1440h 3.37 Mil M - -CodeGen [140] ICLR'23 Apache-2.0 Salesforce Coding 16B ✓ 650k 577B Heur+Dedup - TPU v4 - - D+M JAXFormer -LLaMA [127] arXiv'23 - Meta General 65B× 350k 1.4T Clf+Heur+Dedup 2048 80G A100 504h 4.12 Mil D+M xFormers -PanGu Σ[92] arXiv'23 - Huawei General 1.085T× - 329B - 512 Ascend 910 2400h - D+OP+P+O+R MindSpore -BloombergGPT [151] arXiv23 - Bloomberg Finance 50B× 139k 569B Dedup 512 40G A100 1272h 1.97 Mil M PyTorch -Xuan Yuan 2.0 [152] arXiv23 RAIL-1.0 Du Xiaoman Finance 176B ✓ - 366B Filtered - 80GB A100 - - P DS -CodeT5 +[34] arXiv'23 BSD-3 Salesforce Coding 16B ✓ 110k 51.5B Dedup 16 40G A100 - - - DS -StarCoder [147] arXiv'23 OpenRAIL-M BigCode Coding 15.5B ✓ 250k 1T Dedup +QF+PF 512 
80G A100 624h 1.28 Mil D+T+P Megatron-LM
LLaMA-2 [21] arXiv'23 LLaMA-2.0 Meta General 70B ✓ 500k 2T Minimal Filtering - 80G A100 1.7Mh - - -
PaLM-2 [123] arXiv'23 - Google General - × - - Dedup+PF+QF - - - - - -
LLaMA-3.1 [130] arXiv'24 LLaMA-3.0 Meta General 405B ✓ 1.2M 15T Dedup+QF 16k 80G H100 30.84Mh - D+T+P+C PyTorch
Mixtral 8x22B [131] web'24 Apache-2.0 Mistral AI General 141B ✓ - - - - - - - - -
Snowflake Arctic [132] web'24 Apache-2.0 Snowflake General 480B ✓ - 3.5T - - - - - T+P DS
Nemotron-4 340B [137] web'24 Nvidia Nvidia General 340B ✓ - 9T - 6144 80G H100 - - D+T+P -
DeepSeek [138] arXiv'24 MIT DeepSeek General 67B ✓ - 2T Dedup+QF - - 300.6Kh - D+T+P DS
DeepSeek-v2 [139] arXiv'24 MIT DeepSeek General 67B ✓ - 8.1T QF - H800 172.8Kh - D+P HAI-LLM
Table 4: Summary of instruction-tuned LLMs (>10B). All abbreviations are the same as in Table 3. Entries in “Data/Tokens” starting with “S-” represent the number of training samples.
Models | Publication Venue | License Type | Model Creators | Purpose | No. of Params | Commercial Use | Pre-trained Models | Steps Trained | Data/Tokens | No. of Processing Units | Processing Unit Type | Train. Time | Calculated Train. Cost | Train. Parallelism | Library
WebGPT [166] arXiv'21 - OpenAI General 175B × GPT-3 - - - - - - - -
T0 [17] ICLR'22 Apache-2.0 BigScience General 11B ✓ T5 - 250B 512 TPU v3 270h 0.48 Mil - -
Tk-Instruct [18] EMNLP'22 MIT AI2+ General 11B ✓ T5 1000 - 256 TPU v3 4h 0.0036 Mil - Google T5
OPT-IML [97] arXiv'22 - Meta General 175B × OPT 8k 2B 128 40G A100 - - D+T Megatron
Flan-U-PaLM [16] ICLR'22 Apache-2.0 Google General 540B ✓ U-PaLM 30k - 512 TPU v4 - - - JAX+T5X
mT0 [154] ACL'23 Apache-2.0 HuggingFace+ General 13B ✓ mT5 - - - - - - - -
Sparrow [167] arXiv'22 - Google Dialog 70B × Chinchilla - - 64 TPU v3 - - M -
WizardCoder [164] arXiv'23 Apache-2.0 HK Bapt. Coding 15B × StarCoder 200 S-78k - - - - - -
Alpaca [158] Github'23 Apache-2.0 Stanford General 13B ✓ LLaMA 3-Epoch S-52k 8 80G A100 3h 600 FSDP PyTorch
Vicuna [159] Github'23 Apache-2.0 LMSYS General 13B ✓ LLaMA 3-Epoch S-125k - - - - FSDP PyTorch
LIMA [185] arXiv'23 - Meta+ General 65B - LLaMA 15-Epoch S-1000 - - - - - -
Koala [300] Github'23 Apache-2.0 UC-Berkeley General 13B × LLaMA 2-Epoch S-472k 8 A100 6h 100 - JAX/FLAX
5. Datasets and Evaluation
Generating training and evaluation datasets is expensive because of the large-scale data demand of LLMs. Hence, datasets for training and benchmarking these models are topics of key importance. A summary of the datasets commonly used by LLMs is provided next.
5.1. Training Datasets
The performance of LLMs largely depends on the training data's quality, size, and diversity. Preparing high-quality training datasets at a large scale is laborious. Researchers have suggested various pre-training and fine-tuning datasets to enhance LLMs' capabilities. We summarize these efforts in Table 8. While numerous training datasets are available in the literature, we cover the most widely used ones in our summary.
Table 5: Architecture details of LLMs. Here, “PE” is the positional embedding, “nL” is the number of layers, “nH” is the number of attention heads, and “HS” is the size of the hidden states.
-Models TypeTraining -ObjectiveAttention Vocab Tokenizer Norm PE Activation Bias nL nH HS -T5 (11B) Enc-Dec Span Corruption Standard 32k SentencePiece Pre-RMS Relative ReLU× 24 128 1024 -GPT3 (175B) Causal-Dec Next Token Dense +Sparse - - Layer Learned GeLU ✓ 96 96 12288 -mT5 (13B) Enc-Dec Span Corruption Standard 250k SentencePiece Pre-RMS Relative ReLU - - - - -PanGu-α(200B) Causal-Dec Next Token Standard 40k BPE Layer - - - 64 128 16384 -CPM-2 (198B) Enc-Dec Span Corruption Standard 250k SentencePiece Pre-RMS Relative ReLU - 24 64 - -Codex (12B) Causal-Dec Next Token Standard - BPE+ Pre-Layer Learned GeLU - 96 96 12288 -ERNIE 3.0 (10B) Causal-Dec Next Token Standard - WordPiece Post-Layer Relative GeLU - 48 64 4096 -Jurassic-1 (178B) Causal-Dec Next Token Standard 256k SentencePiece∗Pre-Layer Learned GeLU ✓ 76 96 13824 -HyperCLOV A (82B) Causal-Dec Next Token Dense +Sparse - BPE* Pre-Layer Learned GeLU - 64 80 10240 -Yuan 1.0 (245B) Causal-Dec Next Token Standard - - - - - - 76 -16384 -Gopher (280B) Causal-Dec Next Token Standard 32k SentencePiece Pre-RMS Relative GeLU ✓ 80 128 16384 -ERNIE 3.0 Titan (260B) Causal-Dec Next Token Standard - WordPiece Post-Layer Relative GeLU - 48 192 12288 -GPT-NeoX-20B Causal-Dec Next Token Parallel 50k BPE Layer Rotary GeLU ✓ 44 64 - -OPT (175B) Causal-Dec Next Token Standard - BPE - - ReLU ✓ 96 96 - -BLOOM (176B) Causal-Dec Next Token Standard 250k BPE Layer ALiBi GeLU ✓ 70 112 14336 -Galactica (120B) Causal-Dec Next Token Standard 50k BPE+custom Layer Learned GeLU× 96 80 10240 -GLaM (1.2T) MoE-Dec Next Token Standard 256k SentencePiece Layer Relative GeLU ✓ 64 128 32768 -LaMDA (137B) Causal-Dec Next Token Standard 32k BPE Layer Relative GeGLU - 64 128 8192 -MT-NLG (530B) Causal-Dec Next Token Standard 50k BPE Pre-Layer Learned GeLU ✓ 105 128 20480 -AlphaCode (41B) Enc-Dec Next Token Multi-query 8k SentencePiece - - - - 64 128 6144 -Chinchilla (70B) Causal-Dec Next Token Standard 32k SentencePiece-NFKC Pre-RMS Relative GeLU ✓ 80 64 8192 -PaLM (540B) Causal-Dec Next Token Parallel +Multi-query 256k SentencePiece Layer RoPE SwiGLU×118 48 18432 -AlexaTM (20B) Enc-Dec Denoising Standard 150k SentencePiece Pre-Layer Learned GeLU ✓ 78 32 4096 -Sparrow (70B) Causal-Dec Pref.&Rule RM - 32k SentencePiece-NFKC Pre-RMS Relative GeLU ✓ 16∗64 8192 -U-PaLM (540B) Non-Causal-Dec MoD Parallel +Multi-query 256k SentencePiece Layer RoPE SwiGLU×118 48 18432 -UL2 (20B) Enc-Dec MoD Standard 32k SentencePiece - - - - 64 16 4096 -GLM (130B) Non-Causal-Dec AR Blank Infilling Standard 130k SentencePiece Deep RoPE GeGLU ✓ 70 96 12288 -CodeGen (16B) Causal-Dec Next Token Parallel - BPE Layer RoPE - - 34 24 - -LLaMA (65B) Causal-Dec Next Token Standard 32k BPE Pre-RMS RoPE SwiGLU - 80 64 8192 -PanGu- Σ(1085B) Causal-Dec Next Token Standard - BPE Fused Layer - FastGeLU - 40 40 5120 -BloombergGPT (50B) Causal-Dec Next Token Standard 131k Unigram Layer ALiBi GeLU ✓ 70 40 7680 -Xuan Yuan 2.0 (176B) Causal-Dec Next Token Self 250k BPE Layer ALiBi GeLU ✓ 70 112 14336 -CodeT5 +(16B) Enc-Dec SC+NT+Cont. 
+Match Standard - Code-Specific - - - - - - -
StarCoder (15.5B) Causal-Dec FIM Multi-query 49k BPE - Learned - - 40 48 6144
LLaMA-2 (70B) Causal-Dec Next Token Grouped-query 32k BPE Pre-RMS RoPE SwiGLU - - - -
PaLM-2 - MoD Parallel - - - - - - - - -
LLaMA-3.1 (405B) Causal-Dec Next Token Grouped-query 128k BPE Pre-RMS RoPE SwiGLU - 126 128 16384
Nemotron-4 (340B) Causal-Dec Next Token Standard 256k SentencePiece - RoPE ReLU × 96 96 18432
DeepSeek (67B) Causal-Dec Next Token Grouped-query 100k BBPE Pre-RMS RoPE SwiGLU - 95 64 8192
DeepSeek-v2 (67B) MoE-Dec Next Token Multi-Head Latent 100k BBPE Pre-RMS RoPE SwiGLU - 60 128 5120
5.2. Evaluation Datasets and Tasks
The evaluation of LLMs is important in gauging their proficiency and limitations. This process measures the model's ability to comprehend, generate, and interact with human language across a spectrum of tasks. Evaluating a language model (LM) is divided into two broader categories: 1) natural language understanding (NLU) and 2) natural language generation (NLG). It should be emphasized that tasks in NLU and NLG are softly categorized and are often used interchangeably in the literature.
Natural Language Understanding: This measures the language understanding capacity of LMs. It encompasses multiple tasks, including sentiment analysis, text classification, natural language inference (NLI), question answering (QA), commonsense reasoning (CR), mathematical reasoning (MR), reading comprehension (RC), etc.
Natural Language Generation: This assesses the language generation capabilities of LLMs by understanding the provided input context. It includes tasks such as summarization, sentence completion, machine translation (MT), dialogue generation, etc.
Numerous datasets have been proposed for each task, evaluating LLMs against different characteristics. To provide an overview of evaluation datasets, we briefly discuss a few famous datasets within each category and offer a comprehensive list of datasets in Table 9. Moreover, we show a detailed overview of the training datasets and the evaluation tasks and benchmarks used by various pre-trained LLMs in Table 10 and fine-tuned LLMs in Table 11. We also compare the top-performing LLMs on various NLP tasks in Table 12.
5.2.1. Multi-task
MMLU [307]: A benchmark that measures the knowledge acquired by models during pretraining and evaluates models in zero-shot and few-shot settings across 57 subjects, testing both world knowledge and problem-solving ability.
SuperGLUE [2]: A more challenging and diverse successor to the GLUE [309] benchmark, SuperGLUE includes a variety of language understanding tasks, such as question answering, natural language inference, and co-reference resolution. It is designed to provide a rigorous test of language understanding and requires significant progress in areas like sample-efficient, transfer, multi-task, and unsupervised or self-supervised learning.
BIG-bench [308]: BIG-bench (Beyond the Imitation Game Benchmark) is a large-scale benchmark designed to test the abilities of LLMs across a wide range of tasks, including reasoning, creativity, ethics, and understanding of specific domains.
GLUE [309]: The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding
Table 6: Summary of optimization settings used for pre-trained LLMs.
The values for weight decay, gradient clipping, and dropout are 0.1, 1.0, and 0.1, respectively, -for most of the LLMs. -Sequence LR Optimizers Precision Weight Grad -Models Batch Size Length LR Warmup Decay AdaFactor Adam AdamW FP16 BF16 Mixed Decay Clip Dropout -T5 (11B) 211512 0.01× inverse square root ✓ - - - - - ✓ -GPT3 (175B) 32K - 6e-5 ✓ cosine ✓ ✓ ✓ ✓ - -mT5 (13B) 1024 1024 0.01 - inverse square root ✓ - - - - - ✓ -PanGu-α(200B) - 1024 2e-5 - - - - - -✓ - - - - -CPM-2 (198B) 1024 1024 0.001 - - ✓ - - - - - ✓ -Codex (12B) - - 6e-5 ✓ cosine ✓ ✓ ✓ - - -ERNIE 3.0 (12B) 6144 512 1e-4 ✓ linear ✓ - - - ✓ - - -Jurassic-1 (178B) 3.2M 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ - -HyperCLOV A (82B) 1024 - 6e-5 - cosine ✓ - - - ✓ - - -Yuan 1.0 (245B) <10M 2048 1.6e-4 ✓ cosine decay to 10% ✓ - - - ✓ - - -Gopher (280B) 3M 2048 4e-5 ✓ cosine decay to 10% ✓ ✓ - ✓ - -ERNIE 3.0 Titan (260B) - 512 1e-4 ✓ linear ✓ ✓ ✓ ✓ - -GPT-NeoX-20B 1538 2048 0.97e-5 ✓ cosine ✓ ✓ ✓ ✓× -OPT (175B) 2M 2048 1.2e-4 - linear ✓ ✓ ✓ ✓ ✓ -BLOOM (176B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓× -Galactica (120B) 2M 2048 7e-6 ✓ linear decay to 10% ✓ - - - ✓ ✓ ✓ -GLaM (1.2T) 1M 1024 0.01 - inverse square root ✓ FP32 +✓ - ✓× -LaMDA (137B) 256K - - - - - - - - - - - - - -MT-NLG (530B) 1920 2048 5e-5 ✓ cosine decay to 10% ✓ ✓ ✓ ✓ - -AlphaCode (41B) 2048 1536+768 1e-4 ✓ cosine decay to 10% ✓ ✓ ✓ ✓ - -Chinchilla (70B) 1.5M 2048 1e-4 ✓ cosine decay to 10% ✓ ✓ - - - -PaLM (540B) 2048 2048 0.01 - inverse square root ✓ - - - ✓ ✓× -AlexaTM (20B) 2M 1024 1e-4 - linear decay to 5% ✓ ✓ ✓ - ✓ -U-PaLM (540B) 32 2048 1e-4 - cosine ✓ - - - - - - -UL2 (20B) 1024 1024 - - inverse square root - - - - - -× - - -GLM (130B) 4224 2048 8e-5 ✓ cosine ✓ ✓ ✓ ✓ ✓ -CodeGen (16B) 2M 2048 5e-5 ✓ cosine ✓ - - - ✓ ✓ - -LLaMA (65B) 4M Tokens 2048 1.5e-4 ✓ cosine decay to 10% ✓ - - - ✓ ✓ - -PanGu- Σ(1.085T) 512 1024 2e-5 ✓ - ✓ ✓ - - - -BloombergGPT (50B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓× -Xuan Yuan 2.0 (176B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ - -CodeT5 +(16B) 2048 1024 2e-4 - linear ✓ ✓ ✓ - - -StarCoder (15.5B) 512 8k 3e-4 ✓ cosine ✓ ✓ ✓ - - -LLaMA-2 (70B) 4M Tokens 4k 1.5e-4 ✓ cosine ✓ ✓ ✓ ✓ - -LLaMA-3.1 (405B) 16M 8192 8e-5 ✓ linear +cosine ✓ ✓ - - - -Nemotron-4 (340B) 2304 4096 - - linear - - - ✓ - -× -DeepSeek (67B) 4608 4096 3.2e-4 ✓ cosine ✓ ✓ ✓ ✓ - -DeepSeek-v2 (67B) 9216 4k 2.4e-4 ✓ step-decay ✓ - - - ✓ ✓ - -Table 7: Summary of optimization settings used for instruction-tuned LLMs. Values for gradient clipping and dropout are the same as the pre-trained models, while -no model uses weight decay for instruction tuning. -Sequence Optimizers Grad -Models Batch Size Length LR Warmup LR_Decay AdaFactor Adam AdamW Clip Dropout -WebGPT (175B) BC:512, RM:32 -6e-5 - - ✓ - - -T0 (11B) 1024 1280 1e-3 - - ✓ - ✓ -Tk-Instruct (11B) 1024 -1e-5 - constant - - - - - -OPT-IML (175B) 128 2048 5e-5× linear ✓ ✓ ✓ -Flan-U-PaLM (540B) 32 -1e-3 - constant ✓ - ✓ -Sparrow (70B) RM: 8 +16, RL:16 -2e-6 ✓ cosine decay to 10% ✓ ✓× -WizardCoder (15B) 512 2048 2e-5 ✓ cosine - - - - - -Alpaca (13B) 128 512 1e-5 ✓ cosine - - ✓ ✓× -Vicuna (13B) 128 -2048 2e-5 ✓ cosine ✓ -× -LIMA (65B) 32 2048 1e-5× linear ✓ - ✓ -systems. It includes a variety of tasks that test a wide range of -linguistic phenomena, making it a comprehensive tool for eval- -uating language understanding in AI. -5.2.2. 
Language Understanding
WinoGrande [354]: A large-scale dataset inspired by the original Winograd Schema Challenge [357], WinoGrande tests models on their ability to resolve pronoun ambiguity and encourages the development of models that understand broad context in natural language text.
CoQA [316]: A conversational question-answering dataset, CoQA challenges models with questions that rely on conversation history and require free-form text answers. Its diverse content from seven domains makes it a rigorous test of models' ability to handle a wide range of topics and conversational contexts.
WiC [317]: This dataset assesses a model's ability to discern word meanings based on context, aiding in tasks related to Word Sense Disambiguation.
Wikitext103 [318]: With over 100 million tokens from Wikipedia's top articles, this dataset is a rich resource for tasks that require understanding long-term dependencies, such as language modeling and translation.
PG19 [319]: This is a digital library of diverse books from Project Gutenberg. It is specifically designed to facilitate research in unsupervised learning and language modeling, with a special focus on long-form content.
C4 [10]: A clean, multilingual dataset, C4 offers billions of tokens from web-crawled data. It is a comprehensive resource for training advanced Transformer models on various languages.
LCQMC [320]: The Large-scale Chinese Question Matching Corpus (LCQMC) is a dataset for evaluating the performance of models in semantic matching tasks. It contains pairs of questions in Chinese and their matching status, making it a valuable resource for research in Chinese language understanding.
Table 8: Details of various well-known pre-training and fine-tuning datasets. Here, alignment means aligning with human preferences.
Dataset | Type | Size/Samples | Tasks | Source | Creation | Comments
C4 [10] Pretrain 806GB - Common Crawl Automated A clean, multilingual dataset with billions of tokens
mC4 [11] Pretrain 38.49TB - Common Crawl Automated A multilingual extension of the C4 dataset; mC4 identifies over 100 languages using cld3 from 71 monthly web scrapes of Common Crawl.
PILE [301] Pretrain 825GB - Common Crawl, PubMed Central, OpenWebText2, ArXiv, GitHub, Books3, and others Automated A massive dataset comprised of 22 constituent sub-datasets
ROOTs [302] Pretrain 1.61TB - 498 Hugging Face datasets Automated 46 natural and 13 programming languages
MassiveText [116] Pretrain 10.5TB - MassiveWeb, Books, News, Wikipedia, Github, C4 Automated 99% of the data is in English
Wikipedia [303] Pretrain - - Wikipedia Automated Dump of Wikipedia
RedPajama [304] Pretrain 5TB - CommonCrawl, C4, Wikipedia, Github, Books, StackExchange Automated Open-source replica of the LLaMA dataset
PushShift.io Reddit Pretrain 21.1GB - Reddit Automated Submissions and comments on Reddit from 2005 to 2019
BigPython [140] Pretrain 5.5TB Coding GitHub Automated -
Pool of Prompt (P3) [17] Instructions 12M 62 PromptSource Manual A subset of PromptSource, created from 177 datasets including summarization, QA, classification, etc.
xP3 [154] Instructions 81M 71 P3+Multilingual datasets Manual Extending P3 to a total of 46 languages
Super-NaturalInstructions (SNI) [18] Instructions 12.4M 1616 Multiple datasets Manual Extending P3 with additional multilingual datasets, total 46 languages
Flan [16] Instructions 15M 1836 Muffin+T0-SF+NIV2 Manual Total 60 languages
OPT-IML [97] Instructions 18.1M 1667 - Manual -
Self-Instruct [19] Instructions 82k 175 - Automated Generated 52k instructions with 82k samples from 175 seed tasks using GPT-3
Alpaca [158] Instructions 52k - - Automated Employed the self-instruct method to generate data from text-davinci-003
Vicuna [159] Instructions 125k - ShareGPT Automated Conversations shared by users on ShareGPT using public APIs
LLaMA-GPT-4 [160] Instructions 52k - Alpaca Automated Recreated the Alpaca dataset with GPT-4 in English and Chinese
Unnatural Instructions [305] Instructions 68k - 15-Seeds (SNI) Automated -
LIMA [185] Instructions 1k - Multiple datasets Manual Carefully created samples to test performance with fine-tuning on less data
Anthropic-HH-RLHF [306] Alignment 142k - - Manual -
Anthropic-HH-RLHF-2 [178] Alignment 39k - - Manual -
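Since most of the pre-training corpora in Table 8 are created automatically and rely heavily on deduplication (as the “Data Cleaning” entries in Table 3 also show), the following minimal sketch illustrates the simplest form of it: exact deduplication by hashing normalized documents. Real pipelines, such as those behind C4 and MassiveText, additionally apply fuzzy methods (e.g., MinHash) and quality filters, which are omitted here.

import hashlib
import re

def normalize(doc: str) -> str:
    # Lowercase and collapse whitespace/punctuation so trivial variants collide.
    return re.sub(r"\W+", " ", doc.lower()).strip()

def exact_dedup(docs):
    """Keep only the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

corpus = ["Hello,  world!", "hello world", "Another document."]
print(exact_dedup(corpus))   # the second string is dropped as a duplicate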
5.2.3. Story Cloze and Sentence Completion
StoryCloze [334]: This introduces the “Story Cloze Test”, a commonsense reasoning framework for evaluating story understanding, generation, and script learning. It considers a model's ability to understand and generate coherent and sensible stories.
LAMBADA [335]: This dataset evaluates contextual text understanding through a word prediction task. Models must predict the last word of a passage, which is easy for humans when given the whole passage, but not when given only the last sentence.
5.2.4. Physical Knowledge and World Understanding
PIQA [340]: A dataset that probes the physical knowledge of models, aiming to understand how well they are learning about the real world.
TriviaQA [341]: A dataset that tests models on reading comprehension and open-domain question answering (QA) tasks, with a focus on Information Retrieval (IR)-style QA.
ARC [342]: A larger version of ARC-Challenge, this dataset contains both easy and challenging grade-school-level, multiple-choice science questions. It is a comprehensive test of a model's ability to understand and answer complex questions.
ARC-Easy [342]: A subset of the ARC dataset, ARC-Easy contains questions that are answered correctly by either a retrieval-based algorithm or a word co-occurrence algorithm.
Table 9: Categorized evaluation datasets used in evaluating LLMs.
-Type Datasets /Benchmarks -Multi-Task MMLU [307], SuperGLUE [2], BIG-bench [308], GLUE [309], BBH [308], CUGE [310], Zero- -CLUE [311], FewCLUE [312], Blended Skill Talk [313], HELM [314], KLUE-STS [315] -Language Understanding CoQA [316], WiC [317], Wikitext103 [318], PG19 [319], LCQMC [320], QQP [321], WinoGender [322], -CB [323], FinRE [324], SanWen [325], AFQMC [311], BQ Corpus [326], CNSS [327], CKBQA 13 [328], -CLUENER [311], Weibo [329], AQuA [330], OntoNotes [331], HeadQA [332], Twitter Dataset [333] -Story Cloze and -Sentence CompletionStoryCloze [334], LAMBADA [335], LCSTS [336], AdGen [337], E2E [338], CHID [339], CHID- -FC [312] -Physical Knowledge and -World UnderstandingPIQA [340], TriviaQA [341], ARC [342], ARC-Easy [342], ARC-Challenge [342], PROST [343], Open- -BookQA [344], WebNLG [345], DogWhistle Insider & Outsider [346] -Contextual Language -UnderstandingRACE [347], RACE-Middle [347], RACE-High [347], QuAC [348], StrategyQA [349], Quiz Bowl [350], -cMedQA [351],cMedQA2 [352], MATINF-QA [353] -Commonsense Reasoning WinoGrande [354], HellaSwag [355], COPA [356], WSC [357], CSQA [358], SIQA [359], C3[360], -CLUEWSC2020 [311], CLUEWSC [311], CLUEWSC-FC [312], ReCoRD [361] -Reading Comprehension SQuAD [362], BoolQ [363], SQUADv2 [364], DROP [365], RTE [366], WebQA [367], CMRC2017 [368], -CMRC2018 [369], CMRC2019 [370], COTE-BD [371], COTE-DP [371], COTE-MFW [371], Mul- -tiRC [372], Natural Questions [373], CNSE [327], DRCD [374], DuReader [375], Dureader robust [376], -DuReader-QG [375], SciQ [377], Sogou-log [378], Dureader robust-QG [376], QA4MRE [379], KorQuAD -1.0 [380], CAIL2018-Task1 & Task2 [381] -Mathematical Reasoning MATH [382], Math23k [383], GSM8K [384], MathQA [385], MGSM [386], MultiArith [387], AS- -Div [388], MAWPS [389], SV AMP [390] -Problem Solving HumanEval [141], DS-1000 [391], MBPP [392], APPS [382], CodeContests [142] -Natural Language Inference -& Logical ReasoningANLI [393], MNLI-m [394], MNLI-mm [394],QNLI [362], WNLI [357], OCNLI [311], CMNLI [311], -ANLI R1 [393], ANLI R2 [393], ANLI R3 [393], HANS [395], OCNLI-FC [312], LogiQA [396], Strate- -gyQA [349] -Cross-Lingual Understanding MLQA [397], XNLI [398], PAWS-X [399], XSum [400], XCOPA [401], XWinograd [402], TyDiQA- -GoldP [403], MLSum [404] -Truthfulness and Fact Checking TruthfulQA [405], MultiFC [406], Fact Checking on Fever [407] -Biases and Ethics in AI ETHOS [408], StereoSet [409], BBQ [410], Winobias [411], CrowS-Pairs [412] -Toxicity RealToxicityPrompts [413], CivilComments toxicity classification [414] -Language Translation WMT [415], WMT20 [416], WMT20-enzh [416], EPRSTMT [312], CCPM [417] -Scientific Knowledge AminoProbe [148], BioLAMA [148], Chemical Reactions [148], Galaxy Clusters [148], Mineral -Groups [148] -Dialogue Wizard of Wikipedia [418], Empathetic Dialogues [419], DPC-generated [96] dialogues, ConvAI2 [420], -KdConv [421] -Topic Classification TNEWS-FC [312], YNAT [315], KLUE-TC [315], CSL [311], CSL-FC [312], IFLYTEK [422] -It is a great starting point for models beginning to explore ad- -vanced question-answering. -ARC-Challenge [342]: A rigorous question-answering -dataset, ARC-Challenge includes complex, grade-school level -questions that demand reasoning beyond simple retrieval, test- -ing the true comprehension capabilities of models. -5.2.5. 
Contextual Language Understanding
RACE [347]: The RACE dataset is a reading comprehension dataset collected from English examinations in China, which benchmarks AI models on understanding and answering questions about long and complex passages, simulating the challenge of a real-world examination.
RACE-Middle [347]: Another subset of the RACE [347] dataset, RACE-Middle contains middle-school-level English exam questions. It offers a slightly less challenging but academically oriented evaluation of a model's comprehension skills.
RACE-High [347]: A subset of the RACE [347] dataset, RACE-High consists of high-school-level English exam questions. It is designed to evaluate the comprehension ability of models in a more academic and challenging context.
QuAC [348]: This dataset simulates an information-seeking dialog between students and teachers using hidden Wikipedia text. It introduces unique challenges not found in machine comprehension datasets, making it a valuable resource for advancing dialog systems.
5.2.6. Commonsense Reasoning
HellaSwag [355]: A dataset that challenges models to pick the best ending to a context, HellaSwag uses Adversarial Filtering to create a ‘Goldilocks’ zone of complexity, where generated text is absurd to humans but often misclassified by models.
COPA [356]: This dataset evaluates a model's progress in open-domain commonsense causal reasoning. Each question comprises a premise and two alternatives, and the model must select the more plausible alternative, testing a model's ability to understand and reason about cause and effect.
WSC [357]: The Winograd Schema Challenge (WSC) is a
Table 10: An illustration of the training datasets and evaluation tasks employed by pre-trained LLMs. Here, “QA” is question-answering, “Clf” is classification, “NLI” is natural language inference, “MT” is machine translation, “RC” is reading comprehension, “CR” is commonsense reasoning, “MR” is mathematical reasoning, and “Mem.” is memorization.
Models | Training Dataset | Benchmarks: BIG-bench, MMLU, SuperGLUE, QA, Clf, NLI, MT, Cloze/Completion, RC, CR, MR, Coding, Truthful/Bias/Toxicity/Mem.
-T5 C4 [10] ✓ ✓ ✓✓ ✓ ✓✓✓ -GPT-3 Common Crawl, WebText, Books Cor- -pora, Wikipedia✓ ✓ ✓ ✓ ✓ ✓ -mT5 mC4 [11] ✓ ✓✓ -PanGu-α 1.1TB Chinese Text Corpus ��� ✓ ✓ ✓✓ -CPM-2 WuDaoCorpus [109] ✓ ✓ -Codex 54 million public repositories from Github ✓ -ERNIE-3.0 Chinese text corpora, Baidu Search, Web -text, QA-long, QA-short, Poetry and Cou- -plet Domain-specific data from medical, -law, and financial area Baidu knowledge -graph with more than 50 million facts✓ ✓✓✓✓ ✓ ✓ ✓ -Jurassic-1 Wikipedia, OWT, Books, C4, Pile [301], -arXiv, GitHub✓ ✓ ✓ ✓ -HyperCLOV A Korean blogs, Community sites, News, -KiN Korean Wikipedia, Wikipedia (En- -glish and Japanese), Modu-Corpus: Mes- -senger, News, Spoken and written lan- -guage corpus, Web corpus✓ -Yuan 1.0 Common Crawl, SogouT, Sogou News, -Baidu Baike, Wikipedia, Books✓✓✓ ✓ -Gopher subsets of MassiveWeb Books, C4, News, -GitHub and Wikipedia samples from Mas- -siveText✓ ✓ ✓ ✓ ✓✓ ✓ -ERNIE-3.0 TITAN Same as ERNIE 3.0 and ERNIE 3.0 ad- -versarial dataset, ERNIE 3.0 controllable -dataset✓✓✓ ✓ ✓ -GPT-NeoX-20B Pile [301] ✓ ✓ ✓ ✓ ✓✓ -OPT RoBERTa [299], Pile [301], PushShift.io -Reddit [423]✓✓ ✓ ✓ -BLOOM ROOTs [13] ✓ ✓✓ ✓ ✓ ✓ -Galactica arXiv, PMC, Semantic Scholar, Wikipedia, -StackExchange, LibreText, Open Text- -books, RefSeq Genome, OEIS, LIPID -MAPS, NASAExoplanet, Common Crawl, -ScientificCC, AcademicCC, GitHub repos- -itories Khan Problems, GSM8K, OneS- -mallStep✓ ✓ ✓ ✓ ✓ -GLaM Filtered Webpages, Social media conversa- -tions Wikipedia, Forums, Books, News✓ ✓ ✓ ✓✓ -LaMDA Infiniset : Public documents, Dialogs, Ut- -terances✓ -MT-NLG Two snapshots of Common Crawl and -Books3, OpenWebText2, Stack Exchange, -PubMed Abstracts, Wikipedia, PG-19 -[242], BookCorpus2, NIH ExPorter, Pile, -CC-Stories, RealNews✓ ✓ ✓✓ ✓ -AlphaCode Selected GitHub repositories, CodeCon- -tests: Codeforces, Description2Code, Co- -deNet✓ -Chinchilla MassiveWeb, MassiveText Books, C4, -News, GitHub, Wikipedia✓ ✓ ✓ ✓✓ ✓ -PaLM webpages, books, Wikipedia, news, arti- -cles, source code, social media conversa- -tions✓ ✓ ✓ ✓ ✓ ✓ -AlexaTM Wikipedia, mC4 ✓ ✓✓ ✓ ✓ -U-PaLM Same as PaLM ✓ ✓ ✓ ✓ ✓ ✓✓ -UL2 - ✓ ✓✓✓ ✓ ✓ -GLM-130B - ✓ ✓ ✓ -CodeGen Pile, BigQuery, BigPython ✓ -LLaMA CommonCrawl, C4, Github, Wikipedia, -Books, arXiv, StackExchange✓ ✓ ✓✓✓ ✓ ✓ -PanGu- Σ WuDaoCorpora, CLUE, Pile, C4, Python -code✓✓✓✓ ✓ ✓ -BloombergGPT inPile, Pile, C4, Wikipedia ✓ ✓ ✓ ✓ ✓✓ ✓ -CodeT5 + CodeSearchNet, Github Code ✓ ✓ -StarCoder The Stack v1.2 ✓ ✓ ✓ ✓ -LLaMA-2 ✓ ✓ ✓ ✓✓✓ ✓ -PaLM-2 Web documents, Code, Books, Maths, -Conversation✓ ✓✓✓✓ ✓ ✓✓✓ ✓ ✓ -30 - ---- Page 31 --- -Table 11: An illustration of training datasets and evaluation benchmarks used in fine-tuned LLMs. “SNI” is a short of Super-NaturalInsturctions. -Models Training DatasetBIG- -benchMMLU BBH RAFT FLAN SNI PromptSource TyDiQA HumanEval MBPPTruthful / -Bias / -Toxicity -T0 Pool of Prompts ✓ -WebGPT ELI5 [424], ELI5 fact- -check [166], TriviaQA [341], -ARC-Challenge [342], ARC- -Easy [342], Hand-written data, -Demonstrations of humans, Com- -parisons between model-generated -answers✓ -Tk-INSTRUCT SNI [18] ✓ -mT0 xP3 [154] -OPT-IML PromptSource [17], FLAN [16], -SNI [425], UnifiedSKG [426], -CrossFit [427], ExMix [428], -T5 [10], Reasoning✓ ✓ ✓ ✓ ✓ ✓ -Flan Muffin, T0-SF, NIv2, CoT ✓ ✓ ✓ -WizardCoder Code Alpaca ✓ ✓ -reading comprehension task in which a system must resolve -references in a text, often requiring world knowledge and rea- -soning about the text. 
CSQA [358]: CommonsenseQA is a question-answering dataset that requires commonsense knowledge, evaluating the ability of AI models to understand and answer such questions.
5.2.7. Reading Comprehension
BoolQ [363]: A dataset derived from Google search queries, BoolQ challenges models to answer binary (yes/no) questions. The questions occur naturally and are paired with a paragraph from a Wikipedia article containing the answer. It is a test of reading comprehension and reasoning.
SQUADv2 [364]: The Stanford Question Answering Dataset (SQuAD) [362] is a collection of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage. SQuADv2 combines the original SQuAD1.1 dataset with over 50,000 unanswerable questions. The aim is to evaluate a model's ability to understand and answer questions based on a given context and to determine when a question is unanswerable.
DROP [365]: DROP, or Discrete Reasoning Over the content of Paragraphs, is designed to test a model's ability to understand a wide variety of reading phenomena. It encourages comprehensive and reliable evaluation of reading comprehension capabilities.
RTE [366]: The Recognizing Textual Entailment (RTE) datasets come from a series of annual competitions on textual entailment, predicting whether a given sentence logically follows from another, evaluating a model's understanding of logical relationships in text.
WebQA [367]: A dataset for open-domain question answering, WebQA offers a large collection of web-based question-answer pairs. It is designed to assess the ability of AI models to understand and answer questions based on web content.
CMRC2018 [369]: This dataset tests Chinese language models' ability to reason comprehensively and is designed with a challenging span-extraction format that pushes the boundaries of machine performance.
5.2.8. Mathematical Reasoning
MATH [382]: This dataset is a platform for evaluating the mathematical problem-solving abilities of AI models. It contains a diverse set of math problems, ranging from arithmetic to calculus, and is designed to test the model's ability to understand and solve complex mathematical problems.
Math23k [383]: This one challenges a model's ability to understand and solve mathematical word problems. It contains 23,000 Chinese arithmetic word problems that require models to perform reasoning and computation based on the problem description.
GSM8K [384]: A dataset of diverse grade-school math word problems, testing a model's ability to perform multi-step mathematical reasoning.
5.2.9. Problem Solving and Logical Reasoning
ANLI [393]: A large-scale dataset designed to test the robustness of machine learning models in Natural Language Inference (NLI), created through an iterative, adversarial process where humans try to generate examples that models cannot correctly classify.
HumanEval [141]: A benchmark of hand-written programming problems for evaluating the functional correctness of model-generated code. Each problem is paired with unit tests, and results are commonly reported with the pass@k metric (a minimal sketch of its estimator is given below).
StrategyQA [349]: A question-answering dataset that requires reasoning over multiple pieces of evidence to evaluate the strategic reasoning ability of AI models, pushing the boundaries of what machines can understand and answer.
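For reference, pass@k is usually computed with the unbiased estimator introduced alongside HumanEval [141]: generate n samples per problem, count the c samples that pass all unit tests, and estimate the probability that at least one of k randomly drawn samples passes.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem, given n generated samples,
    c of which pass all unit tests: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0   # fewer failures than draws: at least one success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 200 samples drawn for one problem, 13 of which passed the tests
print(round(pass_at_k(n=200, c=13, k=1), 3))   # ≈ 0.065
print(round(pass_at_k(n=200, c=13, k=10), 3))  # ≈ 0.50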
5.2.10. Cross-Lingual Understanding
XNLI [398]: A cross-lingual benchmark, XNLI extends the MultiNLI [429] corpus to 15 languages, including low-resource ones like Urdu. It tests models on cross-lingual sentence understanding, with 112,500 annotated pairs across three categories: entailment, contradiction, and neutral.
PAWS-X [399]: PAWS-X, or Cross-lingual Paraphrase Adversaries from Word Scrambling, is a multilingual version of the PAWS [430] dataset for paraphrase identification. It includes examples in seven languages and is designed to evaluate the performance of cross-lingual paraphrase identification models.
5.2.11. Truthfulness
Truthful-QA [405]: A unique benchmark that measures a language model's truthfulness when generating answers. The dataset includes questions across various categories like health, law, and politics, some designed to test the model against common human misconceptions.
5.2.12. Biases and Ethics in AI
ETHOS [408]: ETHOS is a hate speech detection dataset built from YouTube and Reddit comments. It is a tool in the fight against online hate speech, offering binary and multi-label variants for robust content moderation.
StereoSet [409]: StereoSet is a comprehensive dataset designed to measure and evaluate the presence of stereotypical biases in language models. It focuses on four key domains: gender, profession, race, and religion. Contrasting stereotypical bias against language modeling ability, it provides a valuable tool for understanding and mitigating biases in large language models. Many of the benchmarks above are multiple-choice and are commonly scored by comparing the model's likelihood of each candidate answer; a minimal sketch of this procedure follows.
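As a concrete illustration of how such multiple-choice benchmarks are typically scored in the zero-shot setting, the following sketch compares the log-likelihood a causal LM assigns to each candidate answer. It assumes a HuggingFace-style interface; the model choice and the toy question are illustrative only.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_logprob(context: str, option: str) -> float:
    # Sum log-probabilities of the option tokens given the context; assumes the
    # option starts at a token boundary (true for options with a leading space).
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logp = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for pos in range(ctx_ids.shape[1], full_ids.shape[1]):
        total += logp[0, pos - 1, full_ids[0, pos]].item()  # next-token prediction
    return total

q = "Question: Which freezes at 0 °C? Answer:"
options = [" water", " iron"]
print(max(options, key=lambda o: option_logprob(q, o)))  # likely " water"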
6. Applications
Applying Large Language Models (LLMs) to a variety of downstream tasks has become a popular trend in both AI-related research communities and industries, with many emerging uses being discovered and explored daily. LLMs, which are capable of understanding and generating human-like text, have found meaningful applications across a variety of fields. This section provides an overview of LLM applications in medicine, education, science, mathematics, law, finance, robotics, and coding. While each of these domains poses different challenges, LLMs open up opportunities to make significant contributions to them through their generalizability.
General Purpose: LLMs are widely considered general-purpose tools for a wide variety of tasks [431]. This is due to their inherent ability to understand, generate, and manipulate human-like text in a contextually relevant manner. It allows them to perform tasks ranging from simple language translation and question answering to more complex tasks like summarization, text generation, and even programming help [432]. The utility of LLMs is further enhanced by their ability to adapt to the specific style and tone of the text they are processing, making the outputs more user-friendly and context-aware. In everyday applications, LLMs can be used as personal assistants, helping users draft emails or schedule appointments [433]; they can also be deployed in customer service to handle common questions, or applied to generate content for digital platforms like websites by creating human-like text based on given prompts [434]. Moreover, LLMs play a crucial role in data analysis, where they can filter large volumes of text data, summarize key points, and find patterns that would take humans much longer to identify [435]. Despite their wide-ranging applications, it is essential to remember that LLMs, similar to any AI system, are only as good as the data they have been trained on.
Medicine: The application of LLMs in the field of medicine is reshaping healthcare delivery and research. For example, LLMs are increasingly used in clinical decision support systems to provide physicians with evidence-based treatment recommendations [436, 437, 438]. By analyzing patient data and medical literature, they can help identify potential diagnoses, suggest appropriate tests, and recommend optimal treatment strategies. Moreover, LLMs can also enhance patient interactions with healthcare systems; e.g., they can be used in chatbot applications [439, 440, 441] to answer patient queries about symptoms or medications, schedule appointments, and even provide essential health advice. For medical research, LLMs are used to extract and filter information from a considerable amount of medical literature, identify relevant studies, summarize findings, and even predict future research trends [442, 443, 444]. For medical education, LLMs can help create training materials, generate exam questions, provide detailed explanations of complex medical topics, and offer personalized feedback to students [445, 446, 447, 448]. They can also simulate patient interactions, enabling students to practice and improve their clinical skills. At a broader level, LLMs can assist in public health initiatives by analyzing media data to detect disease outbreaks, monitor public sentiment towards health policies, and disseminate health information in a clear and understandable manner [449]. Employing LLMs to support such public health initiatives requires addressing related issues such as data privacy, the necessity for explainability, and the potential risk of propagating biases [450, 451].
Education: The integration of LLMs into the educational sector offers opportunities to enhance learning experiences, teacher support, and educational content development. For students, by analyzing their learning styles, performance, and preferences, LLMs can provide customized study materials and practice questions to create personalized learning experiences [452]. For teachers, LLMs can help create lesson plans, grade assignments, and generate diverse and inclusive educational content, saving significant time for teaching and student interaction [453, 454]. In language learning, LLMs serve as advanced conversational partners capable of simulating conversations in multiple languages, correcting grammar, enhancing vocabulary, and aiding pronunciation for fluency in practice [455]. Furthermore, LLMs improve accessibility in education by providing support for students with disabilities: they can generate real-time transcriptions for the hearing impaired, offer reading assistance for the visually impaired, and simplify complex texts for those with learning disabilities [451]. As LLMs continue to evolve, their applications in education can benefit more students and teachers from different perspectives in practice.
Science: Similar to medical applications, LLMs can expedite the research process by quickly analyzing and summarizing scientific literature. By producing comprehensible and accessible research summaries, LLMs can assist researchers in staying up-to-date with the latest findings, even in fields outside their area of expertise [456, 457].
In addition, LLMs can aid scientists in formulating new hypotheses and research questions, since their ability to process large-scale datasets allows them to unveil insights that might not be immediately apparent to human researchers [458]. Moreover, for scientific writing, LLMs can help researchers draft documents, suggest improvements, and ensure adherence to specific formatting guidelines [459, 460]. This not only saves time but also improves the clarity of scientific communication, enabling interdisciplinary teams to work together more effectively.
Maths: In addition to providing mathematical research and education support, LLMs can assist in solving mathematical problems by giving step-by-step explanations and guiding users through complex proofs and calculations. They can help identify errors in reasoning or computation and suggest corrections, serving as an invaluable tool for both learning and verification purposes [461, 462]. LLMs can be employed to check the validity of mathematical proofs, offering a preliminary filter before human review. While they are not a substitute for the meticulous work of mathematicians, they can help simplify the process of proof verification [463, 464]. Moreover, LLMs enhance accessibility to mathematics by translating complex concepts and findings into understandable language for non-specialists [465], bridging the gap between theoretical mathematics and applied contexts such as physics, engineering, and economics.
Table 12: Performance comparison of top-performing LLMs across various NLU and NLG tasks. Here, “N-Shots” indicates the number of example prompts provided to the model during evaluation, representing its capability in few-shot or zero-shot settings, “f” represents the fine-tuned version, and “B” represents the benchmark.
Task | Dataset/Benchmark | Top-1: Model (Size), Score (N-shots) | Top-2: Model (Size), Score (N-shots) | Top-3: Model (Size), Score (N-shots)
Multi-Task BIG-bench (B) Chinchilla (70B) 65.1 (5-shot) Gopher (280B) 53.97 (5-shot) PaLM (540B) 53.7 (5-shot)
Multi-Task MMLU (B) GPT-4 (-) 86.4 (5-shot) Gemini (Ultra) 83.7 (5-shot) Flan-PaLM-2 (f)(Large) 81.2 (5-shot)
Language Understanding SuperGLUE (B) ERNIE 3.0 (12B) 90.6 (-) PaLM (f)(540B) 90.4 (-) T5 (11B) 88.9 (-)
Story Comprehension and Generation HellaSwag GPT-4 (-) 95.3 (10-shot) Gemini (Ultra) 87.8 (10-shot) PaLM-2 (Large) 86.8 (one shot)
Story Comprehension and Generation StoryCloze GPT-3 (175B) 87.7 (few shot) PaLM-2 (Large) 87.4 (one shot) OPT (175B) 79.82 (-)
Physical Knowledge and World Understanding PIQA PaLM-2 (Large) 85.0 (one shot) LLaMA (65B) 82.8 (zero shot) MT-NLG (530B) 81.99 (zero shot)
Physical Knowledge and World Understanding TriviaQA PaLM-2 (Large) 86.1 (one shot) LLaMA-2 (70B) 85.0 (one shot) PaLM (540B) 81.4 (one shot)
Contextual Language Understanding LAMBADA PaLM (540B) 89.7 (few shot) MT-NLG (530B) 87.15 (few shot) PaLM-2 (Large) 86.9 (one shot)
Commonsense Reasoning WinoGrande GPT-4 (-) 87.5 (5-shot) PaLM-2 (Large) 83.0 (one shot) PaLM (540B) 81.1 (zero shot)
Commonsense Reasoning SIQA LLaMA (65B) 52.3 (zero shot) Chinchilla (70B) 51.3 (zero shot) Gopher (280B) 50.6 (zero shot)
Reading Comprehension BoolQ PaLM (f)(540B) 92.2 (-) T5 (11B) 91.2 (-) PaLM-2 (Large) 90.9 (one shot)
Truthfulness Truthful-QA LLaMA (65B) 57 (-)
Mathematical Reasoning MATH Gemini (Ultra) 53.2 (4-shot) PaLM-2 (Large) 34.3 (4-shot) LLaMA-2 (65B) 13.5 (4-shot)
Mathematical Reasoning GSM8K GPT-4 (-) 92.0 (5-shot) PaLM-2 (Large) 80.7 (8-shot) U-PaLM (540B) 58.5 (-)
Problem Solving and Logical Reasoning HumanEval Gemini (f)(Ultra) 74.4 (zero shot) GPT-4 (-) 67.0 (zero shot) Code Llama (34B) 48.8 (zero shot)
Law: LLMs can assist with the thematic analysis of legal documents, including generating initial coding for datasets, identifying themes, and classifying data according to these themes. This collaborative effort between legal experts and LLMs has proved effective in analyzing legal texts such as court opinions on theft, improving both the efficiency and quality of the research [466]. Additionally, LLMs have been evaluated for their ability to generate explanations of legal terms, focusing on improving factual accuracy and relevance by incorporating sentences from case law. By feeding relevant case law into the LLM, the augmented models can generate higher-quality explanations with less factually incorrect information [467]. Moreover, LLMs can be trained with specialized domain knowledge to perform legal reasoning tasks [468] and answer legal questions [469].
Finance: LLMs like BloombergGPT [151], trained on extensive proprietary financial datasets, exhibit superior performance on financial tasks. This indicates the value of domain-specific training in creating LLMs that can more accurately understand and process industry-specific language and concepts. The introduction of FinGPT [470] as an open-source model offers transparent and accessible resources to develop novel applications such as robo-advising, algorithmic trading, and low-code solutions, ultimately expanding the capabilities of financial services. Both BloombergGPT and FinGPT show the adaptability of LLMs to the financial domain, with the former demonstrating the power of custom datasets and the latter emphasizing a data-centric approach and low-rank adaptation techniques for customization. Moreover, LLMs demonstrate an ability to break down complex financial tasks into actionable plans, enabling end-to-end solutions that were previously unfeasible with a single model [471].
Robotics: In robotics research, LLMs have promising applications, such as enhancing human-robot interaction [28, 472, 473, 474], task planning [237], motion planning [246], navigation [246, 475], object manipulation [236], and personalized robots [476]. LLMs enable robots to understand the environment effectively and generate plans to complete tasks collaboratively [240, 26]. They can facilitate continuous learning by allowing robots to access and integrate information from a wide range of sources, helping robots acquire new skills, adapt to changes, and refine their paths [224, 233, 234].
7. Challenges and Future Directions
LLMs such as GPT-4 and its predecessors have significantly advanced natural language processing. Nevertheless, they also bring along a set of challenges. The computational cost, adversarial robustness, and interpretability are among the technical challenges intrinsic to these models. Furthermore, as these models are scaled up to handle more complex tasks or to operate in more complex or dynamic environments, new challenges in scalability, privacy, and real-time processing emerge. On the frontier of foundational research, integrating multi-modality and the effectiveness of transfer learning are being keenly explored.
Additionally, the continuous learning aspect of these models, which aims at models that can adapt to new information over time, presents a fresh set of challenges. These challenges not only underscore the technical intricacies involved but also highlight the broader impact and future trajectory of LLMs in real-world applications. The following paragraphs delve into these challenges, shedding light on ongoing and potential efforts to address them.
Computational Cost: Training LLMs requires extensive computational resources, which increases production costs and raises environmental concerns due to substantial energy consumption during large-scale training. Performance improves as computational resources increase, but the rate of improvement gradually decreases when both the model and dataset size remain fixed, following the power law of diminishing returns [477].
Bias and Fairness: LLMs can inherit and amplify societal biases present in their training data. These biases can manifest in the model's outputs, leading to potential ethical and fairness issues [478].
Overfitting: Although LLMs possess substantial learning capabilities, they are susceptible to overfitting noisy and peculiar patterns within their extensive training data. Consequently, this may cause them to generate illogical responses [479]. The debate about memorization vs. generalization in LLMs is about finding the right balance. Memorization allows the model to remember specific details from its training data, ensuring it can provide accurate answers to precise questions. Generalization, however, enables the model to make inferences and produce responses for inputs it has not seen before, which is essential for handling a variety of real-world tasks. Striking the right balance is the challenge: too much memorization can lead to overfitting, making the model inflexible and prone to struggling with new inputs [480].
Economic and Research Inequality: The high cost of training and deploying LLMs may concentrate their development within well-funded organizations, potentially worsening economic and research inequalities in AI [481].
Reasoning and Planning: Some reasoning and planning tasks, even ones as seemingly simple as common-sense planning, which humans find easy, remain well beyond the current capabilities of LLMs when evaluated using an assessment framework. This is not entirely unexpected, considering that LLMs primarily generate text completions based on likelihood and offer no solid guarantees in terms of reasoning abilities [482].
Hallucinations: LLMs exhibit “hallucinations”, where they generate responses that, while sounding plausible, are incorrect or do not align with the provided information [483]. Hallucinations can be categorized into three categories.
•Input-conflicting hallucination, wherein LLMs produce content that diverges from the input given by users.
•Context-conflicting hallucination, where LLMs generate content that contradicts information they have generated earlier.
•Fact-conflicting hallucination, which involves LLMs generating content that does not align with established world knowledge.
Prompt Engineering: Prompts serve as inputs to LLMs, and their syntax and semantics play a crucial role in determining the model's output. Prompt variations, sometimes counter-intuitive to humans, can result in significant changes in model output; these are addressed through prompt engineering, which involves designing natural language queries to guide LLM responses effectively [484, 32].
Limited Knowledge: Information acquired during pretraining is limited and may become obsolete after some time. Re-training the model with updated data is costly. To generate factually accurate responses, practitioners use a retrieval augmentation pipeline [198] (a minimal sketch is given below). However, pre-trained models are not trained with retrieval augmented generation (RAG) [6, 21]; hence, adapting the training pipeline is necessary [193, 25].
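The following is a minimal sketch of the retrieval-augmentation idea: fetch passages relevant to the query and prepend them to the prompt so a frozen model can ground its answer. The toy lexical retriever and example documents are ours for illustration; real systems use dense-embedding search over large indexes.

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy lexical retriever: rank documents by word overlap with the query.
    Production systems replace this with dense-embedding similarity search."""
    q_words = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                  reverse=True)[:k]

def rag_prompt(query: str, docs: list[str], k: int = 2) -> str:
    # Prepend the retrieved passages so the frozen LLM can ground its answer.
    context = "\n".join(f"- {p}" for p in retrieve(query, docs, k))
    return (f"Answer the question using only the context below.\n"
            f"Context:\n{context}\nQuestion: {query}\nAnswer:")

docs = ["The Eiffel Tower is in Paris.",
        "GPUs accelerate matrix multiplication.",
        "Paris is the capital of France."]
print(rag_prompt("Where is the Eiffel Tower?", docs))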
Safety and Controllability: Using LLMs comes with the risk of generating harmful, misleading, or inappropriate content, whether by accident or when given specific prompts. Ensuring these models are safely utilized is a significant concern [485].
Security and Privacy: LLMs are prone to leaking personal information and generating false, unethical, or misaligned responses. Researchers have explored various security attacks, i.e., backdoor attacks, jailbreaking, prompt injection, and data poisoning, that break LLM security. Therefore, developing better defense mechanisms is essential to ensure LLMs are safe, reliable, and trustworthy for complex AI applications [486].
Multi-Modality: Multi-modal learning, where LLMs are trained on diverse data like text, images, and videos, aims to create models with richer understanding but faces challenges in data alignment, fusion strategies, and higher computational demands.
Catastrophic Forgetting: LLMs are often pre-trained on large datasets and then fine-tuned on domain-specific data, reducing training resources. However, they face issues like domain adaptation and catastrophic forgetting, which hinder the retention of original knowledge when learning new tasks.
Adversarial Robustness: Large Language Models (LLMs) have shown great capabilities in various tasks but are vulnerable to adversarial attacks, where slight, deliberate input alterations can mislead them. Especially with models like BERT, adversarial fine-tuning can enhance robustness, although it sometimes compromises generalization [487]. As LLMs integrate further into complex systems, examining their security properties becomes crucial, given the emerging field of adversarial attacks on LLMs within trustworthy ML [488]. This vulnerability is notable in safety-critical domains, necessitating robust adversarial evaluation tools to ensure LLM reliability [489]. A minimal sketch of such a perturbation-based robustness check is given below.
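As a rough illustration of such an evaluation tool, the following sketch probes a classifier with small character-level perturbations and reports how often its prediction survives. The perturbation scheme and the classify interface are illustrative assumptions, not a method from the cited works.

import random

def char_perturb(text: str, n_swaps: int = 2, seed: int = 0) -> str:
    """Toy character-swap attack: small, human-readable edits that can
    nonetheless flip a brittle model's prediction."""
    random.seed(seed)
    chars = list(text)
    for _ in range(n_swaps):
        i = random.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_rate(classify, texts, n_variants: int = 5) -> float:
    # Fraction of inputs whose predicted label survives all perturbed variants.
    stable = sum(
        all(classify(char_perturb(t, seed=s)) == classify(t)
            for s in range(n_variants))
        for t in texts)
    return stable / len(texts)

# `classify` stands in for any text classifier (e.g., a fine-tuned BERT head).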
-There is a risk of malicious content creation, filter bypass, -and data privacy issues, especially in e-commerce, where -protecting customer privacy is crucial. If models are trained -on private data, additional concerns arise if such models are -made publicly available. LLMs tend to memorize phrases from -their training sets, which an adversary could exploit to extract -sensitive data, posing a threat to personal privacy [492, 493]. -Real-Time Processing: Real-time processing in Large Lan- -guage Models (LLMs) is pivotal for various applications, -especially with the rising popularity of mobile AI applications -and concerns regarding information security and privacy. -However, LLMs often have hundreds of layers and millions -of parameters, which impede real-time processing due to the -high computational demands and limited weight storage on -hardware platforms, particularly in edge computing environ- -ments [494]. While certain e fforts like MobileBERT aim -to reduce memory requirements, they still face substantial -execution overhead due to the large number of model layers, -leading to high inference latency. -Long-Term Dependencies: Large Language Models have -shown considerable progress in understanding and generating -text, yet they often struggle with preserving context and -handling long-term dependencies, particularly in complex, -multi-turn conversations or long documents. This limitation -can lead to incoherent or irrelevant responses. -Hardware Acceleration: The growth of LLMs presents signif- -icant hardware challenges due to the increasing computational -and memory demands associated with training and deploying -these models. GPUs have played a crucial role in meeting the -hardware requirements for training LLMs, with the networking -industry also evolving to optimize hardware for training -workloads. However, the growing size of LLMs, which has -been outpacing hardware progress, makes model inference in- -creasingly costly. Model quantization is a promising approach -to bridge the widening gap between LLM size and hardware -capacity [495]. Although specialized hardware acceleration -like GPUs or TPUs can significantly reduce the computational -cost, making real-time applications more feasible, they may not -fully resolve all limitations, necessitating further advancements -in hardware technology. -Regulatory and Ethical Frameworks: The rapid advancements -in artificial intelligence have given rise to sophisticated Large -Language Models (LLMs) like OpenAI’s GPT-4 [157] and -Google’s Bard. These developments underscore the imperative -for regulatory oversight to manage the ethical and social -challenges accompanying LLMs’ widespread use [496]. For -instance, LLMs can generate content that can be used posi-tively or negatively, emphasizing the need for proactive ethical -frameworks and policy measures to guide their responsible -use and assign accountability for their outputs [497]. Auditing -is identified as a promising governance mechanism to ensure -that AI systems, including LLMs, are designed and deployed -ethically, legally, and technically robust [498]. -8. Conclusion -This article has comprehensively reviewed the develop- -ments in LLMs. It contributes to summarizing significant -findings of LLMs in the existing literature and provides a -detailed analysis of the design aspects, including architec- -tures, datasets, and training pipelines. We identified crucial -architectural components and training strategies employed by -different LLMs. 
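To illustrate the quantization idea mentioned above, the following is a minimal sketch of symmetric 8-bit post-training quantization of a single weight matrix in NumPy. The layer shape and the single per-tensor scale are simplifying assumptions; practical methods typically use per-channel or group-wise scales and calibration data.

```python
# Symmetric int8 post-training quantization of a weight matrix: store weights
# as 8-bit integers plus one float scale, and dequantize on the fly at
# inference.  This is why quantization shrinks weight storage roughly 4x
# relative to float32, at the cost of a small reconstruction error.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 with a single symmetric scale."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)  # toy layer
    q, scale = quantize_int8(w)
    err = float(np.abs(w - dequantize(q, scale)).mean())
    print(f"bytes: {w.nbytes} -> {q.nbytes}, mean abs error: {err:.2e}")
```

The roughly fourfold reduction in weight storage is what lets larger models fit within a fixed memory budget; more elaborate schemes aim to keep the resulting quantization error from degrading accuracy.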
Regulatory and Ethical Frameworks: The rapid advancements in artificial intelligence have given rise to sophisticated Large Language Models (LLMs) like OpenAI's GPT-4 [157] and Google's Bard. These developments underscore the imperative for regulatory oversight to manage the ethical and social challenges accompanying LLMs' widespread use [496]. For instance, LLMs can generate content that can be used positively or negatively, emphasizing the need for proactive ethical frameworks and policy measures to guide their responsible use and assign accountability for their outputs [497]. Auditing is identified as a promising governance mechanism to ensure that AI systems, including LLMs, are designed and deployed in a manner that is ethically sound, legally compliant, and technically robust [498].

8. Conclusion

This article has comprehensively reviewed the developments in LLMs. It summarizes the significant findings reported in the existing literature and provides a detailed analysis of design aspects, including architectures, datasets, and training pipelines. We identified crucial architectural components and training strategies employed by different LLMs. These aspects are presented as summaries and discussions throughout the article. Moreover, we have discussed the performance differences of LLMs in zero-shot and few-shot settings, explored the impact of fine-tuning, and compared supervised and generalized models as well as encoder vs. decoder vs. encoder-decoder architectures. A comprehensive review of multi-modal LLMs, retrieval-augmented LLMs, LLM-powered agents, efficient LLMs, datasets, evaluation, applications, and challenges is also provided. This article is anticipated to serve as a valuable resource for researchers, offering insights into the recent advancements in LLMs and providing fundamental concepts and details to develop better LLMs.

Acknowledgement: The authors would like to acknowledge the support received from the Saudi Data and AI Authority (SDAIA) and King Fahd University of Petroleum and Minerals (KFUPM) under the SDAIA-KFUPM Joint Research Center for Artificial Intelligence Grant No. JRC-AI-RFP-11.

References
[1] A. Chernyavskiy, D. Ilvovsky, P. Nakov, Transformers: "the end of history" for natural language processing?, in: Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part III 21, Springer, 2021, pp. 677–693. 1
[2] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, Superglue: A stickier benchmark for general-purpose language understanding systems, Advances in neural information processing systems 32 (2019). 1, 26, 29
[3] D. Adiwardana, M.-T. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, et al., Towards a human-like open-domain chatbot, arXiv preprint arXiv:2001.09977 (2020). 1
[4] B. A. y Arcas, Do large language models understand us?, Daedalus 151 (2) (2022) 183–197. 2
[5] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (8) (2019) 9. 2, 7
[6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901. 2, 6, 7, 8, 9, 16, 18, 23, 24, 25, 34
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018). 2, 18, 24
[8] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: NAACL-HLT, Association for Computational Linguistics, 2018, pp. 2227–2237. 2
[9] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, arXiv preprint arXiv:1910.13461 (2019). 2
[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (1) (2020) 5485–5551. 2, 7, 8, 18, 19, 24, 25, 28, 30, 31
[11] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mt5: A massively multilingual pre-trained text-to-text transformer, arXiv preprint arXiv:2010.11934 (2020).
2, 7, 8, 24, -25, 28, 30 -[12] Z. Zhang, Y . Gu, X. Han, S. Chen, C. Xiao, Z. Sun, Y . Yao, F. Qi, -J. Guan, P. Ke, et al., Cpm-2: Large-scale cost-e ffective pre-trained lan- -guage models, AI Open 2 (2021) 216–224. 2, 8, 25 -[13] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ili ´c, D. Hesslow, -R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al., Bloom: A 176b- -parameter open-access multilingual language model, arXiv preprint -arXiv:2211.05100 (2022). 2, 4, 9, 11, 23, 24, 25, 30 -[14] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, -M. Diab, X. Li, X. V . Lin, et al., Opt: Open pre-trained transformer -language models, arXiv preprint arXiv:2205.01068 (2022). 2, 9, 11, 24, -25 -[15] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, -P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., Palm: Scal- -ing language modeling with pathways, arXiv preprint arXiv:2204.02311 -(2022). 2, 6, 9, 11, 23, 24, 25 -[16] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fedus, E. Li, -X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned -language models, arXiv preprint arXiv:2210.11416 (2022). 2, 7, 11, 16, -17, 22, 24, 25, 28, 31 -[17] V . Sanh, A. Webson, C. Ra ffel, S. H. Bach, L. Sutawika, Z. Alyafeai, -A. Cha ffin, A. Stiegler, T. L. Scao, A. Raja, et al., Multitask -prompted training enables zero-shot task generalization, arXiv preprint -arXiv:2110.08207 (2021). 2, 11, 16, 25, 28, 31 -[18] Y . Wang, S. Mishra, P. Alipoormolabashi, Y . Kordi, A. Mirzaei, -A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, et al., -Super-naturalinstructions: Generalization via declarative instructions on -1600+nlp tasks, in: Proceedings of the 2022 Conference on Empirical -Methods in Natural Language Processing, 2022, pp. 5085–5109. 2, 7, -11, 16, 17, 24, 25, 28, 31 -[19] Y . Wang, Y . Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, H. Ha- -jishirzi, Self-instruct: Aligning language model with self generated in- -structions, arXiv preprint arXiv:2212.10560 (2022). 2, 16, 19, 22, 28 -[20] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, -C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language mod- -els to follow instructions with human feedback, Advances in Neural In- -formation Processing Systems 35 (2022) 27730–27744. 2, 7, 11, 16, -22 -[21] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, -N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open -foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 -(2023). 2, 7, 10, 16, 25, 34 -[22] J. Wei, Y . Tay, R. Bommasani, C. Ra ffel, B. Zoph, S. Borgeaud, D. Yo- -gatama, M. Bosma, D. Zhou, D. Metzler, et al., Emergent abilities of -large language models, arXiv preprint arXiv:2206.07682 (2022). 2 -[23] T. Webb, K. J. Holyoak, H. Lu, Emergent analogical reasoning in large -language models, Nature Human Behaviour 7 (9) (2023) 1526–1541. 2 -[24] D. A. Boiko, R. MacKnight, G. Gomes, Emergent autonomous sci- -entific research capabilities of large language models, arXiv preprint -arXiv:2304.05332 (2023). 2 -[25] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, -J. Dwivedi-Yu, A. Joulin, S. Riedel, E. Grave, Few-shot learning with -retrieval augmented language models, arXiv preprint arXiv:2208.03299 -(2022). 2, 18, 19, 34 -[26] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, -A. Wahid, J. Tompson, Q. Vuong, T. 
Yu, et al., Palm-e: An embodiedmultimodal language model, arXiv preprint arXiv:2303.03378 (2023). -2, 20, 22, 33 -[27] A. Parisi, Y . Zhao, N. Fiedel, Talm: Tool augmented language models, -arXiv preprint arXiv:2205.12255 (2022). 2, 19, 20 -[28] B. Zhang, H. Soh, Large language models as zero-shot human models -for human-robot interaction, arXiv preprint arXiv:2303.03548 (2023). 2, -33 -[29] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y . Zhou, J. Wang, A. Hu, P. Shi, -Y . Shi, et al., mplug-owl: Modularization empowers large language -models with multimodality, arXiv preprint arXiv:2304.14178 (2023). 2, -22 -[30] W. Wang, Z. Chen, X. Chen, J. Wu, X. Zhu, G. Zeng, P. Luo, -T. Lu, J. Zhou, Y . Qiao, et al., Visionllm: Large language model -is also an open-ended decoder for vision-centric tasks, arXiv preprint -arXiv:2305.11175 (2023). 2, 22 -[31] R. Yang, L. Song, Y . Li, S. Zhao, Y . Ge, X. Li, Y . Shan, Gpt4tools: -Teaching large language model to use tools via self-instruction, arXiv -preprint arXiv:2305.18752 (2023). 2, 19, 22, 23 -[32] E. Saravia, Prompt Engineering Guide, https: //github.com /dair- -ai/Prompt-Engineering-Guide (12 2022). 2, 7, 18, 34 -[33] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y . Xu, -W. Zheng, X. Xia, et al., Glm-130b: An open bilingual pre-trained -model, arXiv preprint arXiv:2210.02414 (2022). 2, 10, 23, 24, 25 -[34] Y . Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, S. C. Hoi, Codet5 +: -Open code large language models for code understanding and genera- -tion, arXiv preprint arXiv:2305.07922 (2023). 2, 11, 24, 25 -[35] S. Wang, Y . Sun, Y . Xiang, Z. Wu, S. Ding, W. Gong, S. Feng, J. Shang, -Y . Zhao, C. Pang, et al., Ernie 3.0 titan: Exploring larger-scale knowl- -edge enhanced pre-training for language understanding and generation, -arXiv preprint arXiv:2112.12731 (2021). 2, 8, 24, 25 -[36] J. Rasley, S. Rajbhandari, O. Ruwase, Y . He, Deepspeed: System op- -timizations enable training deep learning models with over 100 billion -parameters, in: Proceedings of the 26th ACM SIGKDD International -Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505– -3506. 2, 5 -[37] S. Rajbhandari, J. Rasley, O. Ruwase, Y . He, Zero: Memory optimiza- -tions toward training trillion parameter models, in: SC20: International -Conference for High Performance Computing, Networking, Storage and -Analysis, IEEE, 2020, pp. 1–16. 2, 4, 24 -[38] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, G. Neubig, Towards -a unified view of parameter-e fficient transfer learning, arXiv preprint -arXiv:2110.04366 (2021). 2, 20, 21 -[39] Z. Hu, Y . Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W. Lee, L. Bing, S. Po- -ria, Llm-adapters: An adapter family for parameter-e fficient fine-tuning -of large language models, arXiv preprint arXiv:2304.01933 (2023). 2, -20 -[40] B. Lester, R. Al-Rfou, N. Constant, The power of scale for parameter- -efficient prompt tuning, arXiv preprint arXiv:2104.08691 (2021). 2, 8, -20, 21 -[41] X. L. Li, P. Liang, Prefix-tuning: Optimizing continuous prompts for -generation, arXiv preprint arXiv:2101.00190 (2021). 2, 20, 21 -[42] X. Ma, G. Fang, X. Wang, Llm-pruner: On the structural pruning of -large language models, arXiv preprint arXiv:2305.11627 (2023). 2, 22 -[43] R. Xu, F. Luo, C. Wang, B. Chang, J. Huang, S. Huang, F. Huang, -From dense to sparse: Contrastive pruning for better pre-trained lan- -guage model compression, in: Proceedings of the AAAI Conference on -Artificial Intelligence, V ol. 36, 2022, pp. 11547–11555. 2, 22 -[44] G. Xiao, J. Lin, M. 
Seznec, H. Wu, J. Demouth, S. Han, Smoothquant: -Accurate and e fficient post-training quantization for large language -models, in: ICML, V ol. 202 of Proceedings of Machine Learning Re- -search, PMLR, 2023, pp. 38087–38099. 2, 21 -[45] C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, N. Wong, -Compression of generative pre-trained language models via quantiza- -tion, arXiv preprint arXiv:2203.10705 (2022). 2, 21 -[46] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, S. Naidu, -Giraffe: Adventures in expanding context lengths in llms, arXiv preprint -arXiv:2308.10882 (2023). 2, 17 -[47] B. Peng, J. Quesnelle, H. Fan, E. Shippole, Yarn: E fficient con- -text window extension of large language models, arXiv preprint -arXiv:2309.00071 (2023). 2, 17 -[48] M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y .-H. Sung, Y . Yang, -36 - ---- Page 37 --- -Longt5: E fficient text-to-text transformer for long sequences, arXiv -preprint arXiv:2112.07916 (2021). 2, 18 -[49] S. Chen, S. Wong, L. Chen, Y . Tian, Extending context window -of large language models via positional interpolation, arXiv preprint -arXiv:2306.15595 (2023). 2, 17 -[50] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, -J. Zhang, Z. Dong, et al., A survey of large language models, arXiv -preprint arXiv:2303.18223 (2023). 2, 3, 7 -[51] U. Naseem, I. Razzak, S. K. Khan, M. Prasad, A comprehensive sur- -vey on word representation models: From classical to state-of-the-art -word representation language models, Transactions on Asian and Low- -Resource Language Information Processing 20 (5) (2021) 1–35. 2, 3 -[52] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, -E. Agirre, I. Heinz, D. Roth, Recent advances in natural language pro- -cessing via large pre-trained language models: A survey, arXiv preprint -arXiv:2111.01243 (2021). 2, 3 -[53] C. Zhou, Q. Li, C. Li, J. Yu, Y . Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, -L. He, et al., A comprehensive survey on pretrained foundation models: -A history from bert to chatgpt, arXiv preprint arXiv:2302.09419 (2023). -2, 3 -[54] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, -J. Xu, Z. Sui, A survey for in-context learning, arXiv preprint -arXiv:2301.00234 (2022). 2, 7, 18 -[55] J. Huang, K. C.-C. Chang, Towards reasoning in large language models: -A survey, arXiv preprint arXiv:2212.10403 (2022). 2, 7, 18 -[56] Y . Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, -Q. Liu, Aligning large language models with human: A survey, arXiv -preprint arXiv:2307.12966 (2023). 2 -[57] X. Zhu, J. Li, Y . Liu, C. Ma, W. Wang, A survey on model compression -for large language models, arXiv preprint arXiv:2308.07633 (2023). 2 -[58] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, E. Chen, A survey on multi- -modal large language models, arXiv preprint arXiv:2306.13549 (2023). -2, 22, 23 -[59] J. J. Webster, C. Kit, Tokenization as the initial phase in nlp, in: COL- -ING 1992 volume 4: The 14th international conference on computa- -tional linguistics, 1992. 4 -[60] T. Kudo, Subword regularization: Improving neural network translation -models with multiple subword candidates, in: Proceedings of the 56th -Annual Meeting of the Association for Computational Linguistics (V ol- -ume 1: Long Papers), 2018, pp. 66–75. 4 -[61] R. Sennrich, B. Haddow, A. 
Birch, Neural machine translation of rare -words with subword units, in: Proceedings of the 54th Annual Meet- -ing of the Association for Computational Linguistics (V olume 1: Long -Papers), 2016, pp. 1715–1725. 4 -[62] M. Schuster, K. Nakajima, Japanese and korean voice search, in: 2012 -IEEE international conference on acoustics, speech and signal process- -ing (ICASSP), IEEE, 2012, pp. 5149–5152. 4 -[63] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Ra ffel, M. Dey, M. Gallé, -A. Raja, C. Si, W. Y . Lee, B. Sagot, et al., Between words and char- -acters: A brief history of open-vocabulary modeling and tokenization in -nlp, arXiv preprint arXiv:2112.10508 (2021). 4 -[64] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, -Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural -information processing systems 30 (2017). 4, 7 -[65] O. Press, N. Smith, M. Lewis, Train short, test long: Attention with -linear biases enables input length extrapolation, in: International Con- -ference on Learning Representations, 2022. -URL https://openreview.net/forum?id=R8sQPpGCv0 4, 17 -[66] J. Su, Y . Lu, S. Pan, A. Murtadha, B. Wen, Y . Liu, Roformer: En- -hanced transformer with rotary position embedding, arXiv preprint -arXiv:2104.09864 (2021). 4, 9, 17 -[67] R. Child, S. Gray, A. Radford, I. Sutskever, Generating long sequences -with sparse transformers, arXiv preprint arXiv:1904.10509 (2019). 4, 7, -23 -[68] T. Dao, D. Fu, S. Ermon, A. Rudra, C. Ré, Flashattention: Fast and -memory-e fficient exact attention with io-awareness, Advances in Neural -Information Processing Systems 35 (2022) 16344–16359. 4 -[69] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks -are universal approximators, Neural networks 2 (5) (1989) 359–366. 4 -[70] V . Nair, G. E. Hinton, Rectified linear units improve restricted boltz- -mann machines, in: Proceedings of the 27th international conference onmachine learning (ICML-10), 2010, pp. 807–814. 4 -[71] D. Hendrycks, K. Gimpel, Gaussian error linear units (gelus), arXiv -preprint arXiv:1606.08415 (2016). 4 -[72] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, -Dropout: a simple way to prevent neural networks from overfitting, The -journal of machine learning research 15 (1) (2014) 1929–1958. 4 -[73] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. -Ke, A. Goyal, Y . Bengio, A. Courville, C. Pal, Zoneout: Regular- -izing rnns by randomly preserving hidden activations, arXiv preprint -arXiv:1606.01305 (2016). 4 -[74] N. Shazeer, Glu variants improve transformer, arXiv preprint -arXiv:2002.05202 (2020). 4 -[75] Y . N. Dauphin, A. Fan, M. Auli, D. Grangier, Language modeling with -gated convolutional networks, in: International conference on machine -learning, PMLR, 2017, pp. 933–941. 4 -[76] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint -arXiv:1607.06450 (2016). 4 -[77] B. Zhang, R. Sennrich, Root mean square layer normalization, Advances -in Neural Information Processing Systems 32 (2019). 4 -[78] A. Baevski, M. Auli, Adaptive input representations for neural language -modeling, arXiv preprint arXiv:1809.10853 (2018). 4 -[79] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, F. Wei, Deepnet: Scaling -transformers to 1,000 layers, arXiv preprint arXiv:2203.00555 (2022). 4 -[80] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, B. 
Catanzaro, -Megatron-lm: Training multi-billion parameter language models using -model parallelism, arXiv preprint arXiv:1909.08053 (2019). 4, 5 -[81] "bmtrain: E fficient training for big models.". -URL https://github.com/OpenBMB/BMTrain 4, 5 -[82] T. Wolf, L. Debut, V . Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cis- -tac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the- -art natural language processing, in: Proceedings of the 2020 conference -on empirical methods in natural language processing: system demon- -strations, 2020, pp. 38–45. 5 -[83] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclau- -rin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, et al., -Jax: composable transformations of python +numpy programs (2018). -5 -[84] S. Li, J. Fang, Z. Bian, H. Liu, Y . Liu, H. Huang, B. Wang, Y . You, -Colossal-ai: A unified deep learning system for large-scale parallel train- -ing, arXiv preprint arXiv:2110.14883 (2021). 5 -[85] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, J. Tang, Fastmoe: A -fast mixture-of-expert training system, arXiv preprint arXiv:2103.13262 -(2021). 5 -[86] L. Huawei Technologies Co., Huawei mindspore ai development frame- -work, in: Artificial Intelligence Technology, Springer, 2022, pp. 137– -162. 5 -[87] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, -T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., Pytorch: An imper- -ative style, high-performance deep learning library, Advances in neural -information processing systems 32 (2019). 5 -[88] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, -S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: a system for large- -scale machine learning., in: Osdi, V ol. 16, Savannah, GA, USA, 2016, -pp. 265–283. 5 -[89] T. Chen, M. Li, Y . Li, M. Lin, N. Wang, M. Wang, T. Xiao, -B. Xu, C. Zhang, Z. Zhang, Mxnet: A flexible and e fficient machine -learning library for heterogeneous distributed systems, arXiv preprint -arXiv:1512.01274 (2015). 5 -[90] W. Fedus, B. Zoph, N. Shazeer, Switch transformers: Scaling to tril- -lion parameter models with simple and e fficient sparsity, The Journal of -Machine Learning Research 23 (1) (2022) 5232–5270. 5, 9 -[91] N. Du, Y . Huang, A. M. Dai, S. Tong, D. Lepikhin, Y . Xu, M. Krikun, -Y . Zhou, A. W. Yu, O. Firat, et al., Glam: E fficient scaling of language -models with mixture-of-experts, in: International Conference on Ma- -chine Learning, PMLR, 2022, pp. 5547–5569. 5, 9, 23, 24, 25 -[92] X. Ren, P. Zhou, X. Meng, X. Huang, Y . Wang, W. Wang, P. Li, -X. Zhang, A. Podolskiy, G. Arshinov, et al., Pangu-P: Towards trillion -parameter language model with sparse heterogeneous computing, arXiv -preprint arXiv:2303.10845 (2023). 5, 10, 16, 23, 24, 25 -[93] T. Wang, A. Roberts, D. Hesslow, T. Le Scao, H. W. Chung, I. Beltagy, -J. Launay, C. Ra ffel, What language model architecture and pretrain- -37 - ---- Page 38 --- -ing objective works best for zero-shot generalization?, in: International -Conference on Machine Learning, PMLR, 2022, pp. 22964–22984. 5 -[94] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y . Wang, J. Gao, M. Zhou, -H.-W. Hon, Unified language model pre-training for natural language -understanding and generation, Advances in neural information process- -ing systems 32 (2019). 6 -[95] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, -S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language -models, arXiv preprint arXiv:2001.08361 (2020). 6 -[96] J. 
Ho ffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, -E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, -et al., Training compute-optimal large language models, arXiv preprint -arXiv:2203.15556 (2022). 6, 9, 25, 29 -[97] S. Iyer, X. V . Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shuster, -T. Wang, Q. Liu, P. S. Koura, et al., Opt-iml: Scaling language model in- -struction meta learning through the lens of generalization, arXiv preprint -arXiv:2212.12017 (2022). 7, 11, 16, 17, 22, 25, 28 -[98] Z. Sun, Y . Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox, Y . Yang, C. Gan, -Principle-driven self-alignment of language models from scratch with -minimal human supervision, arXiv preprint arXiv:2305.03047 (2023). -7, 17 -[99] A. Askell, Y . Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, -N. Joseph, B. Mann, N. DasSarma, et al., A general language assistant -as a laboratory for alignment, arXiv preprint arXiv:2112.00861 (2021). -7 -[100] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, -P. Christiano, G. Irving, Fine-tuning language models from human pref- -erences, arXiv preprint arXiv:1909.08593 (2019). 7 -[101] S. Kim, S. J. Joo, D. Kim, J. Jang, S. Ye, J. Shin, M. Seo, The cot collec- -tion: Improving zero-shot and few-shot learning of language models via -chain-of-thought fine-tuning, arXiv preprint arXiv:2305.14045 (2023). -7, 16 -[102] Q. Liu, F. Zhou, Z. Jiang, L. Dou, M. Lin, From zero to hero: Exam- -ining the power of symbolic tasks in instruction tuning, arXiv preprint -arXiv:2304.07995 (2023). 7, 16 -[103] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, -D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large -language models, Advances in Neural Information Processing Systems -35 (2022) 24824–24837. 7, 20, 23 -[104] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowd- -hery, D. Zhou, Self-consistency improves chain of thought reasoning in -language models, arXiv preprint arXiv:2203.11171 (2022). 7, 20 -[105] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Gri ffiths, Y . Cao, K. Narasimhan, -Tree of thoughts: Deliberate problem solving with large language mod- -els, arXiv preprint arXiv:2305.10601 (2023). 7, 20 -[106] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, -A. Gesmundo, M. Attariyan, S. Gelly, Parameter-e fficient transfer learn- -ing for nlp, in: International Conference on Machine Learning, PMLR, -2019, pp. 2790–2799. 7, 20 -[107] S. McCandlish, J. Kaplan, D. Amodei, O. D. Team, An empirical model -of large-batch training, arXiv preprint arXiv:1812.06162 (2018). 7 -[108] W. Zeng, X. Ren, T. Su, H. Wang, Y . Liao, Z. Wang, X. Jiang, Z. Yang, -K. Wang, X. Zhang, et al., Pangu- α: Large-scale autoregressive pre- -trained chinese language models with auto-parallel computation, arXiv -preprint arXiv:2104.12369 (2021). 8, 23, 24, 25 -[109] S. Yuan, H. Zhao, Z. Du, M. Ding, X. Liu, Y . Cen, X. Zou, Z. Yang, -J. Tang, Wudaocorpora: A super large-scale chinese corpora for pre- -training language models, AI Open 2 (2021) 65–68. 8, 30 -[110] Y . Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, -Y . Zhao, Y . Lu, et al., Ernie 3.0: Large-scale knowledge enhanced -pre-training for language understanding and generation, arXiv preprint -arXiv:2107.02137 (2021). 8, 25 -[111] Z. Dai, Z. Yang, Y . Yang, J. Carbonell, Q. V . Le, R. 
Salakhutdinov, -Transformer-xl: Attentive language models beyond a fixed-length con- -text, arXiv preprint arXiv:1901.02860 (2019). 8 -[112] O. Lieber, O. Sharir, B. Lenz, Y . Shoham, Jurassic-1: Technical details -and evaluation, White Paper. AI21 Labs 1 (2021). 8, 24, 25 -[113] Y . Levine, N. Wies, O. Sharir, H. Bata, A. Shashua, Limits to depth ef- -ficiencies of self-attention, Advances in Neural Information Processing -Systems 33 (2020) 22640–22651. 8, 11 -[114] B. Kim, H. Kim, S.-W. Lee, G. Lee, D. Kwak, D. H. Jeon, S. Park,S. Kim, S. Kim, D. Seo, et al., What changes can large-scale language -models bring? intensive study on hyperclova: Billions-scale korean -generative pretrained transformers, arXiv preprint arXiv:2109.04650 -(2021). 8, 25 -[115] S. Wu, X. Zhao, T. Yu, R. Zhang, C. Shen, H. Liu, F. Li, H. Zhu, J. Luo, -L. Xu, et al., Yuan 1.0: Large-scale pre-trained language model in zero- -shot and few-shot learning, arXiv preprint arXiv:2110.04725 (2021). 8, -24, 25 -[116] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Ho ffmann, F. Song, -J. Aslanides, S. Henderson, R. Ring, S. Young, et al., Scaling lan- -guage models: Methods, analysis & insights from training gopher, arXiv -preprint arXiv:2112.11446 (2021). 8, 9, 25, 28 -[117] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, -J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V . Korthikanti, et al., -Using deepspeed and megatron to train megatron-turing nlg 530b, a -large-scale generative language model, arXiv preprint arXiv:2201.11990 -(2022). 8, 9, 24, 25 -[118] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, -H. He, C. Leahy, K. McDonell, J. Phang, et al., Gpt-neox-20b: An open- -source autoregressive language model, arXiv preprint arXiv:2204.06745 -(2022). 9, 23, 24, 25 -[119] W. Ben, K. Aran, Gpt-j-6b: A 6 billion parameter autoregressive lan- -guage model (2021). 9 -[120] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, -B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al., Mixed pre- -cision training, arXiv preprint arXiv:1710.03740 (2017). 9, 23 -[121] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hin- -ton, J. Dean, Outrageously large neural networks: The sparsely-gated -mixture-of-experts layer, arXiv preprint arXiv:1701.06538 (2017). 9, 23 -[122] S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza, -H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky, et al., Alex- -atm 20b: Few-shot learning using a large-scale multilingual seq2seq -model, arXiv preprint arXiv:2208.01448 (2022). 9, 23, 24, 25 -[123] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, -S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al., Palm 2 technical report, -arXiv preprint arXiv:2305.10403 (2023). 9, 25 -[124] Y . Tay, J. Wei, H. W. Chung, V . Q. Tran, D. R. So, S. Shakeri, X. Garcia, -H. S. Zheng, J. Rao, A. Chowdhery, et al., Transcending scaling laws -with 0.1% extra compute, arXiv preprint arXiv:2210.11399 (2022). 9, -24, 25 -[125] Y . Tay, M. Dehghani, V . Q. Tran, X. Garcia, J. Wei, X. Wang, H. W. -Chung, D. Bahri, T. Schuster, S. Zheng, et al., Ul2: Unifying lan- -guage learning paradigms, in: The Eleventh International Conference -on Learning Representations, 2022. 9, 10, 24, 25 -[126] Z. Du, Y . Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, J. 
Tang, Glm: Gen- -eral language model pretraining with autoregressive blank infilling, in: -Proceedings of the 60th Annual Meeting of the Association for Compu- -tational Linguistics (V olume 1: Long Papers), 2022, pp. 320–335. 10 -[127] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, -T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., -Llama: Open and e fficient foundation language models, arXiv preprint -arXiv:2302.13971 (2023). 10, 23, 25 -[128] M. N. Rabe, C. Staats, Self-attention does not need o(n2) memory, arXiv -preprint arXiv:2112.05682 (2021). 10 -[129] V . A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, -M. Shoeybi, B. Catanzaro, Reducing activation recomputation in large -transformer models, Proceedings of Machine Learning and Systems 5 -(2023). 10 -[130] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, -A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The llama 3 herd of -models, arXiv preprint arXiv:2407.21783 (2024). 10, 25 -[131] https://mistral.ai/news/mixtral-8x22b/ . 10, 25 -[132] https://github.com/Snowflake-Labs/snowflake-arctic . 10, -25 -[133] https://github.com/xai-org/grok-1 . 10 -[134] https://x.ai/blog/grok-1.5 . 10 -[135] G. Team, R. Anil, S. Borgeaud, Y . Wu, J.-B. Alayrac, J. Yu, R. Soricut, -J. Schalkwyk, A. M. Dai, A. Hauth, et al., Gemini: a family of highly -capable multimodal models, arXiv preprint arXiv:2312.11805 (2023). -10 -[136] M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. -38 - ---- Page 39 --- -Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al., Gem- -ini 1.5: Unlocking multimodal understanding across millions of tokens -of context, arXiv preprint arXiv:2403.05530 (2024). 10 -[137] B. Adler, N. Agarwal, A. Aithal, D. H. Anh, P. Bhattacharya, A. Brun- -dyn, J. Casper, B. Catanzaro, S. Clay, J. Cohen, et al., Nemotron-4 340b -technical report, arXiv preprint arXiv:2406.11704 (2024). 10, 25 -[138] X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, -Q. Du, Z. Fu, et al., Deepseek llm: Scaling open-source language models -with longtermism, arXiv preprint arXiv:2401.02954 (2024). 10, 25 -[139] DeepSeek-AI, A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, -C. Deng, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, -F. Lin, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Yang, -H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. L. Cai, J. Liang, -J. Guo, J. Ni, J. Li, J. Chen, J. Yuan, J. Qiu, J. Song, K. Dong, K. Gao, -K. Guan, L. Wang, L. Zhang, L. Xu, L. Xia, L. Zhao, L. Zhang, M. Li, -M. Wang, M. Zhang, M. Zhang, M. Tang, M. Li, N. Tian, P. Huang, -P. Wang, P. Zhang, Q. Zhu, Q. Chen, Q. Du, R. J. Chen, R. L. Jin, R. Ge, -R. Pan, R. Xu, R. Chen, S. S. Li, S. Lu, S. Zhou, S. Chen, S. Wu, S. Ye, -S. Ma, S. Wang, S. Zhou, S. Yu, S. Zhou, S. Zheng, T. Wang, T. Pei, -T. Yuan, T. Sun, W. L. Xiao, W. Zeng, W. An, W. Liu, W. Liang, W. Gao, -W. Zhang, X. Q. Li, X. Jin, X. Wang, X. Bi, X. Liu, X. Wang, X. Shen, -X. Chen, X. Chen, X. Nie, X. Sun, Deepseek-v2: A strong, economical, -and e fficient mixture-of-experts language model, CoRR abs /2405.04434 -(2024). 10, 25 -[140] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y . Zhou, S. Savarese, -C. Xiong, Codegen: An open large language model for code with multi- -turn program synthesis, arXiv preprint arXiv:2203.13474 (2022). 11, -23, 25, 28 -[141] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Ed- -wards, Y . Burda, N. Joseph, G. 
Brockman, et al., Evaluating large lan- -guage models trained on code, arXiv preprint arXiv:2107.03374 (2021). -11, 25, 29, 31 -[142] Y . Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, -T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al., Competition-level -code generation with alphacode, Science 378 (6624) (2022) 1092–1097. -11, 23, 25, 29 -[143] N. Shazeer, Fast transformer decoding: One write-head is all you need, -arXiv preprint arXiv:1911.02150 (2019). 11 -[144] R. Y . Pang, H. He, Text generation by learning from demonstrations, -arXiv preprint arXiv:2009.07839 (2020). 11 -[145] R. Dabre, A. Fujita, Softmax tempering for training neural machine -translation models, arXiv preprint arXiv:2009.09372 (2020). 11 -[146] Y . Wang, W. Wang, S. Joty, S. C. Hoi, Codet5: Identifier-aware unified -pre-trained encoder-decoder models for code understanding and genera- -tion, arXiv preprint arXiv:2109.00859 (2021). 11 -[147] R. Li, L. B. Allal, Y . Zi, N. Muennigho ff, D. Kocetkov, C. Mou, -M. Marone, C. Akiki, J. Li, J. Chim, et al., Starcoder: may the source be -with you!, arXiv preprint arXiv:2305.06161 (2023). 11, 25 -[148] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, -A. Poulton, V . Kerkez, R. Stojnic, Galactica: A large language model for -science, arXiv preprint arXiv:2211.09085 (2022). 11, 24, 25, 29 -[149] FairScale authors, Fairscale: A general purpose modular pytorch library -for high performance and large scale training, https://github.com/ -facebookresearch/fairscale (2021). 11 -[150] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. -Cheng, A. Jin, T. Bos, L. Baker, Y . Du, et al., Lamda: Language models -for dialog applications, arXiv preprint arXiv:2201.08239 (2022). 11, 25 -[151] S. Wu, O. Irsoy, S. Lu, V . Dabravolski, M. Dredze, S. Gehrmann, -P. Kambadur, D. Rosenberg, G. Mann, Bloomberggpt: A large language -model for finance, arXiv preprint arXiv:2303.17564 (2023). 11, 25, 33 -[152] X. Zhang, Q. Yang, D. Xu, Xuanyuan 2.0: A large chinese finan- -cial chat model with hundreds of billions parameters, arXiv preprint -arXiv:2305.12002 (2023). 11, 17, 25 -[153] W. Ben, Mesh-transformer-jax: Model-parallel implementation of trans- -former language model with jax (2021). 12, 24 -[154] N. Muennigho ff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, -T. L. Scao, M. S. Bari, S. Shen, Z.-X. Yong, H. Schoelkopf, et al., -Crosslingual generalization through multitask finetuning, arXiv preprint -arXiv:2211.01786 (2022). 16, 25, 28, 31 -[155] D. Yin, X. Liu, F. Yin, M. Zhong, H. Bansal, J. Han, K.-W. Chang, -Dynosaur: A dynamic growth paradigm for instruction-tuning data cu-ration, arXiv preprint arXiv:2305.14327 (2023). 16 -[156] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, P. Lu, -C. He, X. Yue, et al., Llama-adapter v2: Parameter-e fficient visual in- -struction model, arXiv preprint arXiv:2304.15010 (2023). 16, 24 -[157] Openai. gpt-4 technical report (2023). 16, 35 -[158] R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, -T. B. Hashimoto, Stanford alpaca: An instruction-following llama -model, https://github.com/tatsu-lab/stanford_alpaca -(2023). 16, 25, 28 -[159] W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, -S. Zhuang, Y . Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An -open-source chatbot impressing gpt-4 with 90%* chatgpt quality (March -2023). -URL https://lmsys.org/blog/2023-03-30-vicuna/ 16, 22, 25, -28 -[160] B. Peng, C. Li, P. He, M. 
Galley, J. Gao, Instruction tuning with gpt-4, -arXiv preprint arXiv:2304.03277 (2023). 16, 28 -[161] T. Liu, B. K. H. Low, Goat: Fine-tuned llama outperforms gpt-4 on -arithmetic tasks, arXiv preprint arXiv:2305.14201 (2023). 16 -[162] H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, T. Liu, Huatuo: -Tuning llama model with chinese medical knowledge, arXiv preprint -arXiv:2304.06975 (2023). 16 -[163] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, D. Jiang, -Wizardlm: Empowering large language models to follow complex in- -structions, arXiv preprint arXiv:2304.12244 (2023). 16 -[164] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, -D. Jiang, Wizardcoder: Empowering code large language models with -evol-instruct, arXiv preprint arXiv:2306.08568 (2023). 16, 25 -[165] J. Menick, M. Trebacz, V . Mikulik, J. Aslanides, F. Song, M. Chadwick, -M. Glaese, S. Young, L. Campbell-Gillingham, G. Irving, et al., Teach- -ing language models to support answers with verified quotes, arXiv -preprint arXiv:2203.11147 (2022). 17 -[166] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, -C. Hesse, S. Jain, V . Kosaraju, W. Saunders, et al., Webgpt: Browser- -assisted question-answering with human feedback, arXiv preprint -arXiv:2112.09332 (2021). 17, 19, 20, 25, 31 -[167] A. Glaese, N. McAleese, M. Tr˛ ebacz, J. Aslanides, V . Firoiu, T. Ewalds, -M. Rauh, L. Weidinger, M. Chadwick, P. Thacker, et al., Improving -alignment of dialogue agents via targeted human judgements, arXiv -preprint arXiv:2209.14375 (2022). 17, 20, 25 -[168] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, C. Finn, -Direct preference optimization: Your language model is secretly a re- -ward model, arXiv preprint arXiv:2305.18290 (2023). 17 -[169] H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao, J. Zhang, K. Shum, -T. Zhang, Raft: Reward ranked finetuning for generative foundation -model alignment, arXiv preprint arXiv:2304.06767 (2023). 17 -[170] Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, F. Huang, Rrhf: Rank -responses to align language models with human feedback without tears, -arXiv preprint arXiv:2304.05302 (2023). 17 -[171] F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y . Li, H. Wang, Preference rank- -ing optimization for human alignment, arXiv preprint arXiv:2306.17492 -(2023). 17 -[172] H. Liu, C. Sferrazza, P. Abbeel, Languages are rewards: Hindsight fine- -tuning using human feedback, arXiv preprint arXiv:2302.02676 (2023). -17 -[173] Y . Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, -A. Goldie, A. Mirhoseini, C. McKinnon, et al., Constitutional ai: Harm- -lessness from ai feedback, arXiv preprint arXiv:2212.08073 (2022). 17 -[174] Y . Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, -P. Liang, T. B. Hashimoto, Alpacafarm: A simulation frame- -work for methods that learn from human feedback, arXiv preprint -arXiv:2305.14387 (2023). 17 -[175] C. Si, Z. Gan, Z. Yang, S. Wang, J. Wang, J. Boyd-Graber, L. Wang, -Prompting gpt-3 to be reliable, arXiv preprint arXiv:2210.09150 (2022). -17 -[176] D. Ganguli, A. Askell, N. Schiefer, T. Liao, K. Lukoši ¯ut˙e, A. Chen, -A. Goldie, A. Mirhoseini, C. Olsson, D. Hernandez, et al., The capac- -ity for moral self-correction in large language models, arXiv preprint -arXiv:2302.07459 (2023). 17 -[177] A. Wei, N. Haghtalab, J. Steinhardt, Jailbroken: How does llm safety -training fail?, arXiv preprint arXiv:2307.02483 (2023). 17 -39 - ---- Page 40 --- -[178] D. Ganguli, L. Lovitt, J. Kernion, A. 
Askell, Y . Bai, S. Kadavath, -B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al., Red teaming lan- -guage models to reduce harms: Methods, scaling behaviors, and lessons -learned, arXiv preprint arXiv:2209.07858 (2022). 17, 28 -[179] S. Casper, J. Lin, J. Kwon, G. Culp, D. Hadfield-Menell, Explore, estab- -lish, exploit: Red teaming language models from scratch, arXiv preprint -arXiv:2306.09442 (2023). 17 -[180] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, -N. McAleese, G. Irving, Red teaming language models with language -models, arXiv preprint arXiv:2202.03286 (2022). 17 -[181] T. Scialom, T. Chakrabarty, S. Muresan, Fine-tuned language models are -continual learners, in: Proceedings of the 2022 Conference on Empirical -Methods in Natural Language Processing, 2022, pp. 6107–6122. 17 -[182] Z. Shi, A. Lipani, Don’t stop pretraining? make prompt-based fine- -tuning powerful learner, arXiv preprint arXiv:2305.01711 (2023). 17 -[183] H. Gupta, S. A. Sawant, S. Mishra, M. Nakamura, A. Mitra, S. Mashetty, -C. Baral, Instruction tuned models are quick learners, arXiv preprint -arXiv:2306.05539 (2023). 17 -[184] H. Chen, Y . Zhang, Q. Zhang, H. Yang, X. Hu, X. Ma, Y . Yanggong, -J. Zhao, Maybe only 0.5% data is needed: A preliminary exploration -of low training data instruction tuning, arXiv preprint arXiv:2305.09246 -(2023). 17 -[185] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y . Mao, X. Ma, A. Efrat, -P. Yu, L. Yu, et al., Lima: Less is more for alignment, arXiv preprint -arXiv:2305.11206 (2023). 17, 25, 28 -[186] C. Han, Q. Wang, W. Xiong, Y . Chen, H. Ji, S. Wang, Lm-infinite: Sim- -ple on-the-fly length generalization for large language models, arXiv -preprint arXiv:2308.16137 (2023). 17, 18 -[187] J. Ainslie, T. Lei, M. de Jong, S. Ontañón, S. Brahma, Y . Zemlyan- -skiy, D. Uthus, M. Guo, J. Lee-Thorp, Y . Tay, et al., Colt5: Faster -long-range transformers with conditional computation, arXiv preprint -arXiv:2303.09752 (2023). 18 -[188] J. Ding, S. Ma, L. Dong, X. Zhang, S. Huang, W. Wang, F. Wei, -Longnet: Scaling transformers to 1,000,000,000 tokens, arXiv preprint -arXiv:2307.02486 (2023). 18 -[189] Y . Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, J. Jia, Longlora: E ffi- -cient fine-tuning of long-context large language models, arXiv preprint -arXiv:2309.12307 (2023). 18 -[190] N. Ratner, Y . Levine, Y . Belinkov, O. Ram, I. Magar, O. Abend, -E. Karpas, A. Shashua, K. Leyton-Brown, Y . Shoham, Parallel context -windows for large language models, in: Proceedings of the 61st Annual -Meeting of the Association for Computational Linguistics (V olume 1: -Long Papers), 2023, pp. 6383–6402. 18 -[191] W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, F. Wei, -Augmenting language models with long-term memory, arXiv preprint -arXiv:2306.07174 (2023). 18 -[192] X. Xu, Z. Gou, W. Wu, Z.-Y . Niu, H. Wu, H. Wang, S. Wang, Long -time no see! open-domain conversation with long-term persona memory, -arXiv preprint arXiv:2203.05797 (2022). 18 -[193] S. Borgeaud, A. Mensch, J. Ho ffmann, T. Cai, E. Rutherford, K. Milli- -can, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark, et al., -Improving language models by retrieving from trillions of tokens, in: -International conference on machine learning, PMLR, 2022, pp. 2206– -2240. 18, 19, 34 -[194] W. Zhong, L. Guo, Q. Gao, Y . Wang, Memorybank: Enhanc- -ing large language models with long-term memory, arXiv preprint -arXiv:2305.10250 (2023). 18 -[195] N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, S. 
Yao, -Reflexion: Language agents with verbal reinforcement learning, arXiv -preprint arXiv:2303.11366 14 (2023). 18, 20 -[196] C. Hu, J. Fu, C. Du, S. Luo, J. Zhao, H. Zhao, Chatdb: Augment- -ing llms with databases as their symbolic memory, arXiv preprint -arXiv:2306.03901 (2023). 18 -[197] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y . Yang, -J. Callan, G. Neubig, Active retrieval augmented generation, arXiv -preprint arXiv:2305.06983 (2023). 18 -[198] O. Ram, Y . Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton- -Brown, Y . Shoham, In-context retrieval-augmented language models, -arXiv preprint arXiv:2302.00083 (2023). 18, 34 -[199] X. Li, X. Qiu, Mot: Pre-thinking and recalling enable chatgpt to self- -improve with memory-of-thoughts, arXiv preprint arXiv:2305.05181(2023). 18 -[200] D. Schuurmans, Memory augmented large language models are compu- -tationally universal, arXiv preprint arXiv:2301.04589 (2023). 18 -[201] A. Modarressi, A. Imani, M. Fayyaz, H. Schütze, Ret-llm: Towards a -general read-write memory for large language models, arXiv preprint -arXiv:2305.14322 (2023). 18 -[202] S. Robertson, H. Zaragoza, et al., The probabilistic relevance frame- -work: Bm25 and beyond, Foundations and Trends ®in Information Re- -trieval 3 (4) (2009) 333–389. 18 -[203] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, D. Zhou, -Rationale-augmented ensembles in language models, arXiv preprint -arXiv:2207.00747 (2022). 18 -[204] F. Zhang, B. Chen, Y . Zhang, J. Liu, D. Zan, Y . Mao, J.-G. Lou, W. Chen, -Repocoder: Repository-level code completion through iterative retrieval -and generation, arXiv preprint arXiv:2303.12570 (2023). 18 -[205] B. Wang, W. Ping, P. Xu, L. McAfee, Z. Liu, M. Shoeybi, Y . Dong, -O. Kuchaiev, B. Li, C. Xiao, et al., Shall we pretrain autoregressive -language models with retrieval? a comprehensive study, arXiv preprint -arXiv:2304.06762 (2023). 19 -[206] L. Wang, N. Yang, F. Wei, Learning to retrieve in-context examples for -large language models, arXiv preprint arXiv:2307.07164 (2023). 19 -[207] J. Liu, D. Shen, Y . Zhang, B. Dolan, L. Carin, W. Chen, What makes -good in-context examples for gpt-3?, arXiv preprint arXiv:2101.06804 -(2021). 19 -[208] O. Rubin, J. Herzig, J. Berant, Learning to retrieve prompts for in- -context learning, arXiv preprint arXiv:2112.08633 (2021). 19 -[209] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettle- -moyer, W.-t. Yih, Replug: Retrieval-augmented black-box language -models, arXiv preprint arXiv:2301.12652 (2023). 19 -[210] O. Rubin, J. Berant, Long-range language modeling with self-retrieval, -arXiv preprint arXiv:2306.13421 (2023). 19 -[211] K. Guu, K. Lee, Z. Tung, P. Pasupat, M. Chang, Retrieval augmented -language model pre-training, in: International conference on machine -learning, PMLR, 2020, pp. 3929–3938. 19 -[212] S. Hofstätter, J. Chen, K. Raman, H. Zamani, Fid-light: E fficient and ef- -fective retrieval-augmented text generation, in: Proceedings of the 46th -International ACM SIGIR Conference on Research and Development in -Information Retrieval, 2023, pp. 1437–1447. 19 -[213] M. Komeili, K. Shuster, J. Weston, Internet-augmented dialogue gener- -ation, arXiv preprint arXiv:2107.07566 (2021). 19 -[214] A. Lazaridou, E. Gribovskaya, W. Stokowiec, N. Grigorev, Internet- -augmented language models through few-shot prompting for open- -domain question answering, arXiv preprint arXiv:2203.05115 (2022). -19 -[215] D. Gao, L. Ji, L. Zhou, K. Q. Lin, J. Chen, Z. Fan, M. Z. 
Shou, Assist- -gpt: A general multi-modal assistant that can plan, execute, inspect, and -learn, arXiv preprint arXiv:2306.08640 (2023). 19 -[216] P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y . N. Wu, S.-C. Zhu, -J. Gao, Chameleon: Plug-and-play compositional reasoning with large -language models, arXiv preprint arXiv:2304.09842 (2023). 19, 20, 23 -[217] B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, M. T. -Ribeiro, Art: Automatic multi-step reasoning and tool-use for large lan- -guage models, arXiv preprint arXiv:2303.09014 (2023). 19 -[218] C.-Y . Hsieh, S.-A. Chen, C.-L. Li, Y . Fujii, A. Ratner, C.-Y . Lee, R. Kr- -ishna, T. Pfister, Tool documentation enables zero-shot tool-usage with -large language models, arXiv preprint arXiv:2308.00675 (2023). 19 -[219] Y . Song, W. Xiong, D. Zhu, C. Li, K. Wang, Y . Tian, S. Li, Restgpt: -Connecting large language models with real-world applications via rest- -ful apis, arXiv preprint arXiv:2306.06624 (2023). 19 -[220] S. Hao, T. Liu, Z. Wang, Z. Hu, Toolkengpt: Augmenting frozen lan- -guage models with massive tools via tool embeddings, arXiv preprint -arXiv:2305.11554 (2023). 19 -[221] S. G. Patil, T. Zhang, X. Wang, J. E. Gonzalez, Gorilla: Large language -model connected with massive apis, arXiv preprint arXiv:2305.15334 -(2023). 19 -[222] Q. Xu, F. Hong, B. Li, C. Hu, Z. Chen, J. Zhang, On the tool manipu- -lation capability of open-source large language models, arXiv preprint -arXiv:2305.16504 (2023). 19 -[223] Y . Qin, S. Liang, Y . Ye, K. Zhu, L. Yan, Y . Lu, Y . Lin, X. Cong, X. Tang, -B. Qian, et al., Toolllm: Facilitating large language models to master -16000 +real-world apis, arXiv preprint arXiv:2307.16789 (2023). 19, -40 - ---- Page 41 --- -20 -[224] Y . Shen, K. Song, X. Tan, D. Li, W. Lu, Y . Zhuang, Hugginggpt: Solv- -ing ai tasks with chatgpt and its friends in huggingface, arXiv preprint -arXiv:2303.17580 (2023). 19, 20, 33 -[225] Y . Liang, C. Wu, T. Song, W. Wu, Y . Xia, Y . Liu, Y . Ou, S. Lu, L. Ji, -S. Mao, et al., Taskmatrix. ai: Completing tasks by connecting foun- -dation models with millions of apis, arXiv preprint arXiv:2303.16434 -(2023). 19 -[226] D. Surís, S. Menon, C. V ondrick, Vipergpt: Visual inference via python -execution for reasoning, arXiv preprint arXiv:2303.08128 (2023). 20 -[227] A. Maedche, S. Morana, S. Schacht, D. Werth, J. Krumeich, Advanced -user assistance systems, Business & Information Systems Engineering -58 (2016) 367–370. 20 -[228] M. Campbell, A. J. Hoane Jr, F.-h. Hsu, Deep blue, Artificial intelligence -134 (1-2) (2002) 57–83. 20 -[229] S. Hong, X. Zheng, J. Chen, Y . Cheng, J. Wang, C. Zhang, Z. Wang, -S. K. S. Yau, Z. Lin, L. Zhou, et al., Metagpt: Meta programming for -multi-agent collaborative framework, arXiv preprint arXiv:2308.00352 -(2023). 20 -[230] Z. Xi, W. Chen, X. Guo, W. He, Y . Ding, B. Hong, M. Zhang, J. Wang, -S. Jin, E. Zhou, et al., The rise and potential of large language model -based agents: A survey, arXiv preprint arXiv:2309.07864 (2023). 20 -[231] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, -X. Chen, Y . Lin, et al., A survey on large language model based au- -tonomous agents, arXiv preprint arXiv:2308.11432 (2023). 20 -[232] W. Huang, P. Abbeel, D. Pathak, I. Mordatch, Language models as zero- -shot planners: Extracting actionable knowledge for embodied agents, -in: International Conference on Machine Learning, PMLR, 2022, pp. -9118–9147. 20 -[233] S. Hao, Y . Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, Z. 
Hu, Reason- -ing with language model is planning with world model, arXiv preprint -arXiv:2305.14992 (2023). 20, 33 -[234] W. Yao, S. Heinecke, J. C. Niebles, Z. Liu, Y . Feng, L. Xue, R. Murthy, -Z. Chen, J. Zhang, D. Arpit, et al., Retroformer: Retrospective -large language agents with policy gradient optimization, arXiv preprint -arXiv:2308.02151 (2023). 20, 33 -[235] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, -J. Tompson, I. Mordatch, Y . Chebotar, P. Sermanet, T. Jackson, -N. Brown, L. Luu, S. Levine, K. Hausman, brian ichter, Inner mono- -logue: Embodied reasoning through planning with language models, in: -6th Annual Conference on Robot Learning, 2022. -URL https://openreview.net/forum?id=3R3Pz5i0tye 20 -[236] C. Jin, W. Tan, J. Yang, B. Liu, R. Song, L. Wang, J. Fu, Alphablock: -Embodied finetuning for vision-language reasoning in robot manipula- -tion, arXiv preprint arXiv:2305.18898 (2023). 20, 33 -[237] I. Singh, V . Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, -J. Thomason, A. Garg, Progprompt: Generating situated robot task plans -using large language models, in: 2023 IEEE International Conference on -Robotics and Automation (ICRA), IEEE, 2023, pp. 11523–11530. 20, -33 -[238] W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. G. Arenas, H.-T. L. -Chiang, T. Erez, L. Hasenclever, J. Humplik, et al., Language to rewards -for robotic skill synthesis, arXiv preprint arXiv:2306.08647 (2023). 20 -[239] X. Tang, A. Zou, Z. Zhang, Y . Zhao, X. Zhang, A. Cohan, M. Gerstein, -Medagents: Large language models as collaborators for zero-shot med- -ical reasoning, arXiv preprint arXiv:2311.10537 (2023). 20 -[240] A. Brohan, Y . Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, -J. Ibarz, A. Irpan, E. Jang, R. Julian, et al., Do as i can, not as i say: -Grounding language in robotic a ffordances, in: Conference on Robot -Learning, PMLR, 2023, pp. 287–318. 20, 33 -[241] H. Ha, P. Florence, S. Song, Scaling up and distilling down: Language- -guided robot skill acquisition, arXiv preprint arXiv:2307.14535 (2023). -20 -[242] A. Rajvanshi, K. Sikka, X. Lin, B. Lee, H.-P. Chiu, A. Velasquez, Say- -nav: Grounding large language models for dynamic planning to navi- -gation in new environments, arXiv preprint arXiv:2309.04077 (2023). -20 -[243] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, Y . Su, -Llm-planner: Few-shot grounded planning for embodied agents with -large language models, arXiv preprint arXiv:2212.04088 (2022). 20 -[244] V . S. Dorbala, J. F. Mullen Jr, D. Manocha, Can an embodied agent findyour" cat-shaped mug"? llm-based zero-shot object navigation, arXiv -preprint arXiv:2303.03480 (2023). 20 -[245] C. Huang, O. Mees, A. Zeng, W. Burgard, Visual language maps for -robot navigation, in: 2023 IEEE International Conference on Robotics -and Automation (ICRA), IEEE, 2023, pp. 10608–10615. 20 -[246] Y . Ding, X. Zhang, C. Paxton, S. Zhang, Task and motion planning -with large language models for object rearrangement, arXiv preprint -arXiv:2303.06247 (2023). 20, 33 -[247] X. Liu, Y . Zheng, Z. Du, M. Ding, Y . Qian, Z. Yang, J. Tang, Gpt under- -stands, too, arXiv preprint arXiv:2103.10385 (2021). 20, 21 -[248] G. Chen, F. Liu, Z. Meng, S. Liang, Revisiting parameter-e fficient tun- -ing: Are we really there yet?, arXiv preprint arXiv:2202.07962 (2022). -20 -[249] Y . Wang, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, J. 
Gao, -Adamix: Mixture-of-adapter for parameter-e fficient tuning of large lan- -guage models, arXiv preprint arXiv:2205.12410 1 (2) (2022) 4. 20 -[250] E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, -W. Chen, Lora: Low-rank adaptation of large language models, arXiv -preprint arXiv:2106.09685 (2021). 21, 22, 23 -[251] X. Liu, K. Ji, Y . Fu, W. Tam, Z. Du, Z. Yang, J. Tang, P-tuning: Prompt -tuning can be comparable to fine-tuning across scales and tasks, in: Pro- -ceedings of the 60th Annual Meeting of the Association for Computa- -tional Linguistics (V olume 2: Short Papers), 2022, pp. 61–68. 21 -[252] A. Razdaibiedina, Y . Mao, R. Hou, M. Khabsa, M. Lewis, A. Almahairi, -Progressive prompts: Continual learning for language models, arXiv -preprint arXiv:2301.12314 (2023). 21 -[253] Z.-R. Zhang, C. Tan, H. Xu, C. Wang, J. Huang, S. Huang, To- -wards adaptive prefix tuning for parameter-e fficient language model -fine-tuning, arXiv preprint arXiv:2305.15212 (2023). 21 -[254] E. B. Zaken, S. Ravfogel, Y . Goldberg, Bitfit: Simple parameter- -efficient fine-tuning for transformer-based masked language-models, -arXiv preprint arXiv:2106.10199 (2021). 21 -[255] T. Dettmers, M. Lewis, Y . Belkada, L. Zettlemoyer, Llm. int8 (): -8-bit matrix multiplication for transformers at scale, arXiv preprint -arXiv:2208.07339 (2022). 21, 22 -[256] E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh, Gptq: Accurate -post-training quantization for generative pre-trained transformers, arXiv -preprint arXiv:2210.17323 (2022). 21 -[257] X. Wei, Y . Zhang, Y . Li, X. Zhang, R. Gong, J. Guo, X. Liu, Outlier sup- -pression +: Accurate quantization of large language models by equiva- -lent and optimal shifting and scaling, arXiv preprint arXiv:2304.09145 -(2023). 21 -[258] E. Frantar, D. Alistarh, Optimal brain compression: A framework for -accurate post-training quantization and pruning, Advances in Neural In- -formation Processing Systems 35 (2022) 4475–4488. 21 -[259] C. Lee, J. Jin, T. Kim, H. Kim, E. Park, Owq: Lessons learned from ac- -tivation outliers for weight quantization in large language models, arXiv -preprint arXiv:2306.02272 (2023). 21 -[260] S. J. Kwon, J. Kim, J. Bae, K. M. Yoo, J.-H. Kim, B. Park, B. Kim, J.- -W. Ha, N. Sung, D. Lee, Alphatuning: Quantization-aware parameter- -efficient adaptation of large-scale pre-trained language models, arXiv -preprint arXiv:2210.03858 (2022). 21 -[261] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: E fficient -finetuning of quantized llms, arXiv preprint arXiv:2305.14314 (2023). -21, 22 -[262] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y . Mehdad, Y . Shi, R. Kr- -ishnamoorthi, V . Chandra, Llm-qat: Data-free quantization aware train- -ing for large language models, arXiv preprint arXiv:2305.17888 (2023). -21, 22 -[263] Y . Guo, A. Yao, H. Zhao, Y . Chen, Network sketching: Exploiting bi- -nary structure in deep cnns, in: Proceedings of the IEEE Conference on -Computer Vision and Pattern Recognition, 2017, pp. 5955–5963. 21 -[264] J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, D. Lee, -Memory-e fficient fine-tuning of compressed large language models via -sub-4-bit integer quantization, arXiv preprint arXiv:2305.14152 (2023). -22 -[265] M. Sun, Z. Liu, A. Bair, J. Z. Kolter, A simple and e ffective pruning -approach for large language models, arXiv preprint arXiv:2306.11695 -(2023). 22 -[266] Z. Wang, J. Wohlwend, T. Lei, Structured pruning of large language -models, arXiv preprint arXiv:1910.04732 (2019). 
[267] L. Yin, Y. Wu, Z. Zhang, C.-Y. Hsieh, Y. Wang, Y. Jia, M. Pechenizkiy, Y. Liang, Z. Wang, S. Liu, Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity, arXiv preprint arXiv:2310.05175 (2023). 22
[268] C. Tao, L. Hou, H. Bai, J. Wei, X. Jiang, Q. Liu, P. Luo, N. Wong, Structured pruning for efficient generative pre-trained language models, in: Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 10880–10895. 22
[269] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al., Flamingo: a visual language model for few-shot learning, Advances in Neural Information Processing Systems 35 (2022) 23716–23736. 22
[270] J. Li, D. Li, S. Savarese, S. Hoi, Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, arXiv preprint arXiv:2301.12597 (2023). 22
[271] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, arXiv preprint arXiv:2304.08485 (2023). 22
[272] K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, Y. Qiao, Videochat: Chat-centric video understanding, arXiv preprint arXiv:2305.06355 (2023). 22
[273] M. Maaz, H. Rasheed, S. Khan, F. S. Khan, Video-chatgpt: Towards detailed video understanding via large vision and language models, arXiv preprint arXiv:2306.05424 (2023). 22
[274] H. Zhang, X. Li, L. Bing, Video-llama: An instruction-tuned audio-visual language model for video understanding, arXiv preprint arXiv:2306.02858 (2023). 22
[275] X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y. Zou, W. Wang, Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research, arXiv preprint arXiv:2303.17395 (2023). 22
[276] C. Lyu, M. Wu, L. Wang, X. Huang, B. Liu, Z. Du, S. Shi, Z. Tu, Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration, arXiv preprint arXiv:2306.09093 (2023). 22
[277] D. Zhu, J. Chen, X. Shen, X. Li, M. Elhoseiny, Minigpt-4: Enhancing vision-language understanding with advanced large language models, arXiv preprint arXiv:2304.10592 (2023). 22
[278] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020). 22
[279] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, S. Hoi, Instructblip: Towards general-purpose vision-language models with instruction tuning, arXiv preprint arXiv:2305.06500 (2023). 22
[280] Z. Xu, Y. Shen, L. Huang, Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning, arXiv preprint arXiv:2212.10773 (2022). 22
[281] Z. Zhao, L. Guo, T. Yue, S. Chen, S. Shao, X. Zhu, Z. Yuan, J. Liu, Chatbridge: Bridging modalities with large language model as a language catalyst, arXiv preprint arXiv:2305.16103 (2023). 22
[282] L. Li, Y. Yin, S. Li, L. Chen, P. Wang, S. Ren, M. Li, Y. Yang, J. Xu, X. Sun, et al., M3it: A large-scale dataset towards multi-modal multilingual instruction tuning, arXiv preprint arXiv:2306.04387 (2023). 22
[283] R. Pi, J. Gao, S. Diao, R. Pan, H. Dong, J. Zhang, L. Yao, J. Han, H. Xu, L. K. T. Zhang, Detgpt: Detect what you need via reasoning, arXiv preprint arXiv:2305.14167 (2023). 22
[284] G. Luo, Y. Zhou, T. Ren, S. Chen, X. Sun, R. Ji, Cheap and quick: Efficient vision-language instruction tuning for large language models, arXiv preprint arXiv:2305.15023 (2023). 22
[285] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, Y. Qiao, Llama-adapter: Efficient fine-tuning of language models with zero-init attention, arXiv preprint arXiv:2303.16199 (2023). 22
[286] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, in: International Conference on Machine Learning, PMLR, 2023, pp. 28492–28518. 22
[287] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, A. Smola, Multimodal chain-of-thought reasoning in language models, arXiv preprint arXiv:2302.00923 (2023). 23
[288] J. Ge, H. Luo, S. Qian, Y. Gan, J. Fu, S. Zhan, Chain of thought prompt tuning in vision language models, arXiv preprint arXiv:2304.07919 (2023). 23
[289] C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, N. Duan, Visual chatgpt: Talking, drawing and editing with visual foundation models, arXiv preprint arXiv:2303.04671 (2023). 23
[290] Z. Yang, L. Li, J. Wang, K. Lin, E. Azarnasab, F. Ahmed, Z. Liu, C. Liu, M. Zeng, L. Wang, Mm-react: Prompting chatgpt for multimodal reasoning and action, arXiv preprint arXiv:2303.11381 (2023). 23
[291] T. Wang, J. Zhang, J. Fei, Y. Ge, H. Zheng, Y. Tang, Z. Li, M. Gao, S. Zhao, Y. Shan, et al., Caption anything: Interactive image description with diverse multimodal controls, arXiv preprint arXiv:2305.02677 (2023). 23
[292] X. Zhu, R. Zhang, B. He, Z. Zeng, S. Zhang, P. Gao, Pointclip v2: Adapting clip for powerful 3d open-world learning, arXiv preprint arXiv:2211.11682 (2022). 23
[293] T. Gupta, A. Kembhavi, Visual programming: Compositional visual reasoning without training, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14953–14962. 23
[294] P. Gao, Z. Jiang, H. You, P. Lu, S. C. Hoi, X. Wang, H. Li, Dynamic fusion with intra- and inter-modality attention flow for visual question answering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6639–6648. 23
[295] Z. Yu, J. Yu, Y. Cui, D. Tao, Q. Tian, Deep modular co-attention networks for visual question answering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6281–6290. 23
[296] H. You, R. Sun, Z. Wang, L. Chen, G. Wang, H. A. Ayyubi, K.-W. Chang, S.-F. Chang, Idealgpt: Iteratively decomposing vision and language reasoning via large language models, arXiv preprint arXiv:2305.14985 (2023). 23
[297] R. Zhang, X. Hu, B. Li, S. Huang, H. Deng, Y. Qiao, P. Gao, H. Li, Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15211–15222. 23
[298] T. Q. Nguyen, J. Salazar, Transformers without tears: Improving the normalization of self-attention, CoRR abs/1910.05895 (2019). 24
[299] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pre-training approach, arXiv preprint arXiv:1907.11692 (2019). 24, 30
[300] X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine, D. Song, Koala: A dialogue model for academic research, Blog post (April 2023). URL https://bair.berkeley.edu/blog/2023/04/03/koala/ 25
[301] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al., The pile: An 800gb dataset of diverse text for language modeling, arXiv preprint arXiv:2101.00027 (2020). 28, 30
[302] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. Villanova del Moral, T. Le Scao, L. Von Werra, C. Mou, E. González Ponferrada, H. Nguyen, et al., The bigscience roots corpus: A 1.6 tb composite multilingual dataset, Advances in Neural Information Processing Systems 35 (2022) 31809–31826. 28
[303] Wikipedia. URL https://en.wikipedia.org/wiki/Main_Page 28
[304] Together Computer, Redpajama: An open source recipe to reproduce llama training dataset (Apr. 2023). URL https://github.com/togethercomputer/RedPajama-Data 28
[305] O. Honovich, T. Scialom, O. Levy, T. Schick, Unnatural instructions: Tuning language models with (almost) no human labor, arXiv preprint arXiv:2212.09689 (2022). 28
[306] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al., Training a helpful and harmless assistant with reinforcement learning from human feedback, arXiv preprint arXiv:2204.05862 (2022). 28
[307] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, arXiv preprint arXiv:2009.03300 (2020). 26, 29
[308] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al., Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, arXiv preprint arXiv:2206.04615 (2022). 26, 29
[309] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, Glue: A multi-task benchmark and analysis platform for natural language understanding, arXiv preprint arXiv:1804.07461 (2018). 26, 29
[310] Y. Yao, Q. Dong, J. Guan, B. Cao, Z. Zhang, C. Xiao, X. Wang, F. Qi, J. Bao, J. Nie, et al., Cuge: A chinese language understanding and generation evaluation benchmark, arXiv preprint arXiv:2112.13610 (2021). 29
[311] L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, et al., Clue: A chinese language understanding evaluation benchmark, arXiv preprint arXiv:2004.05986 (2020). 29
[312] L. Xu, X. Lu, C. Yuan, X. Zhang, H. Xu, H. Yuan, G. Wei, X. Pan, X. Tian, L. Qin, et al., Fewclue: A chinese few-shot learning evaluation benchmark, arXiv preprint arXiv:2107.07498 (2021). 29
[313] E. M. Smith, M. Williamson, K. Shuster, J. Weston, Y.-L. Boureau, Can you put it all together: Evaluating conversational agents' ability to blend skills, arXiv preprint arXiv:2004.08449 (2020). 29
[314] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al., Holistic evaluation of language models, arXiv preprint arXiv:2211.09110 (2022). 29
[315] S. Park, J. Moon, S. Kim, W. I. Cho, J. Han, J. Park, C. Song, J. Kim, Y. Song, T. Oh, et al., Klue: Korean language understanding evaluation, arXiv preprint arXiv:2105.09680 (2021). 29
[316] S. Reddy, D. Chen, C. D. Manning, Coqa: A conversational question answering challenge, Transactions of the Association for Computational Linguistics 7 (2019) 249–266. 27, 29
[317] M. T. Pilehvar, J. Camacho-Collados, Wic: 10,000 example pairs for evaluating context-sensitive representations, arXiv preprint arXiv:1808.09121 (2018). 27, 29
[318] S. Merity, C. Xiong, J. Bradbury, R. Socher, Pointer sentinel mixture models, arXiv preprint arXiv:1609.07843 (2016). 28, 29
[319] J. W. Rae, A. Potapenko, S. M. Jayakumar, T. P. Lillicrap, Compressive transformers for long-range sequence modelling, arXiv preprint arXiv:1911.05507 (2019). 28, 29
[320] X. Liu, Q. Chen, C. Deng, H. Zeng, J. Chen, D. Li, B. Tang, Lcqmc: A large-scale chinese question matching corpus, in: Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1952–1962. 28, 29
[321] S. Iyer, N. Dandekar, K. Csernai, First quora dataset release: Question pairs, https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs. 29
[322] R. Rudinger, J. Naradowsky, B. Leonard, B. Van Durme, Gender bias in coreference resolution, arXiv preprint arXiv:1804.09301 (2018). 29
[323] M.-C. De Marneffe, M. Simons, J. Tonhauser, The commitmentbank: Investigating projection in naturally occurring discourse, in: Proceedings of Sinn und Bedeutung, Vol. 23, 2019, pp. 107–124. 29
[324] Z. Li, N. Ding, Z. Liu, H. Zheng, Y. Shen, Chinese relation extraction with multi-grained information and external linguistic knowledge, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4377–4386. 29
[325] J. Xu, J. Wen, X. Sun, Q. Su, A discourse-level named entity recognition and relation extraction dataset for chinese literature text, arXiv preprint arXiv:1711.07010 (2017). 29
[326] J. Chen, Q. Chen, X. Liu, H. Yang, D. Lu, B. Tang, The bq corpus: A large-scale domain-specific chinese corpus for sentence semantic equivalence identification, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 4946–4951. 29
[327] B. Liu, D. Niu, H. Wei, J. Lin, Y. He, K. Lai, Y. Xu, Matching article pairs with graphical decomposition and convolutions, arXiv preprint arXiv:1802.07459 (2018). 29
[328] P. Li, W. Li, Z. He, X. Wang, Y. Cao, J. Zhou, W. Xu, Dataset and neural recurrent sequence labeling model for open-domain factoid question answering, arXiv preprint arXiv:1607.06275 (2016). 29
[329] N. Peng, M. Dredze, Named entity recognition for chinese social media with jointly trained embeddings, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 548–554. 29
[330] W. Ling, D. Yogatama, C. Dyer, P. Blunsom, Program induction by rationale generation: Learning to solve and explain algebraic word problems, arXiv preprint arXiv:1705.04146 (2017). 29
[331] R. Weischedel, S. Pradhan, L. Ramshaw, M. Palmer, N. Xue, M. Marcus, A. Taylor, C. Greenberg, E. Hovy, R. Belvin, et al., Ontonotes release 4.0, LDC2011T03, Philadelphia, Penn.: Linguistic Data Consortium (2011). 29
[332] D. Vilares, C. Gómez-Rodríguez, Head-qa: A healthcare dataset for complex reasoning, arXiv preprint arXiv:1906.04701 (2019). 29
[333] S. L. Blodgett, L. Green, B. O'Connor, Demographic dialectal variation in social media: A case study of african-american english, arXiv preprint arXiv:1608.08868 (2016). 29
[334] N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, J. Allen, A corpus and evaluation framework for deeper understanding of commonsense stories, arXiv preprint arXiv:1604.01696 (2016). 28, 29
[335] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, R. Fernández, The lambada dataset: Word prediction requiring a broad discourse context, arXiv preprint arXiv:1606.06031 (2016). 28, 29
[336] B. Hu, Q. Chen, F. Zhu, Lcsts: A large scale chinese short text summarization dataset, arXiv preprint arXiv:1506.05865 (2015). 29
[337] Z. Shao, M. Huang, J. Wen, W. Xu, X. Zhu, Long and diverse text generation with planning-based hierarchical variational model, arXiv preprint arXiv:1908.06605 (2019). 29
[338] J. Novikova, O. Dušek, V. Rieser, The e2e dataset: New challenges for end-to-end generation, arXiv preprint arXiv:1706.09254 (2017). 29
[339] C. Zheng, M. Huang, A. Sun, Chid: A large-scale chinese idiom dataset for cloze test, arXiv preprint arXiv:1906.01265 (2019). 29
[340] Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al., Piqa: Reasoning about physical commonsense in natural language, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 7432–7439. 28, 29
[341] M. Joshi, E. Choi, D. S. Weld, L. Zettlemoyer, Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension, arXiv preprint arXiv:1705.03551 (2017). 28, 29, 31
[342] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? try arc, the ai2 reasoning challenge, arXiv preprint arXiv:1803.05457 (2018). 28, 29, 31
[343] S. Aroca-Ouellette, C. Paik, A. Roncone, K. Kann, Prost: Physical reasoning of objects through space and time, arXiv preprint arXiv:2106.03634 (2021). 29
[344] T. Mihaylov, P. Clark, T. Khot, A. Sabharwal, Can a suit of armor conduct electricity? a new dataset for open book question answering, arXiv preprint arXiv:1809.02789 (2018). 29
[345] T. C. Ferreira, C. Gardent, N. Ilinykh, C. Van Der Lee, S. Mille, D. Moussallem, A. Shimorina, The 2020 bilingual, bi-directional webnlg+ shared task overview and evaluation results (webnlg+ 2020), in: Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+), 2020. 29
[346] C. Xu, W. Zhou, T. Ge, K. Xu, J. McAuley, F. Wei, Blow the dog whistle: A chinese dataset for cant understanding with common sense and world knowledge, arXiv preprint arXiv:2104.02704 (2021). 29
[347] G. Lai, Q. Xie, H. Liu, Y. Yang, E. Hovy, Race: Large-scale reading comprehension dataset from examinations, arXiv preprint arXiv:1704.04683 (2017). 29
[348] E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y. Choi, P. Liang, L. Zettlemoyer, Quac: Question answering in context, arXiv preprint arXiv:1808.07036 (2018). 29
[349] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, J. Berant, Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies, Transactions of the Association for Computational Linguistics 9 (2021) 346–361. 29, 31
[350] J. Boyd-Graber, B. Satinoff, H. He, H. Daumé III, Besting the quiz master: Crowdsourcing incremental classification games, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 1290–1301. 29
[351] S. Zhang, X. Zhang, H. Wang, J. Cheng, P. Li, Z. Ding, Chinese medical question answer matching using end-to-end character-level multi-scale cnns, Applied Sciences 7 (8) (2017) 767. 29
[352] S. Zhang, X. Zhang, H. Wang, L. Guo, S. Liu, Multi-scale attentive interaction networks for chinese medical question answer selection, IEEE Access 6 (2018) 74061–74071. 29
[353] C. Xu, J. Pei, H. Wu, Y. Liu, C. Li, Matinf: A jointly labeled large-scale dataset for classification, question answering and summarization, arXiv preprint arXiv:2004.12302 (2020). 29
[354] K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, Winogrande: An adversarial winograd schema challenge at scale, Communications of the ACM 64 (9) (2021) 99–106. 27, 29
[355] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, Hellaswag: Can a machine really finish your sentence?, arXiv preprint arXiv:1905.07830 (2019). 29
[356] M. Roemmele, C. A. Bejan, A. S. Gordon, Choice of plausible alternatives: An evaluation of commonsense causal reasoning, in: AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, 2011, pp. 90–95. 29
[357] H. Levesque, E. Davis, L. Morgenstern, The winograd schema challenge, in: Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012. 27, 29
[358] A. Talmor, J. Herzig, N. Lourie, J. Berant, Commonsenseqa: A question answering challenge targeting commonsense knowledge, arXiv preprint arXiv:1811.00937 (2018). 29, 31
[359] M. Sap, H. Rashkin, D. Chen, R. LeBras, Y. Choi, Socialiqa: Commonsense reasoning about social interactions, arXiv preprint arXiv:1904.09728 (2019). 29
[360] K. Sun, D. Yu, D. Yu, C. Cardie, Investigating prior knowledge for challenging chinese machine reading comprehension, Transactions of the Association for Computational Linguistics 8 (2020) 141–155. 29
[361] S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, B. Van Durme, Record: Bridging the gap between human and machine commonsense reading comprehension, arXiv preprint arXiv:1810.12885 (2018). 29
[362] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, Squad: 100,000+ questions for machine comprehension of text, arXiv preprint arXiv:1606.05250 (2016). 29, 31
[363] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, K. Toutanova, Boolq: Exploring the surprising difficulty of natural yes/no questions, arXiv preprint arXiv:1905.10044 (2019). 29, 31
[364] P. Rajpurkar, R. Jia, P. Liang, Know what you don't know: Unanswerable questions for squad, arXiv preprint arXiv:1806.03822 (2018). 29, 31
[365] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, M. Gardner, Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs, arXiv preprint arXiv:1903.00161 (2019). 29, 31
[366] I. Dagan, O. Glickman, B. Magnini, The pascal recognising textual entailment challenge, in: Machine Learning Challenges Workshop, Springer, 2005, pp. 177–190. 29, 31
[367] Y. Chang, M. Narang, H. Suzuki, G. Cao, J. Gao, Y. Bisk, Webqa: Multihop and multimodal qa, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16495–16504. 29, 31
[368] Y. Cui, T. Liu, Z. Chen, W. Ma, S. Wang, G. Hu, Dataset for the first evaluation on chinese machine reading comprehension, arXiv preprint arXiv:1709.08299 (2017). 29
[369] Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. Ma, S. Wang, G. Hu, A span-extraction dataset for chinese machine reading comprehension, arXiv preprint arXiv:1810.07366 (2018). 29, 31
[370] Y. Cui, T. Liu, Z. Yang, Z. Chen, W. Ma, W. Che, S. Wang, G. Hu, A sentence cloze dataset for chinese machine reading comprehension, arXiv preprint arXiv:2004.03116 (2020). 29
[371] Y. Li, T. Liu, D. Li, Q. Li, J. Shi, Y. Wang, Character-based bilstm-crf incorporating pos and dictionaries for chinese opinion target extraction, in: Asian Conference on Machine Learning, PMLR, 2018, pp. 518–533. 29
[372] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, D. Roth, Looking beyond the surface: A challenge set for reading comprehension over multiple sentences, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 252–262. 29
[373] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al., Natural questions: a benchmark for question answering research, Transactions of the Association for Computational Linguistics 7 (2019) 453–466. 29
[374] C. C. Shao, T. Liu, Y. Lai, Y. Tseng, S. Tsai, Drcd: A chinese machine reading comprehension dataset, arXiv preprint arXiv:1806.00920 (2018). 29
[375] W. He, K. Liu, J. Liu, Y. Lyu, S. Zhao, X. Xiao, Y. Liu, Y. Wang, H. Wu, Q. She, et al., Dureader: a chinese machine reading comprehension dataset from real-world applications, arXiv preprint arXiv:1711.05073 (2017). 29
[376] H. Tang, J. Liu, H. Li, Y. Hong, H. Wu, H. Wang, Dureaderrobust: A chinese dataset towards evaluating the robustness of machine reading comprehension models, arXiv preprint arXiv:2004.11142 (2020). 29
[377] J. Welbl, N. F. Liu, M. Gardner, Crowdsourcing multiple choice science questions, arXiv preprint arXiv:1707.06209 (2017). 29
[378] C. Xiong, Z. Dai, J. Callan, Z. Liu, R. Power, End-to-end neural ad-hoc ranking with kernel pooling, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 55–64. 29
[379] A. Peñas, E. Hovy, P. Forner, Á. Rodrigo, R. Sutcliffe, R. Morante, Qa4mre 2011-2013: Overview of question answering for machine reading evaluation, in: Information Access Evaluation. Multilinguality, Multimodality, and Visualization: 4th International Conference of the CLEF Initiative, CLEF 2013, Valencia, Spain, September 23-26, 2013. Proceedings 4, Springer, 2013, pp. 303–320. 29
[380] S. Lim, M. Kim, J. Lee, Korquad1.0: Korean qa dataset for machine reading comprehension, arXiv preprint arXiv:1909.07005 (2019). 29
[381] C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han, Z. Hu, H. Wang, et al., Cail2018: A large-scale legal dataset for judgment prediction, arXiv preprint arXiv:1807.02478 (2018). 29
[382] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, et al., Measuring coding challenge competence with apps, arXiv preprint arXiv:2105.09938 (2021). 29, 31
[383] Y. Wang, X. Liu, S. Shi, Deep neural solver for math word problems, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 845–854. 29, 31
[384] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al., Training verifiers to solve math word problems, arXiv preprint arXiv:2110.14168 (2021). 29, 31
[385] J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, C. Sutton, Program synthesis with large language models, CoRR abs/2108.07732 (2021). 29
[386] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, et al., Language models are multilingual chain-of-thought reasoners, arXiv preprint arXiv:2210.03057 (2022). 29
[387] S. Roy, D. Roth, Solving general arithmetic word problems, arXiv preprint arXiv:1608.01413 (2016). 29
[388] S.-Y. Miao, C.-C. Liang, K.-Y. Su, A diverse corpus for evaluating and developing english math word problem solvers, arXiv preprint arXiv:2106.15772 (2021). 29
[389] R. Koncel-Kedziorski, S. Roy, A. Amini, N. Kushman, H. Hajishirzi, Mawps: A math word problem repository, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1152–1157. 29
[390] A. Patel, S. Bhattamishra, N. Goyal, Are nlp models really able to solve simple math word problems?, arXiv preprint arXiv:2103.07191 (2021). 29
[391] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, W.-t. Yih, D. Fried, S. Wang, T. Yu, Ds-1000: A natural and reliable benchmark for data science code generation, in: International Conference on Machine Learning, PMLR, 2023, pp. 18319–18345. 29
[392] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al., Program synthesis with large language models, arXiv preprint arXiv:2108.07732 (2021). 29
[393] Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, D. Kiela, Adversarial nli: A new benchmark for natural language understanding, arXiv preprint arXiv:1910.14599 (2019). 29, 31
[394] A. Williams, N. Nangia, S. R. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, arXiv preprint arXiv:1704.05426 (2017). 29
[395] R. T. McCoy, E. Pavlick, T. Linzen, Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference, arXiv preprint arXiv:1902.01007 (2019). 29
[396] J. Liu, L. Cui, H. Liu, D. Huang, Y. Wang, Y. Zhang, Logiqa: A challenge dataset for machine reading comprehension with logical reasoning, arXiv preprint arXiv:2007.08124 (2020). 29
[397] P. Lewis, B. Oğuz, R. Rinott, S. Riedel, H. Schwenk, Mlqa: Evaluating cross-lingual extractive question answering, arXiv preprint arXiv:1910.07475 (2019). 29
[398] A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. Schwenk, V. Stoyanov, Xnli: Evaluating cross-lingual sentence representations, arXiv preprint arXiv:1809.05053 (2018). 29, 31
[399] Y. Yang, Y. Zhang, C. Tar, J. Baldridge, Paws-x: A cross-lingual adversarial dataset for paraphrase identification, arXiv preprint arXiv:1908.11828 (2019). 29, 31
[400] S. Narayan, S. B. Cohen, M. Lapata, Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization, arXiv preprint arXiv:1808.08745 (2018). 29
[401] E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu, I. Vulić, A. Korhonen, Xcopa: A multilingual dataset for causal commonsense reasoning, arXiv preprint arXiv:2005.00333 (2020). 29
[402] A. Tikhonov, M. Ryabinin, It's all in the heads: Using attention heads as a baseline for cross-lingual transfer in commonsense reasoning, arXiv preprint arXiv:2106.12066 (2021). 29
[403] J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, J. Palomaki, Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages, Transactions of the Association for Computational Linguistics 8 (2020) 454–470. 29
[404] T. Scialom, P.-A. Dray, S. Lamprier, B. Piwowarski, J. Staiano, Mlsum: The multilingual summarization corpus, arXiv preprint arXiv:2004.14900 (2020). 29
[405] S. Lin, J. Hilton, O. Evans, Truthfulqa: Measuring how models mimic human falsehoods, arXiv preprint arXiv:2109.07958 (2021). 29, 32
[406] I. Augenstein, C. Lioma, D. Wang, L. C. Lima, C. Hansen, C. Hansen, J. G. Simonsen, Multifc: A real-world multi-domain dataset for evidence-based fact checking of claims, arXiv preprint arXiv:1909.03242 (2019). 29
[407] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, Fever: a large-scale dataset for fact extraction and verification, arXiv preprint arXiv:1803.05355 (2018). 29
[408] I. Mollas, Z. Chrysopoulou, S. Karlos, G. Tsoumakas, Ethos: an online hate speech detection dataset, arXiv preprint arXiv:2006.08328 (2020). 29, 32
[409] M. Nadeem, A. Bethke, S. Reddy, Stereoset: Measuring stereotypical bias in pretrained language models, arXiv preprint arXiv:2004.09456 (2020). 29, 32
[410] A. Parrish, A. Chen, N. Nangia, V. Padmakumar, J. Phang, J. Thompson, P. M. Htut, S. R. Bowman, Bbq: A hand-built bias benchmark for question answering, arXiv preprint arXiv:2110.08193 (2021). 29
[411] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, K.-W. Chang, Gender bias in coreference resolution: Evaluation and debiasing methods, arXiv preprint arXiv:1804.06876 (2018). 29
[412] N. Nangia, C. Vania, R. Bhalerao, S. R. Bowman, Crows-pairs: A challenge dataset for measuring social biases in masked language models, arXiv preprint arXiv:2010.00133 (2020). 29
[413] S. Gehman, S. Gururangan, M. Sap, Y. Choi, N. A. Smith, Realtoxicityprompts: Evaluating neural toxic degeneration in language models, arXiv preprint arXiv:2009.11462 (2020). 29
[414] D. Borkan, L. Dixon, J. Sorensen, N. Thain, L. Vasserman, Nuanced metrics for measuring unintended bias with real data for text classification, in: Companion Proceedings of the 2019 World Wide Web Conference, 2019, pp. 491–500. 29
[415] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, V. Logacheva, C. Monz, et al., Findings of the 2016 conference on machine translation, in: Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2016, pp. 131–198. 29
[416] B. Loïc, B. Magdalena, B. Ondřej, F. Christian, G. Yvette, G. Roman, H. Barry, H. Matthias, J. Eric, K. Tom, et al., Findings of the 2020 conference on machine translation (wmt20), in: Proceedings of the Fifth Conference on Machine Translation, Association for Computational Linguistics, 2020, pp. 1–55. 29
[417] W. Li, F. Qi, M. Sun, X. Yi, J. Zhang, Ccpm: A chinese classical poetry matching dataset, arXiv preprint arXiv:2106.01979 (2021). 29
[418] E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, J. Weston, Wizard of wikipedia: Knowledge-powered conversational agents, arXiv preprint arXiv:1811.01241 (2018). 29
[419] H. Rashkin, E. M. Smith, M. Li, Y.-L. Boureau, Towards empathetic open-domain conversation models: A new benchmark and dataset, arXiv preprint arXiv:1811.00207 (2018). 29
[420] E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe, et al., The second conversational intelligence challenge (convai2), in: The NeurIPS'18 Competition: From Machine Learning to Intelligent Conversations, Springer, 2020, pp. 187–208. 29
[421] H. Zhou, C. Zheng, K. Huang, M. Huang, X. Zhu, Kdconv: A chinese multi-domain dialogue dataset towards multi-turn knowledge-driven conversation, arXiv preprint arXiv:2004.04100 (2020). 29
[422] L. CO, Iflytek: a multiple categories chinese text classifier, competition official website (2019). 29
[423] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, J. Blackburn, The pushshift reddit dataset, in: Proceedings of the International AAAI Conference on Web and Social Media, Vol. 14, 2020, pp. 830–839. 30
[424] A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, M. Auli, Eli5: Long form question answering, arXiv preprint arXiv:1907.09190 (2019). 31
[425] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al., Benchmarking generalization via in-context instructions on 1,600+ language tasks, arXiv preprint arXiv:2204.07705 (2022). 31
[426] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C.-S. Wu, M. Zhong, P. Yin, S. I. Wang, et al., Unifiedskg: Unifying and multi-tasking structured knowledge grounding with text-to-text language models, arXiv preprint arXiv:2201.05966 (2022). 31
[427] Q. Ye, B. Y. Lin, X. Ren, Crossfit: A few-shot learning challenge for cross-task generalization in nlp, arXiv preprint arXiv:2104.08835 (2021). 31
[428] V. Aribandi, Y. Tay, T. Schuster, J. Rao, H. S. Zheng, S. V. Mehta, H. Zhuang, V. Q. Tran, D. Bahri, J. Ni, et al., Ext5: Towards extreme multi-task scaling for transfer learning, arXiv preprint arXiv:2111.10952 (2021). 31
[429] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1112–1122. doi:10.18653/v1/N18-1101. URL https://aclanthology.org/N18-1101 31
[430] Y. Zhang, J. Baldridge, L. He, PAWS: Paraphrase adversaries from word scrambling, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 1298–1308. doi:10.18653/v1/N19-1131. URL https://aclanthology.org/N19-1131 32
[431] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, D. Yang, Is chatGPT a general-purpose natural language processing task solver?, in: The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=u03xn1COsO 32
[432] M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh, N. Akhtar, J. Wu, S. Mirjalili, et al., Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects, TechRxiv (2023). 32
[433] X. L. Dong, S. Moon, Y. E. Xu, K. Malik, Z. Yu, Towards next-generation intelligent assistants leveraging llm techniques, in: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 5792–5793. 32
[434] K. Pandya, M. Holia, Automating customer service using langchain: Building custom open-source gpt chatbot for organizations, arXiv preprint arXiv:2310.05421 (2023). 32
[435] J. Li, B. Hui, G. Qu, B. Li, J. Yang, B. Li, B. Wang, B. Qin, R. Cao, R. Geng, et al., Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls, arXiv preprint arXiv:2305.03111 (2023). 32
[436] A. Rao, J. Kim, M. Kamineni, M. Pang, W. Lie, M. D. Succi, Evaluating chatgpt as an adjunct for radiologic decision-making, medRxiv (2023) 2023–02. 32
[437] M. Benary, X. D. Wang, M. Schmidt, D. Soll, G. Hilfenhaus, M. Nassir, C. Sigler, M. Knödler, U. Keller, D. Beule, et al., Leveraging large language models for decision support in personalized oncology, JAMA Network Open 6 (11) (2023) e2343689–e2343689. 32
[438] C. M. Chiesa-Estomba, J. R. Lechien, L. A. Vaira, A. Brunet, G. Cammaroto, M. Mayo-Yanez, A. Sanchez-Barrueco, C. Saga-Gutierrez, Exploring the potential of chat-gpt as a supportive tool for sialendoscopy clinical decision making and patient information support, European Archives of Oto-Rhino-Laryngology (2023) 1–6. 32
[439] S. Montagna, S. Ferretti, L. C. Klopfenstein, A. Florio, M. F. Pengo, Data decentralisation of llm-based chatbot systems in chronic disease self-management, in: Proceedings of the 2023 ACM Conference on Information Technology for Social Good, 2023, pp. 205–212. 32
[440] D. Bill, T. Eriksson, Fine-tuning a llm using reinforcement learning from human feedback for a therapy chatbot application (2023). 32
[441] M. Abbasian, I. Azimi, A. M. Rahmani, R. Jain, Conversational health agents: A personalized llm-powered agent framework, arXiv preprint arXiv:2310.02374 (2023). 32
[442] K. V. Lemley, Does chatgpt help us understand the medical literature?, Journal of the American Society of Nephrology (2023) 10–1681. 32
[443] S. Pal, M. Bhattacharya, S.-S. Lee, C. Chakraborty, A domain-specific next-generation large language model (llm) or chatgpt is required for biomedical engineering and research, Annals of Biomedical Engineering (2023) 1–4. 32
[444] Y. Du, S. Zhao, Y. Chen, R. Bai, J. Liu, H. Wu, H. Wang, B. Qin, The calla dataset: Probing llms' interactive knowledge acquisition from chinese medical literature, arXiv preprint arXiv:2309.04198 (2023). 32
[445] A. Abd-Alrazaq, R. AlSaad, D. Alhuwail, A. Ahmed, P. M. Healy, S. Latifi, S. Aziz, R. Damseh, S. A. Alrazak, J. Sheikh, et al., Large language models in medical education: Opportunities, challenges, and future directions, JMIR Medical Education 9 (1) (2023) e48291. 32
[446] A. B. Mbakwe, I. Lourentzou, L. A. Celi, O. J. Mechanic, A. Dagan, Chatgpt passing usmle shines a spotlight on the flaws of medical education (2023). 32
[447] S. Ahn, The impending impacts of large language models on medical education, Korean Journal of Medical Education 35 (1) (2023) 103. 32
[448] E. Waisberg, J. Ong, M. Masalkhi, A. G. Lee, Large language model (llm)-driven chatbots for neuro-ophthalmic medical education, Eye (2023) 1–3. 32
[449] G. Deiana, M. Dettori, A. Arghittu, A. Azara, G. Gabutti, P. Castiglia, Artificial intelligence and public health: Evaluating chatgpt responses to vaccination myths and misconceptions, Vaccines 11 (7) (2023) 1217. 32
[450] L. De Angelis, F. Baglivo, G. Arzilli, G. P. Privitera, P. Ferragina, A. E. Tozzi, C. Rizzo, Chatgpt and the rise of large language models: the new ai-driven infodemic threat in public health, Frontiers in Public Health 11 (2023) 1166120. 32
[451] N. L. Rane, A. Tawde, S. P. Choudhary, J. Rane, Contribution and performance of chatgpt and other large language models (llm) for scientific and research advancements: a double-edged sword, International Research Journal of Modernization in Engineering Technology and Science 5 (10) (2023) 875–899. 32
[452] W. Dai, J. Lin, H. Jin, T. Li, Y.-S. Tsai, D. Gašević, G. Chen, Can large language models provide feedback to students? a case study on chatgpt, in: 2023 IEEE International Conference on Advanced Learning Technologies (ICALT), IEEE, 2023, pp. 323–325. 32
[453] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier, et al., Chatgpt for good? on opportunities and challenges of large language models for education, Learning and Individual Differences 103 (2023) 102274. 32
[454] N. Rane, Enhancing the quality of teaching and learning through chatgpt and similar large language models: Challenges, future prospects, and ethical considerations in education (September 15, 2023). 32
[455] J. C. Young, M. Shishido, Investigating openai's chatgpt potentials in generating chatbot's dialogue for english as a foreign language learning, International Journal of Advanced Computer Science and Applications 14 (6) (2023). 32
[456] J. Irons, C. Mason, P. Cooper, S. Sidra, A. Reeson, C. Paris, Exploring the impacts of chatgpt on future scientific work, SocArXiv (2023). 32
[457] P. G. Schmidt, A. J. Meir, Using generative ai for literature searches and scholarly writing: Is the integrity of the scientific discourse in jeopardy?, arXiv preprint arXiv:2311.06981 (2023). 32
[458] Y. Zheng, H. Y. Koh, J. Ju, A. T. Nguyen, L. T. May, G. I. Webb, S. Pan, Large language models for scientific synthesis, inference and explanation, arXiv preprint arXiv:2310.07984 (2023). 33
[459] B. Aczel, E.-J. Wagenmakers, Transparency guidance for chatgpt usage in scientific writing, PsyArXiv (2023). 33
[460] S. Altmäe, A. Sola-Leyva, A. Salumets, Artificial intelligence in scientific writing: a friend or a foe?, Reproductive BioMedicine Online (2023). 33
[461] S. Imani, L. Du, H. Shrivastava, Mathprompter: Mathematical reasoning using large language models, arXiv preprint arXiv:2303.05398 (2023). 33
[462] Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, C. Zhou, Scaling relationship on learning mathematical reasoning with large language models, arXiv preprint arXiv:2308.01825 (2023). 33
[463] K. Yang, A. M. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. Prenger, A. Anandkumar, Leandojo: Theorem proving with retrieval-augmented language models, arXiv preprint arXiv:2306.15626 (2023). 33
[464] K. M. Collins, A. Q. Jiang, S. Frieder, L. Wong, M. Zilka, U. Bhatt, T. Lukasiewicz, Y. Wu, J. B. Tenenbaum, W. Hart, et al., Evaluating language models for mathematics through interactions, arXiv preprint arXiv:2306.01694 (2023). 33
[465] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, et al., Summary of chatgpt-related research and perspective towards the future of large language models, Meta-Radiology (2023) 100017. 33
[466] J. Drápal, H. Westermann, J. Savelka, Using large language models to support thematic analysis in empirical legal studies, arXiv preprint arXiv:2310.18729 (2023). 33
[467] J. Savelka, K. D. Ashley, M. A. Gray, H. Westermann, H. Xu, Explaining legal concepts with augmented large language models (gpt-4), arXiv preprint arXiv:2306.09525 (2023). 33
[468] N. Guha, J. Nyarko, D. E. Ho, C. Ré, A. Chilton, A. Narayana, A. Chohlas-Wood, A. Peters, B. Waldon, D. N. Rockmore, et al., Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models, arXiv preprint arXiv:2308.11462 (2023). 33
[469] J. Cui, Z. Li, Y. Yan, B. Chen, L. Yuan, Chatlaw: Open-source legal large language model with integrated external knowledge bases, arXiv preprint arXiv:2306.16092 (2023). 33
[470] H. Yang, X.-Y. Liu, C. D. Wang, Fingpt: Open-source financial large language models, arXiv preprint arXiv:2306.06031 (2023). 33
[471] Y. Li, S. Wang, H. Ding, H. Chen, Large language models in finance: A survey, in: Proceedings of the Fourth ACM International Conference on AI in Finance, 2023, pp. 374–382. 33
[472] A. Lykov, D. Tsetserukou, Llm-brain: Ai-driven fast generation of robot behaviour tree based on large language model, arXiv preprint arXiv:2305.19352 (2023). 33
[473] E. Billing, J. Rosén, M. Lamb, Language models for human-robot interaction, in: ACM/IEEE International Conference on Human-Robot Interaction, March 13–16, 2023, Stockholm, Sweden, ACM Digital Library, 2023, pp. 905–906. 33
[474] Y. Ye, H. You, J. Du, Improved trust in human-robot collaboration with chatgpt, IEEE Access (2023). 33
[475] Y. Ding, X. Zhang, C. Paxton, S. Zhang, Leveraging commonsense knowledge from large language models for task and motion planning, in: RSS 2023 Workshop on Learning for Task and Motion Planning, 2023. 33
[476] J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg, S. Rusinkiewicz, T. Funkhouser, Tidybot: Personalized robot assistance with large language models, arXiv preprint arXiv:2305.05658 (2023). 33
[477] E. Strubell, A. Ganesh, A. McCallum, Energy and policy considerations for deep learning in nlp, arXiv preprint arXiv:1906.02243 (2019). 34
[478] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the dangers of stochastic parrots: Can language models be too big?, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 610–623. 34
[479] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning (still) requires rethinking generalization, Communications of the ACM 64 (3) (2021) 107–115. 34
[480] M. Tänzer, S. Ruder, M. Rei, Memorisation versus generalisation in pre-trained language models, arXiv preprint arXiv:2105.00828 (2021). 34
[481] S. M. West, M. Whittaker, K. Crawford, Discriminating systems, AI Now (2019) 1–33. 34
[482] K. Valmeekam, A. Olmo, S. Sreedharan, S. Kambhampati, Large language models still can't plan (a benchmark for llms on planning and reasoning about change), arXiv preprint arXiv:2206.10498 (2022). 34
[483] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, et al., Siren's song in the ai ocean: A survey on hallucination in large language models, arXiv preprint arXiv:2309.01219 (2023). 34
[484] A. Webson, E. Pavlick, Do prompt-based models really understand the meaning of their prompts?, arXiv preprint arXiv:2109.01247 (2021). 34
[485] O. Shaikh, H. Zhang, W. Held, M. Bernstein, D. Yang, On second thought, let's not think step by step! bias and toxicity in zero-shot reasoning, arXiv preprint arXiv:2212.08061 (2022). 34
[486] B. C. Das, M. H. Amini, Y. Wu, Security and privacy challenges of large language models: A survey, arXiv preprint arXiv:2402.00888 (2024). 34
[487] X. Liu, H. Cheng, P. He, W. Chen, Y. Wang, H. Poon, J. Gao, Adversarial training for large neural language models, ArXiv (April 2020). URL https://www.microsoft.com/en-us/research/publication/adversarial-training-for-large-neural-language-models/ 34
[488] E. Shayegani, M. A. A. Mamun, Y. Fu, P. Zaree, Y. Dong, N. Abu-Ghazaleh, Survey of vulnerabilities in large language models revealed by adversarial attacks (2023). arXiv:2310.10844. 34
[489] X. Xu, K. Kong, N. Liu, L. Cui, D. Wang, J. Zhang, M. Kankanhalli, An llm can fool itself: A prompt-based adversarial attack (2023). arXiv:2310.13345. 34
[490] H. Zhao, H. Chen, F. Yang, N. Liu, H. Deng, H. Cai, S. Wang, D. Yin, M. Du, Explainability for large language models: A survey (2023). arXiv:2309.01029. 35
[491] S. Huang, S. Mamidanna, S. Jangam, Y. Zhou, L. H. Gilpin, Can large language models explain themselves? a study of llm-generated self-explanations (2023). arXiv:2310.11207. 35
[492] H. Brown, K. Lee, F. Mireshghallah, R. Shokri, F. Tramèr, What does it mean for a language model to preserve privacy?, in: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022, pp. 2280–2292. 35
[493] R. Plant, V. Giuffrida, D. Gkatzia, You are what you write: Preserving privacy in the era of large language models, arXiv preprint arXiv:2204.09391 (2022). 35
[494] W. Niu, Z. Kong, G. Yuan, W. Jiang, J. Guan, C. Ding, P. Zhao, S. Liu, B. Ren, Y. Wang, Real-time execution of large-scale language models on mobile (2020). arXiv:2009.06823. 35
[495] C. Guo, J. Tang, W. Hu, J. Leng, C. Zhang, F. Yang, Y. Liu, M. Guo, Y. Zhu, Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization, in: Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–15. 35
[496] B. Meskó, E. J. Topol, The imperative for regulatory oversight of large language models (or generative ai) in healthcare, npj Digital Medicine 6 (1) (2023) 120. 35
[497] J. Zhang, X. Ji, Z. Zhao, X. Hei, K.-K. R. Choo, Ethical considerations and policy implications for large language models: Guiding responsible development and deployment, arXiv preprint arXiv:2308.02678 (2023). 35
[498] J. Mökander, J. Schuett, H. R. Kirk, L. Floridi, Auditing large language models: a three-layered approach, AI and Ethics (2023) 1–31. 35