Title: LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models

URL Source: https://arxiv.org/html/2304.01933

Zhiqiang Hu¹, Lei Wang², Yihuai Lan, Wanyu Xu⁴, Ee-Peng Lim², Lidong Bing³, Xing Xu⁵, Soujanya Poria¹, Roy Ka-Wei Lee¹

¹ Singapore University of Technology and Design

² Singapore Management University

³ DAMO Academy, Alibaba Group

⁴ Southwest Jiaotong University

⁵ University of Electronic Science and Technology of China

Abstract

The success of large language models (LLMs), like GPT-4 and ChatGPT, has led to the development of numerous cost-effective and accessible alternatives that are created by fine-tuning open-access LLMs with task-specific data (e.g., ChatDoctor) or instruction data (e.g., Alpaca). Among the various fine-tuning methods, adapter-based parameter-efficient fine-tuning (PEFT) is undoubtedly one of the most attractive topics, as it only requires fine-tuning a few external parameters instead of the entire LLM while achieving comparable or even better performance. To enable further research on PEFT methods for LLMs, this paper presents LLM-Adapters, an easy-to-use framework that integrates various adapters into LLMs and can execute these adapter-based PEFT methods for different tasks. The framework includes state-of-the-art open-access LLMs such as LLaMA, BLOOM, and GPT-J, as well as widely used adapters such as Series adapters, Parallel adapters, Prompt-based learning, and Reparametrization-based methods. Moreover, we conduct extensive empirical studies on the impact of adapter types, placement locations, and hyper-parameters on the optimal design of each adapter-based method. We evaluate the effectiveness of the adapters on fourteen datasets from two different reasoning tasks, Arithmetic Reasoning and Commonsense Reasoning. The results demonstrate that using adapter-based PEFT in smaller-scale LLMs (7B) with few extra trainable parameters yields comparable, and in some cases superior, performance to powerful LLMs (175B) in zero-shot inference on both reasoning tasks.

1 Introduction

Large language models (LLMs), such as ChatGPT OpenAI (2022) and GPT-4 OpenAI (2023), have demonstrated unprecedented performance across various natural language processing (NLP) tasks Qin et al. (2023) and multi-modal tasks Shen et al. (2023). These LLMs often possess sizes exceeding hundreds of billions of parameters and are closed-source. Consequently, this has spurred the development of accessible and cost-effective alternatives such as LLaMA Touvron et al. (2023). These alternatives involve fine-tuning open-source LLMs utilizing either task-specific data (e.g., ChatDoctor Yunxiang et al. (2023)) or instructional data (e.g., Alpaca Taori et al. (2023)). However, full-model fine-tuning (FFT) is computationally and storage-intensive, thereby presenting significant challenges in practical implementation.

Before the emergence of LLMs (e.g., LLaMA), a compelling solution called parameter-efficient fine-tuning (PEFT) Houlsby et al. (2019) was proposed in the NLP field, specifically for pre-trained models (e.g., BERT Devlin et al. (2018)), and it offers a promising approach for efficiently fine-tuning LLMs. The advantage of PEFT lies in its ability to fine-tune only a small set of external parameters rather than the entire backbone model while still achieving comparable or even superior performance Mangrulkar et al. (2022). Moreover, PEFT can effectively mitigate catastrophic forgetting in comparison to FFT Wang et al. (2022). As shown in Table 1, these advantages have resulted in the development of diverse PEFT modules, encompassing series adapters Houlsby et al. (2019); Wang et al. (2022); He et al. (2022b); Fu et al. (2021), parallel adapters He et al. (2022a), reparametrization-based methods Hu et al. (2021); Edalati et al. (2022), and prompt-based learning methods Lester et al. (2021); Li and Liang (2021).

By incorporating these PEFT modules into backbone models (i.e., LLMs), we can capitalize on the remarkable capabilities of backbone models without requiring extensive computational resources. This opens up opportunities for a broader range of applications, enabling even those with limited access to high-performance computing to harness the power of LLMs in their specific tasks. Despite the success of PEFT for pre-trained models, it remains unclear which PEFT module, in combination with which layer and hyperparameter configuration, is most suitable for a given task or dataset when meeting LLMs (e.g., LLaMA Touvron et al. (2023)). Therefore, further investigation is needed to determine the optimal PEFT setup that maximizes performance across different tasks and datasets.

Motivated by this, in this paper, we conduct a comprehensive empirical study of PEFT of three representative open-source LLMs: BLOOM Muennighoff et al. (2022), GPT-J Wang and Komatsuzaki (2021), and LLaMA Touvron et al. (2023). Specifically, we undertake an empirical study to address the following three research questions: (i) What is the optimal placement and configuration of different PEFT methods? (ii) How do different adapters perform across downstream tasks? And (iii) What are the differences in performance between in-distribution (ID) and out-of-distribution (OOD) scenarios for PEFT methods? The findings of our study are as follows:

    1. The optimal placement for the series adapter, parallel adapter, and LoRA is after the MLP layers, parallel with the MLP layers, and after both the Attention layers and MLP layers simultaneously, respectively;
    2. Smaller language models with the PEFT approach can attain competitive or superior performance on specific tasks compared to larger language models. For instance, LLaMA-13B with LoRA can outperform GPT-3.5 (>175B) on MultiArith, AddSub, and SingleEq;
    3. The ID fine-tuned LLaMA-13B with adapters outperforms ChatGPT on commonsense reasoning tasks, indicating that smaller language models have the potential to outperform larger ones on specific tasks given ID fine-tuning data.

Our contributions can be summarized as follows:

  • We conduct a comprehensive empirical study of various PEFT methods applied to different open-source LLMs.

  • To facilitate our empirical study, we construct two high-quality training datasets to enhance PEFT performance on math reasoning and commonsense reasoning tasks.

  • We develop a user-friendly framework, LLM-Adapters, which seamlessly integrates diverse adapters into LLMs, empowering researchers to implement adapter-based PEFT methods for a wide range of tasks.

  • We conduct extensive experiments to answer the three research questions and to serve as inspiration for future research.

2 PEFT Overview


Figure 1: A detailed illustration of the model architectures of three different adapters: (a) Prefix-Tuning, (b) LoRA, (c) Series Adapter, and (d) Parallel Adapter.

Table 1: The PEFT methods are categorized based on the four common basic methods. "Prompt" represents prompt-based learning methods, "Repara" denotes reparametrization-based methods, "Series" is Series Adapter, while "Parallel" represents Parallel Adapter.

In this section, we provide a brief overview of four parameter-efficient fine-tuning (PEFT) methods: prompt-based learning Li and Liang (2021), reparametrization-based methods Hu et al. (2021), series adapters Houlsby et al. (2019), and parallel adapters He et al. (2022a).

Prompt-based learning.

As shown in Figure 1(a), prompt-based learning transforms the discrete optimization problem of finding the optimal hard prompt into a continuous (soft) prompt. To achieve this, Lester et al. (2021) proposed the concept of prompt tuning, where a trainable tensor is added as a prefix to the input embeddings. Another approach, called Prefix Tuning Li and Liang (2021), independently explored the addition of soft prompts to the hidden states of all layers. Intrinsic Prompt Tuning Qin et al. (2021) employs an autoencoder to compress and decompress the soft prompt. We take learnable vectors incorporated into the attention layer as an example of prompt-based learning, which can be formulated as follows:

$$H_o = \mathrm{Attn}(H_i W_Q,\ [P_K; H_i W_K],\ [P_V; H_i W_V]), \qquad (1)$$

where $H_i \in \mathbb{R}^{T \times d}$ and $H_o \in \mathbb{R}^{T \times d}$ are the input and output of the attention layer, respectively. Note that $T$ is the maximum input length and $d$ is the vector dimension. $P_K \in \mathbb{R}^{L \times d}$ and $P_V \in \mathbb{R}^{L \times d}$ are the learnable vectors for PEFT. $L$ is the number of learnable tokens, which is discussed in detail in the experiment section. $Q, K, V$ denote the query, key, and value vectors of the attention module, respectively.
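Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative single-head, unbatched version (the `softmax` helper and the toy shapes are our assumptions, not the paper's implementation): the frozen projections produce keys and values, and the only trainable tensors are the prepended prefixes $P_K$ and $P_V$.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prefix_attention(H_i, W_Q, W_K, W_V, P_K, P_V):
    """Eq. (1): learnable prefixes P_K, P_V (L x d) are prepended to the
    projected keys and values; only P_K and P_V would be trained."""
    Q = H_i @ W_Q                                  # (T, d)
    K = np.concatenate([P_K, H_i @ W_K], axis=0)   # (L + T, d)
    V = np.concatenate([P_V, H_i @ W_V], axis=0)   # (L + T, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (T, L + T)
    return softmax(scores) @ V                     # (T, d)

T, d, L = 4, 8, 2   # toy sizes: sequence length, hidden size, prefix length
rng = np.random.default_rng(0)
H_i = rng.normal(size=(T, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
P_K, P_V = rng.normal(size=(L, d)), rng.normal(size=(L, d))
H_o = prefix_attention(H_i, W_Q, W_K, W_V, P_K, P_V)
assert H_o.shape == (T, d)   # the prefix lengthens K/V but not the output
```

Note that the prefix changes the attention distribution without changing the output shape, which is why it can be dropped into a frozen backbone.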

Reparametrization-based methods.

Methods of this type aim to transform network weights using a low-rank technique. This approach effectively reduces the number of trainable parameters while preserving the ability to handle high-dimensional matrices. Intrinsic SAID Aghajanyan et al. (2020) investigates the intrinsic dimensionality of fine-tuning within a low-rank subspace. LoRA Hu et al. (2021) introduces a simple approach to updating the parameters of a weight matrix by decomposing it into a product of two low-rank matrices. KronA Edalati et al. (2022) improves upon the matrix-factorization aspect of LoRA by utilizing the Kronecker product. We take LoRA as an example of reparametrization-based learning, which can be formulated as follows:

$$H_o = H_i W_0 + H_i \Delta W = H_i W_0 + H_i B A, \qquad (2)$$

where $W_0 \in \mathbb{R}^{d \times d}$ can be any pre-trained weight matrix, including weights in the MLP or Attention layer, and $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ are low-rank matrices whose product approximates $\Delta W$. The rank $r \ll d$ is an important hyper-parameter for LoRA.
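Eq. (2) translates almost directly into code. The sketch below is a minimal NumPy illustration (the toy shapes and the choice of which factor to zero-initialize are our assumptions); zero-initializing one factor makes $\Delta W = 0$, so the adapted layer starts exactly at the pre-trained behavior.

```python
import numpy as np

def lora_forward(H_i, W0, A, B):
    """Eq. (2): H_o = H_i W0 + H_i (B A). W0 stays frozen; only the
    low-rank factors B (d x r) and A (r x d) would be trained."""
    return H_i @ W0 + H_i @ B @ A

d, r, T = 16, 4, 3
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, d))          # frozen pre-trained weight
B = rng.normal(size=(d, r)) * 0.01    # small random down-projection factor
A = np.zeros((r, d))                  # zero init -> Delta W = 0 at start
H_i = rng.normal(size=(T, d))
# With A = 0, the adapted layer matches the frozen layer exactly:
assert np.allclose(lora_forward(H_i, W0, A, B), H_i @ W0)
```

The parameter saving comes from training $2rd$ values per matrix instead of $d^2$; with $d = 4096$ and $r = 32$, that is roughly 1.6% of the original weights.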

Series Adapter.

Series adapters involve incorporating additional learnable modules in a sequential manner within a specific sublayer. In their study, Houlsby et al. (2019) proposed integrating fully-connected networks after the attention and FFN layers in the Transformer model Vaswani et al. (2017). Another finding, by Pfeiffer et al. (2020), revealed that comparable performance can be achieved by inserting the adapter solely after the self-attention layer, instead of using two adapters per transformer block. AdaMix (Wang et al., 2022) introduces a method that utilizes multiple series adapters in a mixture-of-experts (MoE) fashion. Compacter (Henderson et al., 2021) utilizes the Kronecker product, low-rank matrices, and parameter sharing across layers to generate adapter weights; this technique aims to reduce the computational complexity of the adapters while maintaining their performance. The Series Adapter can be formulated as follows:

$$H_o \leftarrow H_o + f(H_o W_{\mathrm{down}}) W_{\mathrm{up}}, \qquad (3)$$

where the output $H_o$ of a specific layer, such as the MLP layer, is first down-projected by $W_{\mathrm{down}} \in \mathbb{R}^{d \times r}$ to a lower dimension $r$, and then up-projected back by $W_{\mathrm{up}} \in \mathbb{R}^{r \times d}$ to the original dimension $d$. $f$ is a non-linear function. We discuss the choice of $r$ in the experiment section.
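A minimal NumPy sketch of Eq. (3), using ReLU as an illustrative choice of the non-linearity $f$ (the paper does not pin down $f$ here). Zero-initializing $W_{\mathrm{up}}$ makes the residual branch vanish, so the adapter is an identity mapping at the start of training, an assumption common to bottleneck adapters but ours in this sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def series_adapter(H_o, W_down, W_up, f=relu):
    """Eq. (3): down-project the sublayer output to rank r, apply a
    non-linearity, up-project back, and add a residual connection."""
    return H_o + f(H_o @ W_down) @ W_up

d, r, T = 16, 4, 3
rng = np.random.default_rng(0)
H_o = rng.normal(size=(T, d))            # sublayer output being adapted
W_down = rng.normal(size=(d, r)) * 0.01  # (d x r) bottleneck projection
W_up = np.zeros((r, d))                  # zero init: adapter starts as a no-op
assert np.allclose(series_adapter(H_o, W_down, W_up), H_o)
```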

Parallel Adapter.

Parallel adapters He et al. (2022a) incorporate additional learnable modules in parallel with distinct sublayers within the backbone model. The parallel adapter can be formulated as follows:

$$H_o \leftarrow H_o + f(H_i W_{\mathrm{down}}) W_{\mathrm{up}}, \qquad (4)$$

where $H_i$ ($H_o$) is the input (output) of a specific layer. Expanding on this concept, the Multi-head Parallel Adapter uses parallel adapters to modify the outputs of the attention heads. The Scaled Parallel Adapter is a variant that applies the composition and insertion format of LoRA Hu et al. (2021) to adapters. Another approach, Ladder Side-Tuning Sung et al. (2022), trains a lightweight ladder side network that accepts intermediate activations from the backbone network through shortcut connections (ladders).
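The only difference from the series form in Eq. (3) is that the bottleneck reads the sublayer *input* $H_i$ rather than its output, so the adapter runs alongside the sublayer instead of after it. A toy NumPy sketch (shapes, ReLU, and zero initialization are our illustrative assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def parallel_adapter(H_i, H_o, W_down, W_up, f=relu):
    """Eq. (4): the bottleneck consumes the sublayer input H_i, and its
    result is added to the sublayer output H_o (a parallel branch)."""
    return H_o + f(H_i @ W_down) @ W_up

d, r, T = 16, 4, 3
rng = np.random.default_rng(0)
H_i = rng.normal(size=(T, d))            # sublayer input
H_o = rng.normal(size=(T, d))            # stands in for the sublayer output
W_down = rng.normal(size=(d, r)) * 0.01
W_up = np.zeros((r, d))                  # zero init: branch starts as a no-op
assert np.allclose(parallel_adapter(H_i, H_o, W_down, W_up), H_o)
```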

3 Experiment Setup

3.1 Benchmarks

We conduct extensive empirical studies on fourteen benchmark datasets from two categories of reasoning problems.

Arithmetic Reasoning: (1) the GSM8K (Cobbe et al., 2021) dataset consists of high-quality, linguistically diverse grade-school math word problems created by human problem writers; (2) the SVAMP (Patel et al., 2021) benchmark consists of one-unknown arithmetic word problems for students up to grade 4, created by making simple changes to a set of problems from an existing dataset; (3) the MultiArith (Roy and Roth, 2016) dataset of math word problems requiring multiple reasoning steps and operations; (4) the AddSub (Hosseini et al., 2014) dataset of addition and subtraction arithmetic word problems; (5) the AQuA (Ling et al., 2017) dataset of algebraic word problems with natural-language rationales; and (6) the SingleEq (Koncel-Kedziorski et al., 2015) dataset of grade-school algebra word problems that map to single equations of varying length.

Commonsense Reasoning: (1) the BoolQ (Clark et al., 2019) dataset is a question-answering dataset of 15,942 yes/no questions that occur naturally, generated in unprompted and unconstrained settings; (2) the PIQA (Bisk et al., 2020) dataset of questions with two candidate solutions requiring physical commonsense to answer; (3) the SIQA (Sap et al., 2019) dataset focuses on reasoning about people's actions and their social implications; (4) the HellaSwag dataset of commonsense NLI questions, each including a context and several endings that complete the context; (5) the WinoGrande (Sakaguchi et al., 2021) dataset is formulated as a fill-in-the-blank task with binary options, where the goal is to choose the right option for a given sentence requiring commonsense reasoning; (6) the ARC-c and (7) ARC-e datasets are the Challenge Set and Easy Set of ARC (Clark et al., 2018), genuine grade-school-level multiple-choice science questions; and (8) the OBQA dataset contains questions requiring multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. Table 2 shows the dataset statistics.

Table 2: Details of datasets being evaluated. Math: arithmetic reasoning. CS: commonsense reasoning.

3.2 Fine-tuning Data Collection

In order to perform fine-tuning on adapters, we acquire two high-quality training datasets specifically designed for math reasoning and commonsense reasoning. Table 2 reveals that only GSM8K and AQuA datasets provide training sets for arithmetic reasoning. To enhance the diversity of our data, we incorporate the training sets from GSM8K, MAWPS, MAWPS-single Koncel-Kedziorski et al. (2016), and select 1000 examples from AQuA for the purpose of collecting the fine-tuning data. However, it is worth noting that the chosen datasets solely offer equations and corresponding answers. In order to augment the reasoning capabilities of our model, particularly in terms of providing step-by-step rationales, we leverage ChatGPT as the teacher model. By utilizing zero-shot chain-of-thought prompts, ChatGPT generates reasoning steps. We have included the specific prompt templates used to collect the math reasoning dataset in Appendix A.1. To ensure the quality of the data, we eliminate samples that contain incorrect answers. As a result, we obtain a set of 10K math reasoning samples, referred to as Math10K, which we consider for further analysis and fine-tuning.
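The answer-checking step described above (dropping teacher-generated rationales whose final answer is wrong) can be sketched as follows; the `extract_final_answer` helper, the sample fields, and the last-number heuristic are our illustrative assumptions, not the authors' actual pipeline:

```python
import re

def extract_final_answer(rationale: str):
    """Heuristic: take the last number in the generated rationale."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", rationale)
    return float(nums[-1]) if nums else None

def filter_samples(samples):
    """Keep only samples whose rationale ends in the gold answer."""
    return [s for s in samples
            if extract_final_answer(s["rationale"]) == s["gold_answer"]]

samples = [
    {"rationale": "36 * 2/3 = 24. 24 * 24 = 576.", "gold_answer": 576.0},
    {"rationale": "24 x 25 = 600.", "gold_answer": 576.0},  # wrong, dropped
]
print(len(filter_samples(samples)))  # 1
```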

To facilitate fine-tuning in the domain of commonsense reasoning, we construct fine-tuning data by formatting the training sets from BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, and OBQA with pre-defined templates. As each dataset in the commonsense reasoning domain entails distinct tasks, we adopt a structured template by initially describing the task’s goal, followed by the corresponding content and answer. The template utilized for creating the fine-tuning data can be found in A.2. Upon completion of this process, we obtain a collection of 170K commonsense reasoning samples, which we refer to as Commonsense170K. These datasets will be made publicly available to encourage further research and exploration in this area.

3.3 Implementations

To facilitate the seamless utilization of PEFT methods in both research and practical applications, we have developed a user-friendly framework, LLM-Adapters, which seamlessly integrates diverse adapters into LLMs, empowering researchers to implement adapter-based PEFT methods for a wide range of tasks. We utilize LLaMA (7B, 13B) Touvron et al. (2023), BLOOMz (7B) Muennighoff et al. (2022), and GPT-J (6B) Wang and Komatsuzaki (2021) as the base models for our experiments. As for the four categories of PEFT methods, we select Prefix-Tuning Li and Liang (2021), Series Adapter Houlsby et al. (2019), LoRA Hu et al. (2021), and Parallel adapter He et al. (2022a) as representative candidates to examine their efficacy. For consistency across all fine-tuning experiments, we maintain a batch size of 16. The learning rate for Prefix-Tuning is set to 3e-2, while the rest of the methods adopt a learning rate of 3e-4. Each of the PEFT methods is fine-tuned for three epochs on the fine-tuning datasets. It is important to note that we fine-tune a single model for either the math or commonsense reasoning task, and subsequently evaluate its performance across all corresponding datasets.
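The hyper-parameters above can be collected into a small configuration sketch for reference; the dictionary layout and key names are ours, not the actual command-line interface of the LLM-Adapters repository:

```python
# Fine-tuning settings reported in Section 3.3 (illustrative structure only).
FINETUNE_CONFIG = {
    "base_models": ["LLaMA-7B", "LLaMA-13B", "BLOOMz-7B", "GPT-J-6B"],
    "batch_size": 16,
    "epochs": 3,
    "learning_rate": {
        "prefix_tuning": 3e-2,     # prompt-based method uses a larger LR
        "series_adapter": 3e-4,
        "parallel_adapter": 3e-4,
        "lora": 3e-4,
    },
    # one model per task family, evaluated on all of its datasets
    "tasks": ["math_reasoning", "commonsense_reasoning"],
}
assert FINETUNE_CONFIG["learning_rate"]["prefix_tuning"] == 3e-2
```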

4 Experiment Results

4.1 Placement and Configuration

To address the research question, "What is the optimal placement and configuration for various types of adapters?", we employ LLaMA-7B as the base model to assess different adapter settings within the context of the math reasoning task. Our empirical study begins by determining the most effective placement for the Series Adapter, Parallel Adapter, and LoRA. Prefix-Tuning is excluded from this analysis since its placement is predetermined. For the Series Adapter, we explore its placement after the multi-head attention layers, the MLP layers, or both. As for the Parallel Adapter and LoRA, we integrate them into the multi-head attention layers, the MLP layers, or both, in order to assess their respective performances. The detailed results on each dataset are shown in Appendix A.3. Figure 2 shows the average accuracy on math reasoning datasets. We observe that for the Series Adapter, the best position is after the MLP layers, achieving an average accuracy of 59.5% on the math reasoning datasets. As for the Parallel Adapter, placing it within the MLP layers achieves the best performance of 61.7%. Regarding LoRA, we need to insert it simultaneously into both the Multi-head Attention layers and the MLP layers to achieve the best performance of 60%.


Figure 2: The average accuracy of different adapter locations on math reasoning datasets.


Figure 3: The average accuracy of different variable settings on math reasoning datasets, where "vt" refers to the number of virtual tokens, "bn" denotes the bottleneck size, and "r" is the LoRA rank.

Table 3: Accuracy comparison of LLMs with different adapters on six math reasoning datasets. We use GPT-3.5 text-Davinci-003 for Zero-shot CoT Kojima et al. (2022) as the baseline.

In order to determine the optimal configuration of the various adapters, we analyze the most crucial variable for each type of PEFT method, comparing the average accuracy on math reasoning datasets. The placement of adapters follows the optimal settings derived from the placement analysis. For Prefix-Tuning, we assess performance with the number of virtual tokens ($vt$) set at [10, 20, 30, 40]. For the Series and Parallel Adapters, we evaluate the impact of the bottleneck size ($bn$) with values of [64, 128, 256, 512]. For LoRA, we examine the influence of different rank values ($r$) at [4, 8, 16, 32]. The detailed results for each dataset can be found in Appendix A.4. Figure 3 presents the average accuracy for the different variables on math reasoning datasets. It can be noted that when the number of virtual tokens in Prefix-Tuning is set to 10, Prefix-Tuning attains an average accuracy of 42.0% on math reasoning datasets. By configuring the bottleneck dimension to 256, the Series and Parallel Adapters demonstrate their highest level of performance. However, when the bottleneck size is increased to 512, the accuracy of both decreases. The typical setting for the LoRA rank is 8, but we find that a larger rank can enhance performance: when the rank is increased from 8 to 32, the average accuracy of LoRA rises from 60.0% to 61.9%.

Based on our comprehensive placement and configuration analysis, we have determined the optimal settings for each adapter, which will be consistently employed throughout the subsequent experiments.

  • For Prefix-Tuning, we set the number of virtual tokens to 10.

  • For the Series and Parallel Adapters, we incorporate them into the MLP layers with a bottleneck size of 256.

  • For LoRA, we integrate it into both the Multi-head Attention layers and the MLP layers with rank 32.

4.2 Arithmetic Reasoning

Table 4: Accuracy comparison of LLMs with different adapters on eight commonsense reasoning datasets. The ChatGPT results are obtained by Zero-shot CoT with gpt-3.5-turbo API.

In order to evaluate the effectiveness of adapters on the Arithmetic Reasoning task, we conduct a study where adapters are fine-tuned on the Math10K dataset and subsequently evaluated on six different math reasoning datasets. As our baseline, we utilize the GPT-3.5 model, specifically the text-Davinci-003 variant, for Zero-shot CoT following Kojima et al. (2022). The results of the GPT-3.5 model can be found in Wang et al. (2023). Table 3 reports the performance of the different PEFT methods and the baseline. On average, the GPT-3.5 model (175B) outperforms adapter-based PEFT LLMs in terms of accuracy. However, for simpler math reasoning datasets such as MultiArith, AddSub, and SingleEq, adapter-based methods like LLaMA-13B with LoRA outperform GPT-3.5. Notably, LLaMA-13B with LoRA achieves an average accuracy of 65.4%, which is approximately 92.8% of the performance exhibited by GPT-3.5. This suggests that with sufficient task-specific training data, adapter-based PEFT of smaller LLMs has the potential to achieve performance comparable to that of extremely large language models. Adapter-based PEFT enables smaller language models to outperform GPT-3.5 specifically on simpler tasks such as MultiArith, AddSub, and SingleEq. However, challenges persist in more complex tasks like GSM8K and SVAMP, which require a higher level of language comprehension and proficiency from the underlying base model, resulting in a discernible performance gap. Regarding the different adapters employed, LoRA achieves remarkable performance while utilizing significantly fewer trainable parameters. This implies that excessive learnable parameters may not be necessary for task-specific fine-tuning. Overall, these findings demonstrate the potential for adapter-based PEFT of smaller LLMs to achieve high performance on specific tasks with few trainable parameters.

Table 5: An example randomly sampled from GSM8K. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

4.3 Commonsense Reasoning

Additionally, we assess the efficacy of various PEFT methods for commonsense reasoning tasks. The adapters undergo fine-tuning using the Commonsense170K dataset. Our baseline models for commonsense reasoning include GPT-3 (175B), PaLM (540B), and ChatGPT. The results for GPT-3 and PaLM can be found in the study by Touvron et al. (2023). To evaluate ChatGPT’s performance in commonsense reasoning, we employ the gpt-3.5-turbo API with a zero-shot CoT. The zero-shot CoT prompts align with the template used for collecting our commonsense fine-tuning dataset, as outlined in Appendix A.2. Table 4 presents the performance of the PEFT methods utilizing different LLMs alongside the baselines. Remarkably, LLaMA-13B with Series Adapter, Parallel Adapter, and LoRA outperform all the baselines, including ChatGPT, which has been hailed as the most impressive LLM to date. LLaMA-13B with Parallel Adapter achieves an average accuracy of 81.5%, representing a 4.5% improvement over ChatGPT. It is worth noting that all the training sets from the commonsense reasoning datasets are included in the fine-tuning data Commonsense170K. Furthermore, we observe that the performance of the PEFT methods is influenced by the underlying capabilities of the base models. LLaMA-7B and LLaMA-13B demonstrate superior commonsense reasoning abilities compared to the BLOOMz and GPT-J models.

4.4 ID and OOD Analysis

When comparing the performance of PEFT methods on math reasoning and commonsense reasoning tasks, we can observe that PEFT methods exhibit more remarkable results in the realm of commonsense reasoning. Moving forward, we will analyze the factors contributing to this phenomenon from both the in-distribution (ID) and out-of-distribution (OOD) perspectives. In the context of commonsense reasoning, the fine-tuning data set, Commonsense170K, encompasses all the training sets from the commonsense reasoning datasets. Notably, PEFT methods have demonstrated the ability to outperform ChatGPT. This observation implies that, by utilizing ID fine-tuning data, smaller language models like LLaMA-13B could surpass larger language models such as ChatGPT and PaLM in specific downstream tasks. However, when considering math reasoning tasks, the fine-tuning data set, Math10K, only includes the training sets of GSM8K and AQuA. In this regard, it has been observed that PEFT methods, particularly LLaMA-13B with LoRA, exhibit superior performance compared to GPT-3.5 on MultiArith, AddSub, and SingleEq. These findings suggest that PEFT methods can enhance the math reasoning abilities of LLMs and can be successfully applied to OOD datasets. Nonetheless, when evaluating the performance of PEFT methods on the ID datasets GSM8K and AQuA, a performance gap is still evident compared to GPT-3.5. This discrepancy is likely due to the higher complexity of GSM8K and AQuA datasets in terms of math reasoning, while the reasoning capabilities of smaller LLMs remain limited. Consequently, identifying strategies to improve the performance of PEFT methods on complex math reasoning tasks represents a potential avenue for future research.

5 Qualitative Study

The previous sections have presented the quantitative analysis. In this section, we will provide qualitative examples to demonstrate the quality of outputs from different models. Table 5 displays a randomly selected question from GSM8K along with the outputs of ChatGPT and LLaMA-13B models using various PEFT methods. More detailed examples can be found in Appendix A.5. ChatGPT demonstrates a comprehensive understanding of the question and generates two steps, "(36 * 2/3) = 24 square feet" and "(24 * 24) = 576 mosaic tiles," effectively solving the problem. However, the language understanding ability of LLaMA-13B-Prefix models is limited, leading LLaMA-13B-Prefix to take the wrong direction in the first step. On the other hand, LLaMA-13B with Series Adapter produces a high-quality answer by providing the crucial two steps and performing the correct calculations to obtain the accurate result. Interestingly, LLaMA-13B-Parallel and LLaMA-13B-LoRA generate almost identical rationales. However, LLaMA-13B-Parallel produces an incorrect answer due to a calculation error, stating "24 sq ft x 24 mosaic tiles per sq ft = 600 mosaic tiles". In general, when equipped with task-specific fine-tuning data, smaller language models like LLaMA-13B can generate impressive, high-quality answers that are comparable to those produced by ChatGPT.
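The calculation error noted above ("24 sq ft x 24 mosaic tiles per sq ft = 600 mosaic tiles") is mechanically checkable. As an illustrative aside (not part of the paper's pipeline), a small verifier can extract "a x b = c" steps from a generated rationale and flag arithmetic mistakes:

```python
# Illustrative sketch: detect multiplication errors in generated rationales.
# Matches patterns like "24 * 24 = 576" or "24 sq ft x 24 ... = 600",
# tolerating unit words between the operands and the result.
import re

STEP_RE = re.compile(
    r"(\d+(?:\.\d+)?)"   # first operand
    r"[^\dx*=]*[x*]\s*"  # 'x' or '*' operator, possibly after unit words
    r"(\d+(?:\.\d+)?)"   # second operand
    r"[^=\d]*=\s*"       # '=' sign, possibly after unit words
    r"(\d+(?:\.\d+)?)"   # claimed result
)

def check_multiplications(text: str):
    """Return (a, b, claimed, is_correct) for each multiplication step found."""
    return [
        (float(a), float(b), float(c), float(a) * float(b) == float(c))
        for a, b, c in STEP_RE.findall(text)
    ]

good = check_multiplications("(24 * 24) = 576 mosaic tiles")
bad = check_multiplications("24 sq ft x 24 mosaic tiles per sq ft = 600 mosaic tiles")
```

Such a checker distinguishes reasoning errors (a wrong plan) from pure calculation slips, which is exactly the failure mode LLaMA-13B-Parallel exhibits here.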

6 Conclusion

In this paper, we develop LLM-Adapters, a user-friendly framework that seamlessly integrates diverse adapters into LLMs, empowering researchers to implement adapter-based PEFT methods for a wide range of tasks. To evaluate different PEFT methods on downstream tasks, we construct two high-quality fine-tuning datasets to enhance PEFT performance on math reasoning and commonsense reasoning tasks. Using the LLM-Adapters toolkit and the constructed fine-tuning datasets, we conduct a comprehensive empirical study and answer research questions on the optimal placement and configuration of different PEFT methods, the impact of adapter architectures, and the influence of ID and OOD scenarios. We hope this work will encourage further research on PEFT methods for LLMs.

7 Limitations

There are two limitations to this work. Firstly, due to constrained computing resources, we were unable to evaluate the performance of larger language models such as LLaMA-33B and LLaMA-65B. It is anticipated that these larger models, possessing enhanced language understanding capabilities, would yield superior performance. Secondly, this paper does not delve into the exploration of combining different adapters. Given the extensive search space associated with the combination of various PEFT methods, we intend to explore this direction in future research endeavors.

References

  • Aghajanyan et al. (2020) Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. 2020. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Annual Meeting of the Association for Computational Linguistics.
  • Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
  • Chen et al. (2023) Jiaao Chen, Aston Zhang, Xingjian Shi, Mu Li, Alex Smola, and Diyi Yang. 2023. Parameter-efficient fine-tuning design spaces. arXiv preprint arXiv:2301.01821.
  • Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Edalati et al. (2022) Ali Edalati, Marzieh S. Tahaei, Ivan Kobyzev, V.Nia, James J. Clark, and Mehdi Rezagholizadeh. 2022. Krona: Parameter efficient tuning with kronecker adapter. ArXiv, abs/2212.10650.
  • Fu et al. (2021) Cheng Fu, Hanxian Huang, Xinyun Chen, Yuandong Tian, and Jishen Zhao. 2021. Learn-to-share: A hardware-friendly transfer learning framework exploiting computation and parameter sharing. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3469–3479. PMLR.
  • He et al. (2021) Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2021. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366.
  • He et al. (2022a) Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2022a. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations.
  • He et al. (2022b) Shwai He, Liang Ding, Daize Dong, Jeremy Zhang, and Dacheng Tao. 2022b. SparseAdapter: An easy approach for improving the parameter-efficiency of adapters. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2184–2190, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Henderson et al. (2021) James Henderson, Sebastian Ruder, et al. 2021. Compacter: Efficient low-rank hypercomplex adapter layers. In Advances in Neural Information Processing Systems.
  • Hosseini et al. (2014) Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In EMNLP, pages 523–533.
  • Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning.
  • Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. ArXiv, abs/2106.09685.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.
  • Koncel-Kedziorski et al. (2015) Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597.
  • Koncel-Kedziorski et al. (2016) Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A math word problem repository. In Proceedings of NAACL, pages 1152–1157.
  • Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. ArXiv, abs/2104.08691.
  • Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.
  • Ling et al. (2017) Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167.
  • Mangrulkar et al. (2022) Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, and Sayak Paul. 2022. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
  • Mao et al. (2021) Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei Han, Wen tau Yih, and Madian Khabsa. 2021. Unipelt: A unified framework for parameter-efficient language model tuning. ArXiv, abs/2110.07577.
  • Muennighoff et al. (2022) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
  • OpenAI (2022) OpenAI. 2022. Introducing chatgpt. https://openai.com/blog/chatgpt.
  • OpenAI (2023) OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
  • Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of NAACL, pages 2080–2094.
  • Pfeiffer et al. (2020) Jonas Pfeiffer, Ivan Vulic, Iryna Gurevych, and Sebastian Ruder. 2020. Mad-x: An adapter-based framework for multi-task cross-lingual transfer. In Conference on Empirical Methods in Natural Language Processing.
  • Qin et al. (2023) Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.
  • Qin et al. (2021) Yujia Qin, Xiaozhi Wang, Yusheng Su, Yankai Lin, Ning Ding, Jing Yi, Weize Chen, Zhiyuan Liu, Juanzi Li, Lei Hou, et al. 2021. Exploring universal intrinsic task subspace via prompt tuning. arXiv e-prints, pages arXiv–2110.
  • Roy and Roth (2016) Subhro Roy and Dan Roth. 2016. Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413.
  • Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
  • Sap et al. (2019) Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. 2019. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728.
  • Shen et al. (2023) Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. Hugginggpt: Solving AI tasks with chatgpt and its friends in huggingface. CoRR, abs/2303.17580.
  • Sung et al. (2022) Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. 2022. Lst: Ladder side-tuning for parameter and memory efficient transfer learning. ArXiv, abs/2206.06522.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Vu et al. (2021) Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou, and Daniel Cer. 2021. Spot: Better frozen model adaptation through soft prompt transfer. arXiv preprint arXiv:2110.07904.
  • Wang and Komatsuzaki (2021) Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
  • Wang et al. (2023) Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091.
  • Wang et al. (2022) Yaqing Wang, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, and Jianfeng Gao. 2022. Adamix: Mixture-of-adapter for parameter-efficient tuning of large language models. ArXiv, abs/2205.12410.
  • Yunxiang et al. (2023) Li Yunxiang, Li Zihan, Zhang Kai, Dan Ruilong, and Zhang You. 2023. Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge. arXiv preprint arXiv:2303.14070.

Appendix A Appendix

A.1 Math Reasoning Prompt Templates

We utilize ChatGPT to collect the math reasoning data for fine-tuning. Table 6 shows the prompt template used to query ChatGPT. The expression "Please give the steps" guides ChatGPT to generate reasoning steps, so that we can use the rationale information to fine-tune adapters. "Give the arabic numerals as the answer." guides ChatGPT to generate Arabic numerals as the final answer, making it easier to extract the answer from the outputs.
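The prompt construction and answer extraction just described can be sketched as follows. The two guiding phrases are quoted from the text above; the surrounding wording and the last-number extraction heuristic are illustrative assumptions, since the exact template lives in Table 6.

```python
# Sketch of the data-collection prompt and of parsing the numeric final answer
# from ChatGPT's output. The extraction heuristic (take the last number) is an
# assumption for illustration, not the paper's exact parser.
import re

def build_math_prompt(question: str) -> str:
    # The two guiding phrases quoted in the text: one elicits reasoning steps,
    # the other forces an Arabic-numeral final answer that is easy to parse.
    return (f"{question}\n"
            "Please give the steps. "
            "Give the arabic numerals as the answer.")

def extract_final_number(output: str):
    """Take the last number in the model output as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    return float(numbers[-1]) if numbers else None

ans = extract_final_number(
    "Step 1: 36 * 2/3 = 24. Step 2: 24 * 24 = 576. The answer is 576."
)
```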

Table 6: The prompt template used to collect math reasoning dataset for fine-tuning. An example from GSM8K is also included.

A.2 Commonsense Data Templates

As each dataset in the commonsense reasoning domain entails distinct tasks, we adopt a structured template by initially describing the task’s goal, followed by the corresponding content and answer. Table 7 shows the templates used to collect commonsense reasoning data for fine-tuning.
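A minimal sketch of this structured template is shown below. The field wording and the BoolQ-style example are illustrative assumptions; the exact per-dataset templates are listed in Table 7.

```python
# Illustrative sketch of the structured commonsense template: task goal first,
# then the instance content, then the answer. Wording is assumed, not quoted
# from Table 7.
def format_commonsense_example(task_goal: str, content: str, answer: str) -> str:
    return (f"{task_goal}\n\n"
            f"{content}\n\n"
            f"Answer: {answer}")

example = format_commonsense_example(
    "Please answer the following yes/no question.",   # task goal (BoolQ-style)
    "Question: can you see stars during the day?",    # instance content
    "no",                                             # gold answer
)
```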

Table 7: The data template of each dataset used to create commonsense reasoning data for fine-tuning.

A.3 Placement Analysis

Table 8 shows the performance of adapters placed at various locations, evaluated on the math reasoning datasets. The fine-tuning dataset for this study is Math10K, and the base model is LLaMA-7B. We observe that for the Series Adapter, the best position is after the MLP layers, achieving an average accuracy of 59.5% on the math reasoning datasets. For the Parallel Adapter, placing it within the MLP layers achieves the best performance of 61.7%. For LoRA, we need to insert it simultaneously into both the multi-head attention layers and the MLP layers to achieve the best performance of 60%.
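The bottleneck structure being placed in these experiments can be sketched as follows. This is a minimal NumPy illustration (not the framework's code), with toy dimensions; applying it to the MLP output corresponds to the best-performing Series Adapter position, while a Parallel Adapter would instead add its output alongside the MLP's.

```python
# Minimal bottleneck-adapter sketch: down-project, ReLU, up-project, residual.
# Dimensions and initialization are illustrative; real runs use e.g.
# hidden=4096 with bottleneck 256 (see Appendix A.4).
import numpy as np

rng = np.random.default_rng(0)

def make_adapter(hidden: int, bottleneck: int):
    """Return (W_down, W_up); W_up starts at zero so the adapter begins as an identity."""
    w_down = rng.normal(0.0, 0.02, size=(hidden, bottleneck))
    w_up = np.zeros((bottleneck, hidden))
    return w_down, w_up

def series_adapter(x, w_down, w_up):
    # h -> h + up(relu(down(h))): the residual form used by series adapters,
    # applied here to the MLP output
    return x + np.maximum(x @ w_down, 0.0) @ w_up

hidden, bottleneck = 16, 4
w_down, w_up = make_adapter(hidden, bottleneck)
x = rng.normal(size=(2, hidden))      # (batch, hidden) MLP output
y = series_adapter(x, w_down, w_up)   # identical to x until W_up is trained
```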

Table 8: An evaluation of the accuracy regarding the placement of adapters in various locations is conducted on math reasoning datasets. The fine-tuning dataset used for this analysis is Math10K. In this context, "Attn" refers to the multi-head attention layer, while "MLP" denotes the MLP layer. The base model employed for this study is LLaMA-7B.

A.4 Configuration Analysis

Table 9 shows the accuracy comparison across different variable settings for the PEFT methods on math reasoning datasets. The fine-tuning dataset used for this study is Math10K. When the number of virtual tokens in Prefix-Tuning is set to 10, Prefix-Tuning attains an average accuracy of 42.0% on the math reasoning datasets. With the bottleneck dimension set to 256, the Series and Parallel Adapters demonstrate their highest performance. However, when the bottleneck size is increased to 512, the accuracy of both the Series and Parallel Adapters decreases. The typical setting for the LoRA rank is 8, but we find that a larger rank can enhance the performance of LoRA. Remarkably, when the rank is increased to 32, LoRA achieves an accuracy of 61.9%.
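The rank's effect on trainable-parameter count is easy to make concrete. The sketch below is a back-of-the-envelope calculation under assumed LLaMA-7B-like shapes (hidden size 4096, 32 layers, LoRA on two projections per layer), not a measurement from the framework; the formula for an adapted weight of shape (d_in, d_out) is r * (d_in + d_out), the size of the two low-rank factors.

```python
# Back-of-the-envelope sketch (assumed dimensions): how LoRA's trainable
# parameter count grows linearly with rank r.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA factorizes the weight update as B @ A with A: (r, d_out), B: (d_in, r)
    return r * (d_in + d_out)

# LLaMA-7B-like shapes (illustrative): hidden size 4096, 32 layers,
# LoRA applied to two projection matrices per layer.
hidden, layers, matrices_per_layer = 4096, 32, 2
for r in (8, 32):
    total = layers * matrices_per_layer * lora_params(hidden, hidden, r)
    print(f"rank {r}: {total / 1e6:.1f}M trainable parameters")
```

Even at rank 32, the added parameters remain a small fraction of the 7B-parameter base model, which is why a larger rank is a cheap way to buy accuracy here.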

Table 9: The accuracy comparison regarding different settings of variable for PEFT methods on math reasoning datasets. The fine-tuning dataset used for this analysis is Math10K. In this context, "vt" refers to the number of virtual tokens, "bn" denotes the bottleneck size, while "r" is the LoRA rank. The base model employed for this study is LLaMA-7B.

A.5 Qualitative Examples

We will show examples randomly sampled from math reasoning and commonsense reasoning datasets in this section.

Table 10: An example randomly sampled from MultiArith. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

Table 11: An example randomly sampled from GSM8K. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

Table 12: An example randomly sampled from AddSub. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

Table 13: An example randomly sampled from AQuA. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

Table 14: An example randomly sampled from SingleEq. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

Table 15: An example randomly sampled from SVAMP. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

Table 16: An example randomly sampled from BoolQ. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

Table 17: An example randomly sampled from PIQA. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

Table 18: An example randomly sampled from SIQA. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

Table 19: An example randomly sampled from HellaSwag. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

Table 20: An example randomly sampled from WinoGrande. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

Table 21: An example randomly sampled from ARC-e. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

Table 22: An example randomly sampled from ARC-c. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.

Table 23: An example randomly sampled from OBQA. The outputs of ChatGPT and LLaMA-13B with different PEFT methods.
