Title: DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

URL Source: https://arxiv.org/html/2309.03883

Published Time: Tue, 12 Mar 2024 01:06:09 GMT

Yung-Sung Chuang†⋆, Yujia Xie‡, Hongyin Luo†, Yoon Kim†, James Glass†, Pengcheng He‡

† Massachusetts Institute of Technology, ‡ Microsoft

yungsung@mit.edu, yujiaxie@microsoft.com

{hyluo,yoonkim,glass}@mit.edu, herbert.he@gmail.com

Abstract

Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs that requires neither conditioning on retrieved external knowledge nor additional fine-tuning. Our approach obtains the next-token distribution by contrasting the differences in logits obtained from projecting the later layers versus earlier layers to the vocabulary space, exploiting the fact that factual knowledge in an LLM has generally been shown to be localized to particular transformer layers. We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts. DoLa consistently improves truthfulness across multiple-choice tasks and open-ended generation tasks, for example improving the performance of LLaMA family models on TruthfulQA by 12-17 absolute percentage points, demonstrating its potential in making LLMs reliably generate truthful facts.¹

¹ The source code is available at https://github.com/voidism/DoLa.

⋆ Work mainly done during an internship at Microsoft.

1 Introduction

Large language models (LLMs) have demonstrated great potential in numerous natural language processing (NLP) applications (Brown et al., 2020; OpenAI, 2022; 2023). However, despite the continued increase in performance and the emergence of new capabilities from scaling LLMs (Wei et al., 2022a), their tendency to “hallucinate”, i.e., generate content that deviates from real-world facts observed during pretraining (Ji et al., 2023), remains a persistent challenge. This represents a major bottleneck in their deployment, especially for high-stakes applications (e.g., clinical/legal settings) where reliable generation of trustworthy text is crucial.

While the exact reasons for LMs’ hallucinations are not fully understood, one possible cause is the maximum likelihood language modeling objective, which minimizes the forward KL divergence between the data and model distributions. This objective potentially results in a model with mass-seeking behavior, which causes the LM to assign non-zero probability to sentences that are not fully consistent with the knowledge embedded in the training data. Empirically, an LM trained with the next-word prediction objective on finite data has been shown to use linguistic knowledge to recognize superficial patterns, instead of recognizing and generating the real-world facts extracted from the training corpus (Ji et al., 2023).

From a model interpretability perspective, transformer LMs have been loosely shown to encode “lower-level” information (e.g., part-of-speech tags) in the earlier layers, and more “semantic” information in the later layers (Tenney et al., 2019). More recently, Dai et al. (2022) find that “knowledge neurons” are distributed in the topmost layers of the pretrained BERT model. Meng et al. (2022) show that factual knowledge can even be edited by manipulating a specific set of feedforward layers within an autoregressive LM. We propose to exploit this modular encoding of knowledge to amplify the factual knowledge in an LM through a contrastive decoding approach, where the output next-word probability is obtained from the difference in logits between a higher layer and a lower layer. By emphasizing the knowledge of higher layers and downplaying that of lower layers, we can potentially make LMs more factual and thus reduce hallucinations.

An illustration of this idea for a simple example is shown in Figure 1. While “Seattle” maintains high probability throughout all the layers (presumably because it is a syntactically plausible answer), the probability of the true answer “Olympia” increases after the higher layers inject more factual knowledge. Contrasting the differences between the different layers can thus reveal the true answer in this case. Based on this concept, we propose a new decoding method, Decoding by Contrasting Layers (DoLa), for better surfacing factual knowledge embedded in an LLM without retrieving external knowledge or additional fine-tuning.

Figure 1: Illustration of how an LLM progressively incorporates factual information along layers. While the next-word probabilities of “Seattle” remain similar throughout different layers, the probabilities of the correct answer “Olympia” gradually increase from lower to higher layers. DoLa uses this fact to decode by contrasting the difference between layers to sharpen an LLM’s probability towards factually correct outputs.

Experiments on TruthfulQA (Lin et al., 2022) and FACTOR (Muhlgay et al., 2023) demonstrate that DoLa is able to increase the truthfulness of the models of the LLaMA family (Touvron et al., 2023). Further experiments on chain-of-thought reasoning for StrategyQA (Geva et al., 2021) and GSM8K (Cobbe et al., 2021) also show that it can facilitate more factual reasoning. Finally, experiments using GPT-4 for open-ended chatbot evaluation (Chiang et al., 2023) show that, compared with the original decoding method, DoLa can generate informative and significantly more factual responses that lead to better ratings from GPT-4. From an efficiency perspective, we find that DoLa adds only a small latency to the decoding process, suggesting that it is a practical and useful decoding strategy for improving the truthfulness of LLMs.

2 Method

Recent language models consist of an embedding layer, $N$ stacked transformer layers, and an affine layer $\phi(\cdot)$ for predicting the next-word distribution. Given a sequence of tokens $\{x_1, x_2, \dots, x_{t-1}\}$, the embedding layer first embeds the tokens into a sequence of vectors $H_0 = \{h_1^{(0)}, \dots, h_{t-1}^{(0)}\}$. $H_0$ is then processed by each of the transformer layers successively. We denote the output of the $j$-th layer as $H_j$. Then, the vocabulary head $\phi(\cdot)$ predicts the probability of the next token $x_t$ over the vocabulary set $\mathcal{X}$,

$$p(x_t \mid x_{<t}) = \mathrm{softmax}\bigl(\phi(h_t^{(N)})\bigr)_{x_t}, \quad x_t \in \mathcal{X}.$$

Instead of applying $\phi$ on the final layer, our approach contrasts the higher-layer and lower-layer information to obtain the next-token probability. More specifically, for the $j$-th early layer, we also compute the next-token probability using $\phi(\cdot)$ as follows, where $\mathcal{J} \subset \{0, \dots, N-1\}$ is a set of candidate layers,

$$q_j(x_t \mid x_{<t}) = \mathrm{softmax}\bigl(\phi(h_t^{(j)})\bigr)_{x_t}, \quad j \in \mathcal{J}.$$
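The two equations above apply the same vocabulary head to hidden states taken from different depths. A minimal sketch in plain Python follows; the toy weights, hidden states, and the helper names `vocab_head` and `early_exit_distributions` are hypothetical and chosen only for illustration, and any normalization a real model applies before its head is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def vocab_head(hidden, weight, bias):
    """Affine map phi(h) = W h + b from hidden size to vocabulary size."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]

def early_exit_distributions(hidden_by_layer, weight, bias, candidate_layers):
    """Return {j: q_j} for each candidate layer j, reusing the same head."""
    return {j: softmax(vocab_head(hidden_by_layer[j], weight, bias))
            for j in candidate_layers}

# Toy setup (all numbers hypothetical): hidden size 2, vocabulary size 3, N = 2.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
hidden_by_layer = [[0.1, 0.2], [0.5, 0.1], [2.0, -1.0]]  # h^(0), h^(1), h^(2)

q = early_exit_distributions(hidden_by_layer, W, b, candidate_layers=[0, 1])
p = softmax(vocab_head(hidden_by_layer[-1], W, b))  # final-layer distribution
```

Because the head is shared, the early-exit distributions cost one extra matrix-vector product per candidate layer and no additional parameters.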

The idea of applying language heads directly to the hidden states of the middle layers, known as early exit (Teerapittayanon et al., 2016; Elbayad et al., 2020; Schuster et al., 2022), has proven to be effective even without a special training process (Kao et al., 2020), as the residual connections (He et al., 2016) in transformer layers make the hidden representations evolve gradually without abrupt changes. Using $q_j(x_t)$ to represent $q_j(x_t \mid x_{<t})$ for notational brevity, we then compute the probability of the next token by,

$$\hat{p}(x_t \mid x_{<t}) = \mathrm{softmax}\bigl(\mathcal{F}\bigl(q_N(x_t), q_M(x_t)\bigr)\bigr)_{x_t}, \quad \text{where} \quad M = \operatorname*{arg\,max}_{j \in \mathcal{J}} \; d\bigl(q_N(\cdot), q_j(\cdot)\bigr).$$

Here, layer $M$ is called the premature layer, while the final layer, i.e., layer $N$, is called the mature layer. The operator $\mathcal{F}(\cdot, \cdot)$, to be elaborated further in Section 2.3, contrasts the output distributions from the premature layer and the mature layer by computing the log-domain difference between the two distributions. The premature layer is dynamically selected in each decoding step using a distributional distance measure $d(\cdot, \cdot)$ (we use Jensen-Shannon Divergence) between the mature layer and all the candidate layers in $\mathcal{J}$. We discuss $d(\cdot, \cdot)$ in more detail in Section 2.2. The motivation for selecting the layer with the highest distance $d(\cdot, \cdot)$ is to ensure that the model significantly changes its output after the selected layer, and thus has a higher chance of including more factual knowledge that does not exist in the early layers before it.

Figure 2: JSD (scaled by $10^5$) between the final 32nd layer and even-numbered early layers. Column names are decoded tokens in each step. Row names are indices of the early layers. 0 means word embedding layer.

2.1 Factual Knowledge Evolves Across Layers

We conduct a preliminary analysis with 32-layer LLaMA-7B (Touvron et al., 2023) to motivate our approach. We compute the Jensen-Shannon Divergence (JSD) between the early exiting output distributions $q_j(\cdot \mid x_{<t})$ and the final layer output distribution $q_N(\cdot \mid x_{<t})$, to show how the early exiting outputs differ from the final layer outputs. Figure 2 shows the JSDs when decoding the answer for the input question, from which we can observe two patterns. Pattern #1 happens when predicting important named entities or dates, such as Wole Soyinka and 1986 in Figure 2, which require factual knowledge. We observe that the calculated JSD remains extremely high in the higher layers. This pattern indicates that the model is still changing its predictions in the last few layers, and potentially injecting more factual knowledge into the predictions. Pattern #2 happens when predicting function words, such as was, the, to, in, and the tokens copied from the input question, such as first Nigerian, Nobel Prize. When predicting these “easy” tokens, we can observe that the JSD becomes very small from the middle layers onward. This finding indicates that the model has already decided what token to generate in the middle layers, and keeps the output distributions almost unchanged in the higher layers. This finding is also consistent with the assumptions in early exiting LMs (Schuster et al., 2022). A preliminary analysis that quantitatively supports this observation is also shown in Appendix A.

Qualitatively, when the next-word prediction requires factual knowledge, LLaMA seems to change the predictions in the higher layers. Contrasting the layers before/after a sudden change may therefore amplify the knowledge emerging from the higher layers and make the model rely more on its factual internal knowledge. Moreover, this evolution of information seems to vary token by token. Our method requires accurately selecting the premature layer that contains plausible but less factual information, which may not always be the same early layer. Thus, we propose dynamic premature layer selection as illustrated in Figure 3.

Figure 3: The illustration of how dynamic premature layer selection works.

2.2 Dynamic Premature Layer Selection

To magnify the effectiveness of contrastive decoding, the optimal premature layer should ideally be the layer most different from the final-layer outputs. To allow for dynamic premature layer selection at each time step, we adopt the following measure of distance between the next-word distributions obtained from two layers,

$$d\bigl(q_N(\cdot \mid x_{<t}), q_j(\cdot \mid x_{<t})\bigr) = \mathrm{JSD}\bigl(q_N(\cdot \mid x_{<t}) \,\|\, q_j(\cdot \mid x_{<t})\bigr),$$

where $\mathrm{JSD}(\cdot, \cdot)$ is the Jensen-Shannon divergence. The premature layer, i.e., the $M$-th layer ($0 \leq M < N$), is then selected as the layer with the maximum divergence among the subset of early layers,

$$M = \arg\max_{j \in \mathcal{J}} \mathrm{JSD}\bigl(q_N(\cdot \mid x_{<t}) \,\|\, q_j(\cdot \mid x_{<t})\bigr),$$

where $\mathcal{J}$ is a set of candidate layers for premature layer selection. For LLaMA models with different numbers of layers, we divide the layers into 2 to 4 buckets of $\mathcal{J}$ based on their total number of layers, in order to focus on contrasting from a certain range of layers. The best bucket for each task is chosen using a validation set, as detailed in Section 3.1. This dynamic layer selection strategy enables the selection of suitable premature layers based on token difficulty, thereby making better use of the knowledge learned by different layers.
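The selection rule above can be sketched in a few lines of plain Python. The distributions below are hypothetical toy values over a three-token vocabulary, and the helper name `select_premature_layer` is illustrative rather than taken from the released code.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_premature_layer(q_final, q_by_layer):
    """Return the candidate layer whose distribution is farthest from q_final."""
    return max(q_by_layer, key=lambda j: jsd(q_final, q_by_layer[j]))

# Hypothetical next-token distributions for one decoding step.
q_final = [0.10, 0.80, 0.10]
q_by_layer = {
    0: [0.34, 0.33, 0.33],  # near-uniform embedding-level guess
    2: [0.70, 0.20, 0.10],  # confident, but on a different token
    4: [0.15, 0.75, 0.10],  # already close to the final layer
}
M = select_premature_layer(q_final, q_by_layer)
```

In this toy case layer 2 is selected: it disagrees most strongly with the final layer, whereas layer 4 has already converged and would contribute almost nothing to the contrast.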

Besides the dynamic layer selection strategy, a very simple alternative is to select the premature layer by running brute-force experiments on all possible early layers with a validation set and picking the layer with the best validation performance. We refer to this simple method as DoLa-static. However, DoLa-static has two drawbacks: 1) it requires more hyperparameter search runs over layers, and 2) the best layers are sensitive to the data distribution, thus requiring in-distribution validation sets. Our proposed dynamic layer selection strategy mitigates these drawbacks by shrinking the layer search space and making the method more robust without heavily relying on in-distribution validation sets. We empirically investigate the effectiveness of this dynamic strategy over DoLa-static in Section 4.1.

2.3 Contrasting the Predictions

Given the premature and mature layers obtained from Section 2.2, we aim to amplify mature layer outputs while downplaying premature layer outputs. Following the Contrastive Decoding approach from Li et al. (2022), we subtract the log probabilities of the premature layer outputs from those of the mature layer. We then use this resulting distribution as the next-word prediction, as illustrated in Figure 1,

$$\hat{p}(x_t \mid x_{<t}) = \mathrm{softmax}\bigl(\mathcal{F}\bigl(q_N(x_t), q_M(x_t)\bigr)\bigr)_{x_t}, \quad \text{where}$$

$$\mathcal{F}\bigl(q_N(x_t), q_M(x_t)\bigr) = \begin{cases} \log \dfrac{q_N(x_t)}{q_M(x_t)}, & \text{if } x_t \in \mathcal{V}_{\text{head}}(x_t \mid x_{<t}), \\ -\infty, & \text{otherwise.} \end{cases}$$

Similar to Li et al. (2022), the subset $\mathcal{V}_{\text{head}}(x_t \mid x_{<t}) \subseteq \mathcal{X}$ contains the tokens whose output probabilities from the mature layer are high enough,

$$\mathcal{V}_{\text{head}}(x_t \mid x_{<t}) = \bigl\{ x_t \in \mathcal{X} : q_N(x_t) \geq \alpha \max_w q_N(w) \bigr\}.$$

If the predicted probability of a token is too small in the mature layer, it is not likely to be a reasonable prediction, so we set the token probability to zero to minimize false positive and false negative cases. In the context of DoLa, a false positive means that an implausible token with an extremely low score may be rewarded with a high score after contrast, due to the unstable low-probability range on these implausible tokens across layers. A false negative means that when the model is very confident about an easy decision, the output probability of a high-score token does not change much across layers and results in a low score after contrast, so we need to force the model to still select from these high-score tokens in this case. This strategy is referred to as the adaptive plausibility constraint (APC), proposed in Li et al. (2022).
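The contrast and the plausibility constraint can be combined in a short sketch. The probabilities below are hypothetical values loosely modeled on the Seattle/Olympia example of Figure 1, the third token stands in for an implausible candidate, and both distributions are assumed to be strictly positive (as softmax outputs are).

```python
import math

def dola_scores(q_mature, q_premature, alpha=0.1):
    """F(q_N, q_M): log-ratio inside V_head, -inf for tokens outside it."""
    threshold = alpha * max(q_mature)
    return [math.log(qn / qm) if qn >= threshold else -math.inf
            for qn, qm in zip(q_mature, q_premature)]

def softmax(scores):
    """Softmax that maps -inf scores to exactly zero probability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["Seattle", "Olympia", "banana"]    # hypothetical toy vocabulary
q_mature = [0.55, 0.449, 0.001]             # final layer, q_N
q_premature = [0.70, 0.29999, 0.00001]      # selected early layer, q_M

p_hat = softmax(dola_scores(q_mature, q_premature, alpha=0.1))
p_unconstrained = softmax(dola_scores(q_mature, q_premature, alpha=0.0))

best = vocab[p_hat.index(max(p_hat))]
best_unconstrained = vocab[p_unconstrained.index(max(p_unconstrained))]
```

With the constraint, the token whose probability grew between the two layers wins even though the mature layer alone prefers "Seattle". Setting `alpha=0.0` disables the constraint and reproduces the false-positive case: the implausible token has the largest log-ratio only because its premature-layer probability is vanishingly small.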

Repetition Penalty. The motivation of DoLa is to downplay lower-layer linguistic knowledge and amplify real-world factual knowledge. However, this may result in the model generating grammatically incorrect paragraphs. Empirically, we do not observe such an issue, but we found that the resulting DoLa distribution sometimes has a higher tendency to repeat previously generated sentences (Xu et al., 2022), especially during generation of long sequences of chain-of-thought reasoning. We therefore include a simple repetition penalty introduced in Keskar et al. (2019) with $\theta = 1.2$ during decoding. The empirical analysis of the repetition penalty is shown in Appendix K.
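As a rough sketch of such a penalty, the scores of tokens that already appear in the generated prefix are discounted by $\theta$ before the softmax. The sign handling for non-positive scores below is a common implementation convention and an assumption here, as is the function name; the text does not specify how the penalty is combined with the DoLa scores.

```python
def apply_repetition_penalty(scores, generated_ids, theta=1.2):
    """Discount scores of tokens that already appear in the generated prefix."""
    penalized = list(scores)
    for i in set(generated_ids):
        s = penalized[i]
        # Dividing a negative score by theta > 1 would raise it, so multiply instead.
        penalized[i] = s / theta if s > 0 else s * theta
    return penalized

# Hypothetical pre-softmax scores; tokens 0 and 1 were generated earlier.
scores = [2.4, -1.0, 0.5]
penalized = apply_repetition_penalty(scores, generated_ids=[0, 1, 1], theta=1.2)
```

Tokens that have not been generated keep their scores, and a token is penalized once regardless of how many times it occurred.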

3 Experiments

3.1 Setup

Datasets. We consider multiple-choice and open-ended generation tasks. For multiple choice, we use TruthfulQA (Lin et al., 2022) and FACTOR (News/Wiki) (Muhlgay et al., 2023) to assess LMs’ factuality in short-answer/long-paragraph settings, respectively. For open-ended generation, we use TruthfulQA (rated by fine-tuned GPT-3) (Lin et al., 2022) and tasks involving chain-of-thought (Wei et al., 2022b) reasoning: StrategyQA (Geva et al., 2021) and GSM8K (Cobbe et al., 2021). Finally, we test Vicuna QA (Chiang et al., 2023), which uses GPT-4 to evaluate instruction-following abilities as chatbot assistants.

Models and Baselines. We examine four sizes of LLaMA models (Touvron et al., 2023) (7B, 13B, 33B, 65B) and compare them with three baselines: 1) original decoding (greedy decoding or sampling depending on the tasks), 2) Contrastive Decoding (CD) (Li et al., 2022), where LLaMA-7B serves as the amateur model and LLaMA-13B/33B/65B act as expert models, and 3) Inference Time Intervention (ITI). ITI uses LLaMA-7B and a linear classifier trained on TruthfulQA. Our experiment focuses on contrasting layer differences in DoLa and model differences in CD, without additional techniques, such as limiting the context window for the premature layer or the amateur model, to keep our setting clean. We set the adaptive plausibility constraint ($\alpha$) to 0.1 and the repetition penalty ($\theta$) to 1.2 as per prior studies (Li et al., 2022; Keskar et al., 2019).

Candidate Layers. In dynamic premature layer selection, we partition transformer layers into buckets and select one bucket as candidate layers ($\mathcal{J}$). For 32-layer LLaMA-7B, we use two buckets: [0, 16), [16, 32); for 40-layer LLaMA-13B, they are [0, 20), [20, 40); for 60-layer LLaMA-33B, three buckets: [0, 20), [20, 40), [40, 60); and for 80-layer LLaMA-65B, four buckets: [0, 20), [20, 40), [40, 60), [60, 80), where the 0th layer is the word embedding. This design limits the hyperparameter search space to only 2-4 validation runs. For efficiency, only even-indexed layers (0th, 2nd, etc.) are considered as candidates. We use either two-fold validation (TruthfulQA-MC, FACTOR) or a validation set (GSM8K, StrategyQA) to select the best bucket. For Vicuna QA, which lacks a validation set, we use GSM8K’s best bucket.
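The bucket layout described above can be reproduced with a small helper. This is a sketch that assumes equal-sized buckets, which holds for all four configurations listed; the function name `candidate_buckets` is hypothetical.

```python
def candidate_buckets(num_layers, num_buckets):
    """Split layers [0, num_layers) into equal half-open ranges, keeping even indices."""
    size = num_layers // num_buckets
    return [list(range(k * size, (k + 1) * size, 2)) for k in range(num_buckets)]

# (number of layers, number of buckets) per model size, as listed in the text.
configs = {"7B": (32, 2), "13B": (40, 2), "33B": (60, 3), "65B": (80, 4)}
buckets = {name: candidate_buckets(n, k) for name, (n, k) in configs.items()}

# Number of even-indexed candidate layers per model across all buckets.
num_candidates = {name: sum(len(b) for b in bs) for name, bs in buckets.items()}
```

For the 7B model this yields the even layers 0 to 14 and 16 to 30. The total number of even-indexed candidates ranges from 16 (7B) to 40 (65B), which matches the number of runs a per-layer search such as DoLa-static would need, compared with 2 to 4 runs when searching over buckets.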

Table 1: Experimental results on 1) multiple-choice datasets: TruthfulQA and FACTOR, and 2) open-ended generation tasks: TruthfulQA and Chain-of-Thought (CoT) reasoning tasks, including StrategyQA (StrQA) and GSM8K. %T∗I stands for %Truth∗Info in TruthfulQA.

3.2 Multiple Choices

Short-Answer Factuality. We test TruthfulQA with the default QA prompt from Lin et al. (2022) and Li et al. (2023). In the APC, we replace $-\infty$ with $-1000$ to avoid ruining LM likelihood scores, which also applies to FACTOR. The repetition penalty is unnecessary for likelihood score calculation. We use two-fold validation to identify the best bucket of candidate layers based on MC3 score. Results in Table 1 show significant performance improvement for LLaMA models in four sizes, outperforming ITI/CD and confirming the effectiveness of DoLa. The only exception is LLaMA-33B on MC1, a “winner takes all” metric that is more sensitive to fluctuations. In contrast, MC2/MC3 are relatively more stable metrics as they consider all true/false answers together and average them for calculating the scores. The higher layers are consistently chosen in two-fold validation: [16, 32) for 7B, [20, 40) for 13B, [40, 60) for 33B, and [60, 80) for 65B. Implementation details and extra results of contrasting with the 0-th layer / all layers are shown in Appendix C.
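One plausible reading of why $-\infty$ would ruin likelihood scores is sketched below: if any token of a candidate answer falls outside the plausible set, the summed log-likelihood of that answer collapses to $-\infty$, so such answers can no longer be ranked against each other. The distributions, answer token ids, and the helper name `answer_score` are hypothetical, and the scoring shown is an assumption rather than the evaluation code used for the reported results.

```python
import math

def log_softmax(scores):
    """Log-probabilities from raw scores; -inf scores stay -inf."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def contrast(q_mature, q_premature, alpha=0.1, floor=-math.inf):
    """Log-ratio for plausible tokens, `floor` for tokens outside V_head."""
    threshold = alpha * max(q_mature)
    return [math.log(qn / qm) if qn >= threshold else floor
            for qn, qm in zip(q_mature, q_premature)]

def answer_score(steps, answer_ids, floor):
    """Sum of per-token log-probabilities of one candidate answer."""
    return sum(log_softmax(contrast(qn, qm, floor=floor))[tok]
               for (qn, qm), tok in zip(steps, answer_ids))

# Two decoding steps over a 3-token vocabulary: (q_mature, q_premature).
steps = [([0.60, 0.39, 0.01], [0.50, 0.40, 0.10]),
         ([0.70, 0.20, 0.10], [0.40, 0.30, 0.30])]
answer_a, answer_b = [2, 0], [2, 1]  # both start with an implausible token

a_inf = answer_score(steps, answer_a, floor=-math.inf)
b_inf = answer_score(steps, answer_b, floor=-math.inf)
a_fin = answer_score(steps, answer_a, floor=-1000.0)
b_fin = answer_score(steps, answer_b, floor=-1000.0)
```

With the $-\infty$ floor both answers tie at $-\infty$; with the finite floor both scores stay finite, and answer A ranks above answer B because its second token is favored by the contrast.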

Long-Paragraph Factuality. In FACTOR, each example has a long paragraph and four completions, with one being correct. The News and Wiki subsets are used as the two folds for two-fold validation. Table 1 shows DoLa outperforms baselines by 2-4%, and is more effective than CD, except for 13B on Wiki. The chosen candidate layers are consistently the lower parts for FACTOR: [0, 16) for 7B and [0, 20) for 13B/33B/65B. This differs from TruthfulQA, which selects higher layers. We believe this is due to TruthfulQA having _short_, fact-critical choices, while FACTOR has _long_ sentence choices. As noted in Section 2.1, contrasting with higher layers works better for key facts, while contrasting with lower layers handles all tokens better when many of them are non-fact tokens that do not need to be contrasted with higher layers.

Figure 4: Vicuna QA results of LLaMA vs LLaMA+DoLa, judged by GPT-4. Left: Total scores. Right: Win/tie/loss counts of LLaMA+DoLa compared against LLaMA.

3.3 Open-Ended Text Generation

Short-Answer Factuality. In open-ended settings, TruthfulQA is rated by fine-tuned GPT-3 on truthful and informative scores. A 100% truthful score is easily achievable by answering “I have no comment”, but this results in a 0% informative score. We use the default QA prompt as in Lin et al. (2022) and Li et al. (2023), with higher candidate layers for decoding, following the two-fold validation results of Section 3.2. Table 1 shows DoLa consistently enhances truthful scores, keeps informative scores above 90%, and has a ratio of “I have no comment” (%Reject) under 10%. It improves the overall (%Truth∗Info) scores by 12-17% across four models, reaching the performance level of ITI, which relies on supervised training with labels.

CD boosts truthfulness but often refuses to answer, generating “I have no comment” over 60% of the time for the LLaMA-33B model, thus lowering its %Truth∗Info score. We suspect this is because CD uses LLaMA-7B for contrast, and a big difference between the two models is that 33B is better at instruction-following than 7B. This would explain why CD frequently answers “I have no comment”, as this response is indicated in the instruction prompt. Our method consistently outperforms CD in final %Truth∗Info scores.

Chain-of-Thought Reasoning. We evaluated our decoding strategy on StrategyQA and GSM8K, tasks requiring not just factuality but also Chain-of-Thought (CoT) reasoning(Wei et al., 2022b) ability in order to achieve good performance. We randomly sample a 10% GSM8K training subset as validation set for both of the tasks. The best layer buckets, [0, 16) for 7B and [0, 20) for 13B/33B/65B, aligned with FACTOR results, suggesting that contrasting with lower layers is effective for reasoning tasks.

• StrategyQA requires multi-hop CoT reasoning (Wei et al., 2022b). In Table 1, DoLa boosts accuracy by 1-4% for four models, while CD mostly worsens it, implying that contrasting a large LM with the 7B LM, which has a certain level of reasoning ability, can impair the reasoning ability of large LMs. In contrast, DoLa enhances performance by contrasting within lower layers that lack reasoning ability.
• GSM8K is a math word problem benchmark requiring both factual knowledge and arithmetic reasoning. Table 1 shows a 2% accuracy improvement for most LLaMA sizes, except 7B. This suggests that even when arithmetic reasoning is required, contrasting layers with DoLa is still helpful. In Appendix B we show an additional study on improving CD using smaller amateur models, which still falls behind DoLa.

Instruction Following. Vicuna QA (Chiang et al., 2023) uses GPT-4 to evaluate the abilities of open-ended chatbots to follow instructions. Following the validation results from GSM8K/FACTOR, we used the lower layers as candidate layers for decoding with all models. Pairwise comparisons rated by GPT-4 are in Figure 4, showing that DoLa notably outperforms the baseline, especially in the 13B and 33B models, indicating that DoLa is effective even in open-ended chatbot scenarios. Examples of qualitative studies are shown in Appendix M.

4 Analysis

4.1 Premature Layer Selection Strategy

We introduce a variant of DoLa, DoLa-static, which selects a constant layer for contrasting throughout the decoding process. We show some of the results on the GSM8K validation sets in Figure 5, and on FACTOR in Figure 6 in Appendix H, by enumerating the DoLa-static results from all the layers.

In Figure 5 (left), DoLa-static performs better by contrasting lower layers. Some “optimal” layers, like the 10th layer, even outperform DoLa. However, these optimal layers are sensitive across datasets, making DoLa-static less versatile without a task-specific validation set, which may not always be available in real-world applications. For example, when we randomly sample another 10% GSM8K subset (Figure 5, right), DoLa-static shows varying optimal layers across the two 10% GSM8K subsets. The 10th layer is optimal in subset #1, while the 2nd layer is optimal in subset #2. Using subset #1’s optimal layer for subset #2 decreases its performance, highlighting DoLa-static’s sensitivity to the fixed layer choice. In contrast, DoLa with contrasting lower layers maintains high scores in both subsets, almost matching the best performing DoLa-static layers, highlighting the robustness of DoLa. Additionally, DoLa simplifies the hyperparameter search space: it needs only 2-4 bucket tests, almost 10x fewer than the 16-40 tests needed in DoLa-static.

We include another analysis on the optimality of our dynamic layer selection strategy in Appendix J. Specifically, we include a random layer selection baseline, showing that the random selection strategy performs even worse than the original baseline, which demonstrates that it is essential to apply our JSD-based layer selection strategy.
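The JSD-based selection and the random baseline discussed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: it assumes each candidate layer's hidden state has already been projected to vocabulary logits, and the helper names (`select_premature_layer`, `random_premature_layer`) and the toy logits are hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert a list of logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for distributions given as lists; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_premature_layer(final_logits, candidate_logits):
    """Pick the candidate layer whose distribution is farthest (in JSD) from the final layer.

    candidate_logits maps a layer index to that layer's vocabulary logits (hypothetical layout).
    """
    q_final = softmax(final_logits)
    return max(candidate_logits, key=lambda j: jsd(q_final, softmax(candidate_logits[j])))

def random_premature_layer(candidate_logits, rng):
    """Random baseline: ignore the distributions and pick any candidate layer."""
    return rng.choice(sorted(candidate_logits))

# Toy example with a 3-token vocabulary (values are made up for illustration).
final_logits = [4.0, 1.0, 0.0]
candidate_logits = {0: [0.0, 0.0, 0.0], 8: [3.5, 1.0, 0.0]}
chosen = select_premature_layer(final_logits, candidate_logits)
```

In this toy case layer 0 is chosen, because its uniform distribution differs most from the final layer, whereas layer 8 already resembles the final prediction.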

Figure 5: LLaMA-7B on GSM8K validation sets with DoLa/DoLa-static using different premature layers. Left: subset #1. Right: subset #2.

4.2 Latency & Throughput

The greedy decoding latency in Table 2 shows that DoLa increases the decoding time by factors of 1.01 to 1.08, suggesting that DoLa can be widely applied with negligible cost. The memory analysis and inference details are shown in Appendices E and F.

Table 2: Decoding latency (ms/token) and throughput (token/s).

4.3 Qualitative Study

In Table 3, we show TruthfulQA examples generated deterministically via greedy decoding from LLaMA-33B, with truth/info scores by fine-tuned GPT-3. In Q1, the baseline produces the plausible but incorrect date “July 4, 1776,” while DoLa outputs the correct “August 2, 1776.” In Q2, the baseline offers the false concept of “wait 24 hours,” countered by DoLa’s truthful response, showing that DoLa can avoid generating false information. Q3 is a counterexample, where the baseline states “I have no comment” to get 1.0/0.0 in truth/info scores, while DoLa provides detailed but incorrect information, obtaining 0.0/1.0 in truth/info scores. More examples of TruthfulQA and long responses of Vicuna QA are in Appendices L and M.

Besides the qualitative study, we also evaluate the text generation quality of DoLa by leveraging GPT-4, and the results are shown in Appendix D. We also apply DoLa to a non-LLaMA model, MPT-7B (MosaicML, 2023), and show in Appendix G that DoLa can still improve performance for LLMs beyond LLaMA.

Table 3: Qualitative study using LLaMA-33B baseline vs LLaMA-33B+DoLa on TruthfulQA.

5 Related Work

Hallucinations in LLMs. Hallucinations in LLMs refer to generated content not based on training data or facts, caused by various factors like imperfect learning and decoding (Ji et al., 2023). Ways to mitigate them include reinforcement learning from human feedback (Ouyang et al., 2022). Recent strategies involve inference-time self-consistency checks (Manakul et al., 2023), multi-agent debating (Du et al., 2023; Liang et al., 2023), and inference-time intervention using human labels (Li et al., 2023).

NLP Pipeline in Transformer. A study by Tenney et al. (2019) notes that BERT mimics the classical NLP pipeline: early layers manage syntax while later ones handle semantics. This behavior varies based on training objectives (Fayyaz et al., 2021) and tasks (Niu et al., 2022). Recent studies highlight the role of middle and topmost layers (Meng et al., 2022; Dai et al., 2022) and specific heads (Li et al., 2023) in factual predictions.

Contrastive Decoding. Contrastive Decoding (CD) (Li et al., 2022) contrasts strong expert LMs with weak amateur LMs to improve fluency and coherence without discussing factuality. CD selects smaller LMs as amateur LMs, and it is crucial to select suitable sizes for them. DoLa dynamically selects appropriate early layers based on token complexity, avoiding the need for training and for the smaller LMs used in CD. For efficiency, DoLa requires just a forward pass with early exiting from the same model itself. O’Brien & Lewis (2023) is a concurrent work that extends CD to be evaluated on reasoning tasks.
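The contrast shared by CD and DoLa can be illustrated with a short sketch. It assumes the usual formulation in which a token's score is the log-probability gap between the expert and the amateur, restricted to tokens the expert itself finds plausible; in DoLa the "expert" would be the final layer and the "amateur" the selected early layer of the same model. The function name, the threshold `alpha=0.1`, and the toy probabilities are assumptions for illustration only.

```python
import math

def contrast_scores(expert_probs, amateur_probs, alpha=0.1):
    """Score each token by log p_expert - log p_amateur.

    Tokens whose expert probability is below alpha * max(expert_probs) are
    masked with -inf, so an implausible token cannot win just because the
    amateur assigns it an even smaller probability.
    """
    threshold = alpha * max(expert_probs)
    scores = []
    for p_e, p_a in zip(expert_probs, amateur_probs):
        if p_e >= threshold:
            scores.append(math.log(p_e) - math.log(p_a))
        else:
            scores.append(float("-inf"))
    return scores

# Toy 4-token vocabulary (made-up numbers).
expert = [0.60, 0.30, 0.09, 0.01]
amateur = [0.70, 0.10, 0.10, 0.10]
scores = contrast_scores(expert, amateur)
best = max(range(len(scores)), key=scores.__getitem__)
```

Here token 1 wins: the expert raises its probability relative to the amateur, while token 0, although most likely under both, gains nothing from the contrast.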

Following the concept of CD, Shi et al. (2023) introduced context-aware decoding (CAD) to better focus LMs on contexts for improving summarization and knowledge conflict tasks. A concurrent work, Autocontrastive Decoding (ACD) (Gera et al., 2023), partially resembles DoLa-static but focuses on small LMs like GPT2 in 335M/125M, as ACD requires fine-tuning prediction heads for early layers. Unlike DoLa, which targets factuality, ACD aims to enhance diversity and coherence in small LMs. Interestingly, while the authors reveal in their limitation section that ACD increases hallucinations, DoLa instead reduces them. We attribute the discrepancy to model sizes, as our experiments in Appendix N suggest that contrasting layers in a small GPT2 cannot improve factuality. Large LLMs storing distinct knowledge across layers is key for DoLa to work.

6 Conclusion and Limitations

In this paper, we introduce Decoding by Contrasting Layers (DoLa), a novel decoding strategy aimed at reducing hallucinations in LLMs. Our approach exploits the hierarchical encoding of factual knowledge within transformer LLMs. Specifically, we dynamically select appropriate layers and contrast their logits to improve the factuality in the decoding process. Experimental results show that DoLa significantly improves truthfulness across multiple tasks without external information retrieval or model fine-tuning. Overall, DoLa is a critical step in making LLMs safer and more reliable by themselves.

DoLa also has limitations: 1) Focusing on factuality: We have not explored DoLa in other dimensions such as reinforcement learning from human feedback (Ouyang et al., 2022). 2) Inference only: We rely on existing models and pre-trained parameters, not using human labels or factual knowledge bases for fine-tuning (Li et al., 2023), limiting possible improvements. 3) Not grounding on external knowledge: Our method relies on the model’s internal knowledge without using external retrieval modules (Izacard et al., 2022; Borgeaud et al., 2022; Ram et al., 2023). Thus, it cannot correct misinformation acquired during training. However, since our method provides a foundational improvement that could potentially be applied to any transformer-based LLM, the limitations listed above could potentially be addressed through future work combining the corresponding elements with our decoding strategy.

Acknowledgements

We thank all the anonymous reviewers for their helpful discussions and insightful feedback. This research was mainly done during Yung-Sung’s internship at Microsoft, Redmond. Yung-Sung is sponsored by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation herein.

References

Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pp. 2206–2240. PMLR, 2022.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H.Larochelle, M.Ranzato, R.Hadsell, M.F. Balcan, and H.Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Chiang & Lee (2023a) Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937, 2023a.
Chiang & Lee (2023b) Cheng-Han Chiang and Hung-yi Lee. A closer look into automatic evaluation using large language models. arXiv preprint arXiv:2310.05657, 2023b.
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
Dai et al. (2022) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8493–8502, 2022.
Du et al. (2023) Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.
Elbayad et al. (2020) Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. In ICLR 2020-Eighth International Conference on Learning Representations, pp. 1–14, 2020.
Fayyaz et al. (2021) Mohsen Fayyaz, Ehsan Aghazadeh, Ali Modarressi, Hosein Mohebbi, and Mohammad Taher Pilehvar. Not all models localize linguistic knowledge in the same place: A layer-wise probing on bertoids’ representations. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 375–388, 2021.
Geng & Liu (2023) Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama, May 2023. URL https://github.com/openlm-research/open_llama.
Gera et al. (2023) Ariel Gera, Roni Friedman, Ofir Arviv, Chulaka Gunasekara, Benjamin Sznajder, Noam Slonim, and Eyal Shnarch. The benefits of bad advice: Autocontrastive decoding across model layers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 10406–10420, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.580. URL https://aclanthology.org/2023.acl-long.580.
Geva et al. (2021) Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
Izacard et al. (2022) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022.
Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
Kao et al. (2020) Wei-Tsung Kao, Tsung-Han Wu, Po-Han Chi, Chun-Cheng Hsieh, and Hung-Yi Lee. Bert’s output layer recognizes all hidden layers? some intriguing phenomena and a simple way to boost bert. arXiv preprint arXiv:2001.09309, 2020.
Keskar et al. (2019) Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019.
Li et al. (2023) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. arXiv preprint arXiv:2306.03341, 2023.
Li et al. (2022) Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097, 2022.
Liang et al. (2023) Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023.
Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214–3252, 2022.
Liu et al. (2023) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023.
Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023.
Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36, 2022.
MosaicML (2023) NLP Team MosaicML. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. URL www.mosaicml.com/blog/mpt-7b. Accessed: 2023-05-05.
Muhlgay et al. (2023) Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, and Yoav Shoham. Generating benchmarks for factuality evaluation of language models. arXiv preprint arXiv:2307.06908, 2023.
Niu et al. (2022) Jingcheng Niu, Wenjie Lu, and Gerald Penn. Does bert rediscover a classical nlp pipeline? In Proceedings of the 29th International Conference on Computational Linguistics, pp. 3143–3153, 2022.
O’Brien & Lewis (2023) Sean O’Brien and Mike Lewis. Contrastive decoding improves reasoning in large language models. arXiv preprint arXiv:2309.09117, 2023.
OpenAI (2022) OpenAI. Introducing chatgpt, November 2022. URL https://openai.com/blog/chatgpt.
OpenAI (2023) OpenAI. Gpt-4 technical report. 2023. URL https://cdn.openai.com/papers/gpt-4.pdf.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Ram et al. (2023) Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083, 2023.
Sang & De Meulder (2003) Erik Tjong Kim Sang and Fien De Meulder. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147, 2003.
Schuster et al. (2022) Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling. Advances in Neural Information Processing Systems, 35:17456–17472, 2022.
Shi et al. (2023) Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen-tau Yih. Trusting your evidence: Hallucinate less with context-aware decoding. arXiv preprint arXiv:2305.14739, 2023.
Teerapittayanon et al. (2016) Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. Branchynet: Fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2464–2469. IEEE, 2016.
Tenney et al. (2019) Ian Tenney, Dipanjan Das, and Ellie Pavlick. Bert rediscovers the classical nlp pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4593–4601, 2019.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Wei et al. (2022a) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022a.
Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022b.
Xia et al. (2023) Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.
Xu et al. (2022) Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. Learning to break the loop: Analyzing and mitigating repetitions for neural text generation. Advances in Neural Information Processing Systems, 35:3082–3095, 2022.

Appendix A Preliminary Quantitative Study to support Figure2

We include an additional study to quantitatively support the claim we made from the observation in Figure 2. We use the validation set of the CoNLL-2003 named entity recognition dataset (Sang & De Meulder, 2003) with 3.25K examples (https://huggingface.co/datasets/conll2003). We calculate which layer has the largest JS-divergence with the final layer when LLaMA-7B predicts the next token with teacher forcing (we simply call this layer the “critical layer” for short). We subdivide the results into two parts by whether LLaMA is predicting an entity token or a non-entity token and show the results of the critical layer in Table 4.

From Table 4, we find that the critical layer is layer 0 75% of the time when predicting non-entity tokens. When predicting entity tokens, on the other hand, the critical layer is layer 0 only 35% of the time, while more than 50% of the time it is at a higher layer. This experiment quantitatively supports our observations in Figure 2.

Note that we use teacher forcing to send the ground truth into LLaMA to predict the next word for each token in the sentence. And the ground truth sentences are not generated by LLaMA. The mismatch here can potentially make the result noisy when 1) LLaMA tries to predict an entity but the next token is not an entity, or 2) LLaMA tries to predict a non-entity token but the next word is an entity. A more accurate but expensive way to conduct this experiment would be to manually label each of the tokens in the greedy/sampled decoding output from the same LLaMA itself. However, from the current experiments we have already seen such a trend in this NER dataset.

Table 4: The distribution of critical layer in LLaMA-7B using the CoNLL 2003 NER dataset.

Appendix B Exploration in Contrastive Decoding Baseline: GSM8K

We explore the possibility of using smaller amateur models for contrastive decoding (CD) (Li et al., 2022) to create better baselines. We experiment with OpenLLaMA (Geng & Liu, 2023) and Sheared-LLaMA (Xia et al., 2023) models of sizes 7B, 3B, 2.7B, and 1.3B. The results are shown in Table 5. We can see that using a small amateur LM, especially the 1.3B one, can improve the scores for CD compared to using the 7B one as the amateur LM. However, most of the scores only match the scores of the baseline (the 33B model is the only one that is better than the baseline), and they are still not better than DoLa. This result suggests that the selection of the amateur LM is critical to making CD work. We explored many different amateur LMs but still could not obtain significant improvements from CD.

Table 5: Exploration of the contrastive decoding baselines with different sizes of amateur models on GSM8K.

Appendix C TruthfulQA Details & Scores for Contrasting with the Word Embedding Layer / All Layers

When implementing DoLa for TruthfulQA, we found that not applying the softmax function on top of $\mathcal{F}$ (defined in Section 2) can make the performance even better, as shown in Table 6, so we stuck with this implementation for (and only for) the TruthfulQA multiple choices setting. However, both implementations (with and without softmax) are much better than the baseline scores. We did not observe the same phenomenon on other datasets.

Table 6: The scores of DoLa on the TruthfulQA multiple choices setting with and without post-softmax applied on top of $\mathcal{F}$ (defined in Section 2).

We also include an analysis of applying two variants of DoLa to TruthfulQA: 1) only contrasting with the word embedding (0-th) layer, and 2) contrasting with all the early even-numbered layers dynamically. The results are shown in Table 7. We can see that both variants lead to performance improvements, but they still fall behind our proposed DoLa.

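| Method | LLaMA-7B MC1 | LLaMA-7B MC2 | LLaMA-7B MC3 | LLaMA-13B MC1 | LLaMA-13B MC2 | LLaMA-13B MC3 |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 25.6 | 40.6 | 19.2 | 28.3 | 43.3 | 20.8 |
| DoLa 0-th layer | 31.6 | 61.7 | 30.1 | 28.5 | 62.3 | 30.2 |
| DoLa all layers | 32.0 | 63.9 | 31.2 | 30.5 | 62.3 | 31.0 |
| DoLa | 32.2 | 63.8 | 32.1 | 28.9 | 64.9 | 34.8 |

| Method | LLaMA-33B MC1 | LLaMA-33B MC2 | LLaMA-33B MC3 | LLaMA-65B MC1 | LLaMA-65B MC2 | LLaMA-65B MC3 |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 31.7 | 49.5 | 24.2 | 30.8 | 46.9 | 22.7 |
| DoLa 0-th layer | 31.4 | 61.1 | 31.1 | 31.0 | 63.6 | 31.2 |
| DoLa all layers | 29.1 | 61.5 | 30.7 | 30.5 | 62.0 | 31.7 |
| DoLa | 30.5 | 62.3 | 34.0 | 31.1 | 64.6 | 34.3 |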

Table 7: The scores on TruthfulQA of DoLa contrasting with the 0-th (word embedding) layer and all the early even-numbered layers.

Appendix D GPT-4 Evaluation on Text Generation Quality

We conduct an additional study of the quality of generated text using GPT-4, given that several prior studies (Chiang & Lee, 2023a; Liu et al., 2023) have shown the great potential of GPT-4 to serve as an alternative to human evaluation, and that the effect is stable over different prompts and instructions (Chiang & Lee, 2023b).

We adopt the pairwise evaluation code from Vicuna QA (https://github.com/lm-sys/vicuna-blog-eval/tree/main/eval). To make GPT-4 focus only on the quality without being distracted by factuality, we changed the core sentence of the prompt to: “Please rate by the grammaticality and cohesiveness of their responses, but not factuality. You are not required to verify the factual accuracy of the answers. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better quality.”

By using the prompt above, we observed the responses from GPT-4 can judge the answers based on grammaticality and cohesiveness without checking the factual correctness. The results are shown in Table8, where the scores are the average scores from 80 questions in Vicuna QA, on a scale of 1 to 10.

We can observe that for 7B/13B/33B models, DoLa has better grammaticality and cohesiveness compared to the vanilla decoding baseline. For the largest 65B model, DoLa achieves a score that is almost the same as vanilla decoding. We conclude that when evaluating text generation quality without considering factuality, DoLa is still on par with (65B) or better than (7B/13B/33B) vanilla decoding.

Table 8: GPT-4 evaluation of text generation quality on a scale of 1 to 10, averaged over the 80 examples in Vicuna QA.

Appendix E Memory Overhead

To measure the overhead, we calculate (a) the occupied GPU memory before the first forward pass and (b) the peak GPU memory during the forward passes. We then compute the memory overhead as $(b)-(a)$, or the proportion of overhead $\frac{(b)-(a)}{(a)}$ in %. For 13B/33B/65B, which require 2/4/8 GPUs, the total memory is accumulated over all the GPUs. The results are shown in Table 9.

We can see that during the forward pass of LLaMA-7B, the overhead for vanilla decoding is 2.5% while DoLa requires 3.6%, a difference of only 1.1% in memory overhead between vanilla decoding and DoLa. For the 13B/33B/65B models, the difference is even smaller than 1%. This result shows that the difference in memory overhead between DoLa and the vanilla decoding baseline is negligible.
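The overhead computation described in this appendix is simple enough to write down directly. The sketch below is illustrative only: the function name and the memory readings are hypothetical, and the text does not state which tool was used to obtain (a) and (b). With PyTorch, one plausible choice would be `torch.cuda.memory_allocated()` before the first forward pass and `torch.cuda.max_memory_allocated()` afterwards, summed over all GPUs.

```python
def memory_overhead(before, peak):
    """Return (absolute overhead, overhead as a percentage of the starting memory).

    before: occupied GPU memory before the first forward pass, i.e. (a)
    peak:   peak GPU memory during the forward passes, i.e. (b)
    Both values must use the same unit and, for multi-GPU runs, be summed over devices.
    """
    absolute = peak - before
    return absolute, 100.0 * absolute / before

# Hypothetical readings in MiB, not taken from Table 9.
absolute, percent = memory_overhead(before=10_000, peak=10_400)
```

With these made-up readings the overhead is 400 MiB, or 4% of the memory occupied before decoding.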

Table 9: Memory overhead of inference for 4 LLaMA models.

Appendix F Inference Details

We run all the experiments with NVIDIA V100 GPUs on machines equipped with 40-core Intel(R) Xeon(R) Platinum 8168 CPUs @ 2.70GHz. We use the Huggingface Transformers package (https://github.com/huggingface/transformers) to conduct experiments. When decoding responses from the language models, we use greedy decoding for TruthfulQA, StrategyQA, and GSM8K. For the Vicuna QA Benchmark, we use random sampling with temperature 0.7 and max new tokens 1024 to generate the responses.
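As a rough illustration, the two decoding settings above can be written as keyword arguments for the Transformers `generate` method. This is a sketch rather than the authors' configuration: the dictionary names are hypothetical, and only the greedy/sampling choice, the temperature, and the token limit come from the text.

```python
# Greedy decoding, as used for TruthfulQA, StrategyQA, and GSM8K.
GREEDY_KWARGS = {
    "do_sample": False,  # always pick the highest-scoring token
}

# Random sampling, as used for the Vicuna QA benchmark.
VICUNA_SAMPLING_KWARGS = {
    "do_sample": True,
    "temperature": 0.7,
    "max_new_tokens": 1024,
}

def generation_kwargs(task):
    """Return a copy of the decoding settings for a task name (names are illustrative)."""
    if task == "vicuna_qa":
        return dict(VICUNA_SAMPLING_KWARGS)
    return dict(GREEDY_KWARGS)

# Usage, assuming `model` and `inputs` were created elsewhere with Transformers:
#   output_ids = model.generate(**inputs, **generation_kwargs("vicuna_qa"))
```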

For the latency and throughput analysis in Section 4.2, we use the 817 examples from TruthfulQA with the default 6-shot in-context demonstration prompt, which has an average input length of 250.3 after concatenating the prompt with the questions. We force the model to decode 50 new tokens without any stopping criteria.

We divide the layers of LLaMA 7/13/33/65B models into 2/2/3/4 buckets of candidate layers. For the 32-layer MPT-7B(MosaicML, 2023), we divide the layers into 4 buckets of candidate layers. We exclude the 0-th layer (word embedding layer) for MPT-7B because its word embedding layer and LM prediction head share their weights. Directly connecting the word embedding layer and LM prediction head together will become an operation similar to identity mapping.
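The bucket layout can be sketched as follows. Two assumptions go beyond the paragraph above and are labeled in the code: the layer counts (32/40/60/80 for LLaMA 7B/13B/33B/65B, taken from the public model configurations) and the restriction of candidates to even-numbered layers, which follows the "early even-numbered layers" wording in Appendix C. The function name is hypothetical.

```python
def make_buckets(num_layers, num_buckets, step=2):
    """Split layers [0, num_layers) into equal-width buckets of candidate layers.

    step=2 keeps only even-numbered layers (assumption based on Appendix C).
    """
    if num_layers % num_buckets != 0:
        raise ValueError("num_layers must be divisible by num_buckets")
    width = num_layers // num_buckets
    return [list(range(start, start + width, step))
            for start in range(0, num_layers, width)]

# (number of layers, number of buckets); layer counts are an assumption, see above.
LLAMA_SETUP = {"7B": (32, 2), "13B": (40, 2), "33B": (60, 3), "65B": (80, 4)}
buckets = {size: make_buckets(*cfg) for size, cfg in LLAMA_SETUP.items()}

# Search-space comparison: DoLa validates one bucket at a time, whereas
# DoLa-static has to validate every candidate layer individually.
static_tests = {size: sum(len(b) for b in bs) for size, bs in buckets.items()}
```

Under these assumptions the first bucket is [0, 16) for 7B and [0, 20) for the larger models, and the number of individual candidate layers ranges from 16 to 40, consistent with the search-space comparison in Section 4.1.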

The following table summarizes the best bucket selected by the validation set. For TruthfulQA and FACTOR, although we conduct two-fold validation, the buckets selected by the two folds are consistently the same.

Table 10: Best Bucket Selected by Validation Set

Appendix G Non-LLaMA Model

To check if DoLa works beyond LLaMA models, we tested MPT-7B (MosaicML, 2023). Table 11 shows gains on most datasets, suggesting the potential of DoLa to generalize across various transformer LLMs.

Table 11: Experiments of DoLa with MPT-7B.

Appendix H Static vs Dynamic Premature Layer Selection on FACTOR

In Figure 6, we show additional examples on FACTOR-News to compare the performance of DoLa and DoLa-static for the four LLaMA models.

Figure 6: DoLa vs DoLa-static with different premature layers on FACTOR-News. Panels: (a) LLaMA-7B, (b) LLaMA-13B, (d) LLaMA-65B.

Appendix I Scores for DoLa-static with Validation Selected Premature Layers

Besides the visualized comparisons, we also compare the scores of DoLa and DoLa-static in Tables 12, 13, and 14. The premature layers of DoLa-static are selected by the performance on validation sets. In the two-fold validation setting, we report both of the selected layers in the tables (Val Selected Layer).

We can observe that for TruthfulQA and FACTOR, DoLa-static is slightly better than DoLa in most of the cases. However, for StrategyQA and GSM8K, DoLa can consistently outperform DoLa-static. Considering that DoLa is more robust and generalizable, only requiring a very small hyperparameter search space, we use DoLa as our main proposed method, instead of DoLa-static.

Table 12: Multiple choices results on TruthfulQA. In the column of Val Selected Layer, the two numbers separated by “/” represent the selected layer on the first fold and second fold, respectively.

Table 13: Multiple choices results on FACTOR. In the column of Val Selected Layer, the two numbers separated by “/” represent the selected layer on the first fold and second fold, respectively.

Table 14: Chain-of-thought reasoning results on StrategyQA and GSM8K.

Table 15: Multiple choices results on the FACTOR dataset.

Appendix J Random Layer Selection Baseline

One question about our proposed method is: how optimal is this dynamic layer selection method? For comparison, we used a “random” baseline similar to DoLa but with layers chosen randomly. Results in Table 15 show that this random approach performs worse than the original baseline, highlighting the importance of our JSD-based layer selection strategy.

Appendix K The Effects of Repetition Penalty

Figure 7: Baseline, CD, DoLa with different levels of repetition penalty on StrategyQA.

Figure 8: Baseline, CD, DoLa with different levels of repetition penalty on GSM8K.

In Section 2.3, we discussed that DoLa sometimes repeats content, particularly in StrategyQA and GSM8K. To mitigate this, we apply a repetition penalty. Figures 7 and 8 show that this improves the performance of DoLa on StrategyQA and GSM8K, but hurts the performance of the baseline. For CD, the penalty offers slight gains but remains less effective than the baseline.
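For readers unfamiliar with repetition penalties, the sketch below shows one common form, in the style of Keskar et al. (2019): logits of tokens that have already been generated are scaled so that they become less likely. The sign-aware handling of negative logits and the default `theta=1.2` are assumptions drawn from common implementations, and the function name is hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, theta=1.2):
    """Return a copy of `logits` with previously generated tokens penalized.

    Positive logits are divided by theta and negative logits are multiplied
    by theta, so the penalized token always becomes less likely (theta > 1).
    """
    penalized = list(logits)
    for token_id in set(generated_ids):
        value = penalized[token_id]
        penalized[token_id] = value / theta if value > 0 else value * theta
    return penalized

# Toy example: tokens 0 and 1 were already generated, token 2 was not.
logits = [2.4, -1.0, 0.5]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0])
```

Setting `theta=1.0` leaves the logits unchanged, which corresponds to decoding without a penalty.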

Appendix L Additional Examples for Qualitative Study on TruthfulQA

In Table 16, we show additional examples comparing the responses from LLaMA-33B with and without DoLa. All the responses are generated using greedy decoding.

Table 16: Additional short response examples from LLaMA-33B and DoLa with the questions from TruthfulQA.

Appendix M Qualitative Study for Pairwise Comparison by GPT-4

We show several examples in Vicuna QA with the long-sequence responses by LLaMA-33B, with and without DoLa, along with the judgment by GPT-4. In Tables 18, 19, and 20, we can see that DoLa can provide a more detailed answer or the correct result, showing its capability in factual accuracy, depth, and a better understanding.

Besides the examples where DoLa outperforms the baseline, we also show examples where DoLa underperforms the baseline according to GPT-4 judgment in Tables 21 and 22. We can observe that although DoLa tends to generate detailed factual information, sometimes it is not as relevant to the question as the baseline’s answer. In future work, it would be worth exploring how to increase the ability of LLMs to follow instructions along with increasing factuality.

Appendix N Small Language Models

We apply DoLa to the GPT2 Medium model with 335M parameters, to mimic the setting of Gera et al. (2023) and investigate the effects of DoLa on small LMs. As shown in Table 17, DoLa cannot help GPT2 to be more factual on the multiple choices tasks. We conjecture that small LMs do not learn enough factual knowledge due to their limited capacities. Thus, different layers in small LMs are not sufficiently distinct, and contrasting layers in small LMs cannot bring advantages to factuality.

Table 17: Applying DoLa to GPT2-Medium for multiple choices tasks.

Table 18: Qualitative Study for LLaMA-33B and DoLa with GPT-4 judgement.

Table 19: Qualitative Study for LLaMA-33B and DoLa with GPT-4 judgement.

Table 20: Qualitative Study for LLaMA-33B and DoLa with GPT-4 judgement.

Table 21: Qualitative Study for LLaMA-33B and DoLa with GPT-4 judgement.

Table 22: Qualitative Study for LLaMA-33B and DoLa with GPT-4 judgement.
