Title: KL-Divergence Guided Temperature Sampling

URL Source: https://arxiv.org/html/2306.01286

Published Time: Fri, 01 Dec 2023 02:02:02 GMT

Abstract

Temperature sampling is a conventional approach to diversifying large language model predictions. As temperature increases, the predictions become more diverse but also more vulnerable to hallucinations, that is, generated tokens that are sensible but not factual. One common approach to mitigating hallucinations is to provide source/grounding documents and train the model to produce predictions that bind to and are attributable to the provided source. There appears to be a trade-off between diversity and attribution. To mitigate this trade-off, we propose relaxing the constraint of a fixed temperature over decoding steps, together with a mechanism that guides the dynamic temperature according to the relevance of the source, measured through KL-divergence. Our experiments confirm the trade-off and show that our sampling algorithm outperforms the conventional top-k and top-p algorithms in conversational question-answering and summarization tasks.

1 Introduction

Generative large language models (LLMs) have enabled many applications in conversation (Thoppilan et al., 2022), code generation (Chen et al., 2021), and art creation (Yu et al., 2022). However, LLMs are known to hallucinate. While hallucinations may be acceptable for creative tasks, e.g. storytelling and art creation, they are not tolerable for other tasks, e.g. fact-seeking question-answering (QA) and text summarization.

There are a few reasons why LLMs hallucinate. One reason is that they are trained on massive datasets of text, which can contain errors and biases. Furthermore, if the model is asked to predict when the input text pattern is not supported by the training data distributions, it may therefore hallucinate information to complete the prediction plausibly.

To overcome the limitations of the training data, an alternative is to combine LLMs with an external trusted knowledge source, such as search engines (Thoppilan et al., 2022) and retrieval systems (Borgeaud et al., 2022). These approaches transform a factuality problem into query generation and source grounding problems. We refer to the setting of generating responses with source/grounding documents provided as contextual prediction.

Nonetheless, even if the query generation and the external trusted knowledge source are perfect and always provide a golden source, there is no guarantee that the generated responses are grounded in the source. One contributing factor is the randomness in temperature sampling. As shown in (Holtzman et al., 2019), text generated from pure sampling ($T=1.0$) is incoherent and almost unrelated to the source. That paper attributes this to the "unreliable tail": low-probability tokens that together carry a non-negligible probability mass in sampling. When a token in the unreliable tail is sampled, there is a chance that a hallucination occurs. The paper proposed nucleus sampling (top-p) as a static solution that captures the region of confidence and avoids the unreliable tail.

We hypothesize that there is a trade-off between diversity and attribution in contextual predictions. This paper examines this trade-off and proposes a decoding method to address it (code is available at https://github.com/google-research/google-research/tree/master/kl_guided_sampling). The proposed decoder dynamically adjusts the temperature at each decoding step, instead of using a fixed temperature for all decoding steps. The mechanism has two components: (1) a signal that indicates whether the source is relevant to the current decoding step; (2) a converter that takes this signal as input and adjusts sampling accordingly. For (1), we propose leveraging KL-divergence as a guiding signal that tells whether the source is relevant. For (2), we build on the finding in (Aksitov et al., 2023), where the authors suggested that temperature and model size have a similar scale of effect on attribution, and take temperature as the knob for attribution as well as diversity.

2 Background

2.1 Contextual Predictions

In contextual predictions, the LLM is expected to generate a response that is attributable to the source, if the source is relevant. From the perspective of natural language generation (NLG) tasks, the contextual prediction task is a prefix LM task, where the source is part of the prefix. For an encoder-decoder model architecture, the source is part of the input to the encoder; for a decoder-only architecture, the source is part of the prefix to the decoder.

In KL-divergence guided temperature sampling, for the purpose of comparing the distributions between two parallel decodings: one with source in the input and one without, the model needs to be able to handle inputs with and without source. In other words, training data must include examples with and without sources.

For some NLG tasks, the notion of source is explicit, and the ability of the model to handle either case (with or without source) comes for free from pretraining. For the conversational QA task over the QReCC dataset in Section 6.2 of (Aksitov et al., 2023), the authors encapsulated conversations and an optional source into inputs with advanced prompting.

For some other NLG tasks, on the other hand, the source is implicit and the notion of not having a source needs to be handled carefully. Consider a text summarization task, where most training examples are pairs of a paragraph and a summary. The model has never seen an empty input at training time, and hence its prediction without source at inference time may be unexpected. One simple workaround is to also train the model with empty inputs, as if it were a causal LM task. Please see more discussion in Section 7.

2.2 Temperature Sampling

In a conventional decoding step with the temperature sampling algorithm, the temperature is used to adjust the vocabulary probability mass function (PMF) according to

$$\Pr(v_k)=\frac{e^{l_k/T}}{\sum_i e^{l_i/T}} \tag{1}$$

where $v_k$ is the $k$-th vocabulary token, $l_k$ the corresponding logit, and $T$ a constant temperature. In this work, we follow the implementation of temperature sampling in the T5X library (see https://github.com/google-research/t5x).

At inference time, the temperature $T$ is adjusted:

•When $T=0$, the PMF becomes a Kronecker delta function, and the algorithm degenerates to a greedy algorithm, where the response is deterministic and likely repetitive;
•With larger $T$, the PMF becomes more evenly distributed. Tokens with low probability mass become more likely to be sampled; thus, the generated output becomes more diverse but comes with a chance of hallucination. Randomness in sampling avoids the repetitive responses of the greedy algorithm.

Please note that applying the same temperature to all vocabulary tokens $v_k$ preserves the order of the PMF, and this order helps to preserve the sensibleness of the predictions.
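Equation (1) and its two limiting cases can be illustrated with a minimal Python sketch. This is not the T5X implementation; the function names `temperature_pmf` and `sample_token` and the tie-breaking rule at $T=0$ are assumptions made for illustration only.

```python
import math
import random


def temperature_pmf(logits, temperature):
    """Return the PMF of equation (1) for a list of logits."""
    if temperature == 0:
        # Limit T -> 0: all mass on the arg-max token (greedy decoding).
        # Ties are broken by the lowest index, which is an assumption.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    shift = max(scaled)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Sample one token index from the temperature-adjusted PMF."""
    pmf = temperature_pmf(logits, temperature)
    return rng.choices(range(len(pmf)), weights=pmf, k=1)[0]
```

Because every logit is divided by the same $T$, the ranking of the tokens is unchanged for any $T>0$; only the sharpness of the PMF varies. Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.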

2.3 KL-Divergence

KL-divergence is a measure of the statistical difference between two distributions. The formula is as follows:

$$KL(p\|q)=\sum_k p_k\log\frac{p_k}{q_k} \tag{2}$$

where $p=\{p_k\}$ and $q=\{q_k\}$ are two PMFs.

At each decoding step, let $p$ be the token PMF when the model input contains the source and $q$ the token PMF when the model input contains no source. With this convention, the KL-divergence is a measure of how much the source matters:

•If the KL-divergence is small, the distributions $p$ and $q$ are similar regardless of whether the source is provided. In other words, the source is irrelevant;
•If the KL-divergence is large, the presence of the source is consequential to the PMF. In other words, the source matters.
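The two cases above can be checked numerically with a small Python sketch of equation (2). The function name `kl_divergence`, the use of the natural logarithm, and the `eps` clamp on $q_k$ are assumptions for illustration; the example PMFs are made up.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) of equation (2), in nats, for two PMFs over the same vocabulary."""
    total = 0.0
    for p_k, q_k in zip(p, q):
        if p_k > 0.0:  # terms with p_k = 0 contribute 0 by convention
            # Clamp q_k away from zero so the ratio stays finite (an assumption).
            total += p_k * math.log(p_k / max(q_k, eps))
    return total


# Hypothetical PMFs over a three-token vocabulary.
with_source = [0.90, 0.05, 0.05]     # the source makes the first token very likely
without_source = [0.30, 0.40, 0.30]  # without the source the model is unsure

source_matters = kl_divergence(with_source, without_source)
source_irrelevant = kl_divergence(without_source, without_source)
```

Here `source_matters` is clearly positive because the source reshapes the PMF, while `source_irrelevant` is zero because the two PMFs coincide.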

Mathematically, we can interpret KL-divergence as the mean of the pointwise mutual information (PMI). Let $s$ be the source and $x=\{x_t\}$ the generated response. At each decoding step $t$, we have $p_k=\Pr(x_t=v_k|x_{<t},s)$ and $q_k=\Pr(x_t=v_k|x_{<t})$, where $v_k$ is the $k$-th vocabulary token, $x_{<t}=\{x_0,\dots,x_{t-1}\}$ is the set of all previous decodes, and $\Pr(\cdot)$ stands for the empirical probability calculated by the LLM with $T=1$. As a result,

$$\log\frac{p_k}{q_k}=\log\frac{\Pr(x_t=v_k,s|x_{<t})}{\Pr(x_t=v_k|x_{<t})\Pr(s|x_{<t})},$$

which defines the PMI between the source $s$ and the decode at step $t$ being $v_k$, given all previous decodes $x_{<t}$.

The PMI measures the association between $s$ and $x_t=v_k$ given $x_{<t}$. Conditioned on previous decodes $x_{<t}$, when PMI $=0$, they are independent; when PMI $>0$, they are positively associated; otherwise, they are negatively associated. According to equation (2), KL-divergence is the mean of the PMIs over the PMF $p=\Pr(x_t|x_{<t},s)$. In other words, at each decoding step before sampling, the KL-divergence tells us, on average, how this decoding step associates with the source.
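The identity between KL-divergence and the mean PMI can be verified on a toy example. The sketch below assumes a made-up joint distribution $\Pr(x_t=v_k,s|x_{<t})$ over two candidate sources and three tokens; the names `joint` and `pmi_and_kl` are hypothetical and serve only this illustration.

```python
import math

# Hypothetical joint distribution Pr(x_t = v_k, s | x_<t): rows are candidate
# sources, columns are vocabulary tokens. The numbers are made up and sum to 1.
joint = {
    "s_a": [0.30, 0.05, 0.05],
    "s_b": [0.10, 0.30, 0.20],
}


def pmi_and_kl(joint, source):
    """Return (per-token PMI, KL(p || q), mean PMI under p) for one source."""
    pr_s = sum(joint[source])                       # Pr(s | x_<t)
    q = [sum(col) for col in zip(*joint.values())]  # Pr(x_t = v_k | x_<t)
    p = [j / pr_s for j in joint[source]]           # Pr(x_t = v_k | x_<t, s)
    pmi = [math.log(j / (q_k * pr_s)) for j, q_k in zip(joint[source], q)]
    kl = sum(p_k * math.log(p_k / q_k) for p_k, q_k in zip(p, q))
    mean_pmi = sum(p_k * pmi_k for p_k, pmi_k in zip(p, pmi))
    return pmi, kl, mean_pmi


pmi, kl, mean_pmi = pmi_and_kl(joint, "s_a")
```

For source `s_a`, the first token has positive PMI (the source makes it more likely) and the other two have negative PMI, while `kl` and `mean_pmi` agree up to floating-point error.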

2.4 Metrics

2.4.1 Metrics for Attribution

We follow the definition of attribution in (Rashkin et al., 2023) and, following (Rashkin et al., 2023; Honovich et al., 2022), cast attribution evaluation as an instance of the natural language inference (NLI) task by treating the source as the premise and the response as the hypothesis.

For the conversational QA tasks, we follow the conventions in Section 5.1 of (Aksitov et al., 2023) and use a T5-11B model finetuned on MNLI, SNLI, FEVER, PAWS, SciTail and VitaminC (a newer version from the authors). We follow the flavor v3 in the paper, where the NLI premise and hypothesis are "{Source} {Question}" and "{Question} {Answer}", respectively, for the QA tasks, and follow the convention for adapting the NLI when the premise is very long.

For the summarization task, we follow the conventions in Section 3 of (Aharoni et al., 2022). We use the same model, namely an mT5-XXL model finetuned on ANLI and XNLI. We also follow the convention of having the NLI premise and hypothesis as "{Source}" and "{Summary}", respectively.

2.4.2 Metrics for Diversity

We adopt the likelihood evaluation metrics in (Holtzman et al., 2019). However, we only present the experimental results with self-BLEU4. According to the paper, the metrics self-BLEU4, self-BLEU5, and Zipf are similar; see Figure 8 and Section 5.2. It has been argued that perplexity is not the best metric for measuring the performance of language models because natural language does not always follow the most probable path. Finally, we dropped the repeated n-gram metric, as we do not see repetition as a major issue in our experiments, though it exists.

Besides conventional metrics, we create an additional metric, var-rank, to measure the variance of the token ranks. The metric var-rank is a direct statistical measurement of how temperature affects sampling. For example, in a greedy algorithm, all tokens have rank $0$ and, hence, var-rank equals $0$. In sampling with top-k equal to $40$, all token ranks are in the range $[0,39]$.

For each example $k$ in the evaluation dataset, we generate one decode, represented by token IDs $I_k=\{i_{kj}\}$ and token ranks $R_k=\{r_{kj}\}$, where $i_{kj}$ is a positive integer (not including the padding $0$s) and $r_{kj}$ is a non-negative integer. For all examples we have two lists of lists: one for token IDs $\hat{I}=\{I_k\}$ and one for token ranks $\hat{R}=\{R_k\}$. For var-rank, we flatten $\hat{R}$ into a single list and calculate the variance. For self-BLEU4, we calculate self-BLEU by comparing all pairs of outputs across the entire dataset. For var-rank, the higher the better (or more diverse). For self-BLEU, the lower the better (or more diverse).
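As a rough illustration of var-rank, the following Python sketch computes token ranks and the variance of the flattened rank list. The helper names `token_rank` and `var_rank`, the strict tie-breaking rule, and the choice of population variance (rather than sample variance) are assumptions; the text above does not specify them.

```python
from statistics import pvariance


def token_rank(logits, token_id):
    """Rank of token_id under the logits: 0 for the most probable token.

    Ties are resolved in favor of the queried token (an assumption).
    """
    return sum(1 for l in logits if l > logits[token_id])


def var_rank(ranks_per_example):
    """Variance of all token ranks after flattening the per-example lists."""
    flat = [r for ranks in ranks_per_example for r in ranks]
    return pvariance(flat)


# Greedy decoding always picks rank 0, so var-rank is 0.
greedy_ranks = [[0, 0, 0], [0, 0]]
# Hypothetical sampled decodes with a few lower-ranked tokens.
sampled_ranks = [[0, 1], [0, 3]]
```

With these made-up inputs, `var_rank(greedy_ranks)` is zero and `var_rank(sampled_ranks)` is positive, matching the intuition that more randomness in sampling spreads the ranks.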

3 KL-Divergence Guided Temperature Sampling

3.1 Architecture

Figure 1: System architecture

The system architecture is illustrated in Figure 1. For convenience, we take encoder-decoder models as an example, but the methodology also works for decoder-only models. Assuming data preprocessors and feature converters have prepared each example with two inputs, the model runs two parallel decodings: the bottom one with source in the input and the top one without source in the input.

In each decoding step, both decoders first compute logits and the token PMFs. The PMFs from the two parallel decodings are used to calculate the KL-divergence according to equation (2). The converter takes this KL-divergence as input and adjusts the temperature according to equation (3); see Section 4. Finally, we apply the temperature to the logits according to equation (1), followed by any additional top-k and top-p masking and random sampling. The sampled next token is fed back to both parallel decoders for the next decoding step.
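The decoding step described above can be sketched in plain Python as follows. This is a simplified illustration under several assumptions, not the released implementation: the temperature is applied to the logits of the decoder that sees the source, top-k and top-p masking are omitted, a temperature below `1e-6` is treated as greedy decoding, and the names `guided_step`, `t0` and `sigma` are chosen here for readability.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Equation (1), computed with a max-shift for numerical stability."""
    scaled = [l / temperature for l in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """Equation (2); q is clamped away from zero (an assumption)."""
    return sum(pk * math.log(pk / max(qk, eps)) for pk, qk in zip(p, q) if pk > 0.0)


def guided_step(logits_with_source, logits_without_source, t0, sigma, rng):
    """One decoding step; returns (sampled token index, temperature used)."""
    p = softmax(logits_with_source)     # PMF with source, at T = 1
    q = softmax(logits_without_source)  # PMF without source, at T = 1
    kl = kl_divergence(p, q)
    temperature = t0 * 0.5 ** (kl / sigma)  # equation (3)
    if temperature < 1e-6:
        # A vanishing temperature is treated as greedy decoding.
        token = max(range(len(p)), key=lambda i: logits_with_source[i])
    else:
        pmf = softmax(logits_with_source, temperature)
        token = rng.choices(range(len(pmf)), weights=pmf, k=1)[0]
    return token, temperature
```

In a full decoder, the returned token would be appended to the prefix of both parallel decodings before the next step, so that the two PMFs are always conditioned on the same previous decodes.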

4 Converter

The converter takes KL-divergence as input and computes the temperature for the decoding step. For simplicity, we use the following exponential decay function:

$$T=T_0\cdot\left(\frac{1}{2}\right)^{\frac{KL(p\|q)}{\sigma}} \tag{3}$$

where $T_0$ is the baseline temperature and $\sigma$ is a hyperparameter that specifies the half-life of the decay. Please note that when $\sigma$ is very small, this function degenerates to $T=0$; when $\sigma$ is very large, this function degenerates to $T=T_0$.
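The half-life interpretation of $\sigma$ can be made concrete with a few evaluations of equation (3). The function name `guided_temperature` and the sample values below are illustrative assumptions.

```python
def guided_temperature(kl, t0, sigma):
    """Equation (3): the temperature halves for every sigma units of KL-divergence."""
    return t0 * 0.5 ** (kl / sigma)


t0 = 0.7
no_divergence = guided_temperature(0.0, t0, sigma=0.1)   # KL = 0        -> T = T_0
one_half_life = guided_temperature(0.1, t0, sigma=0.1)   # KL = sigma    -> T = T_0 / 2
two_half_lives = guided_temperature(0.2, t0, sigma=0.1)  # KL = 2 sigma  -> T = T_0 / 4
infinite_sigma = guided_temperature(5.0, t0, sigma=float("inf"))  # T stays at T_0
```

Setting `sigma` to infinity recovers fixed-temperature sampling, since `kl / sigma` evaluates to zero, whereas a very small `sigma` drives the temperature toward zero as soon as the two PMFs differ.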

Figure 2: converter function

Figure 3 demonstrates the KL-divergence and the corresponding temperature over decoding steps in a real example.

Figure 3: An example of KL-divergence to temperature sequences over decoding steps with $\sigma=0.1$.

5 Experiments

5.1 Baseline and Guided Groups

In the following experiments, we compare different decoding algorithms using the metrics in Section 2.4. We categorize the experiments into two groups: the baseline group and the guided group. In the baseline group, there are three decoding algorithms, namely baseline T (T stands for temperature), baseline top-p, and baseline top-k, with the following settings:

•Baseline T: top-k=40, top-p=1.0, T $\in\{0, 0.1, \cdots, 0.9, 1.0\}$
•Baseline top-p: top-k=all, T=1.0, top-p $\in\{0, 0.01, 0.05, 0.1, 0.2, \cdots, 0.8, 0.9, 0.95, 0.99, 1.0\}$
•Baseline top-k: T=1.0, top-p=1.0, top-k $\in\{1, 2, 5, 10, 20, 40, 80, 160, 320, 640, 1280, \text{all}\}$

Please note that top-p=1.0 is equivalent to not applying top-p masking, and similarly for top-k=all.
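For readers unfamiliar with the two masking schemes, the sketch below shows one common way to apply them to a PMF. It is an assumption-laden illustration rather than the T5X implementation: the function name `mask_top_k_top_p` is hypothetical, top-k is applied before top-p, and the nucleus is taken as the smallest prefix of the sorted tokens whose cumulative mass reaches top-p.

```python
def mask_top_k_top_p(pmf, top_k=None, top_p=1.0):
    """Zero out tokens outside the top-k set and the top-p nucleus, then renormalize.

    top_k=None stands for top-k=all; top_p=1.0 disables nucleus masking.
    """
    # Token indices sorted from most to least probable.
    order = sorted(range(len(pmf)), key=lambda i: pmf[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p < 1.0:
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += pmf[i]
            if mass >= top_p:  # smallest prefix whose mass reaches top_p
                break
    else:
        kept = order
    kept = set(kept)
    total = sum(pmf[i] for i in kept)
    return [pmf[i] / total if i in kept else 0.0 for i in range(len(pmf))]
```

Under these conventions, `top_k=1` and `top_p=0.0` both keep only the most probable token, which is the greedy algorithm, while the defaults leave the PMF unchanged.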

These baseline experiments are illustrated in Figure 4, where the axes are T, top-k, and top-p, respectively. There are three intersections between the baselines:

•Baseline top-p intersects with baseline top-k at (top-k=all, top-p=1, T=1);
•Baseline T intersects with baseline top-k at (top-k=40, top-p=1, T=1);
•The open ends of all three baselines implicitly intersect at the greedy algorithm (top-k=1, or top-p=0, or T=0).

In the guided group, there are two decoding algorithms guided by KL-divergence, namely guided T and guided top-p, with the following settings:

•Guided T: top-k=40, top-p=1.0, fix $T_0=0.7$ but change $\sigma\in\{10^{-4}, 3\times10^{-4}, 10^{-3}, 3\times10^{-3}, 10^{-2}, 3\times10^{-2}, 0.1, 0.3, 1, 3, \infty\}$
•Guided top-p: top-k=all, $T_0=1.0$, fix top-p=0.95 but change $\sigma\in\{10^{-4}, 3\times10^{-4}, 10^{-3}, 3\times10^{-3}, 10^{-2}, 3\times10^{-2}, 0.1, 0.3, 1, 3, \infty\}$

The naming may be confusing, as the guided top-p experiment does not change top-p. Instead, KL-divergence guides the temperature while top-p is kept fixed. We name it guided top-p because it is meant to be compared with baseline top-p. Similarly, guided T is meant to be compared with baseline T.

Figure 4: Baseline experiments

Figure 5: Guided experiments

6 Conversational QA tasks

6.1 Procedure

We use the dataset Question Rewriting in Conversational Context (QReCC) (Anantha et al., 2020) – a knowledge intensive open-domain QA dataset consisting of 14K conversations with 81K QA pairs. Each conversation consists of a series of questions and answers as conversation history, followed by a final query. A golden answer, along with the webpage the answer was extracted from, is also provided for each conversation.

We follow the same setting as (Aksitov et al., 2023) to preprocess and filter the dataset and to construct queries to an LLM. In preprocessing, we use a fully decontextualized version of QReCC. The webpage from which the answer was extracted is crawled and processed as text. We follow Table 3 in that paper to apply filters that remove examples that are not well-formed. We take the final 2829 examples in the training split for our experiments.

We format the input queries to PaLM (Chowdhery et al., 2022) models with advanced prompting as described in Section 6.2 of (Aksitov et al., 2023):

•For the input with source: without instructions, with the golden evidence, and with dialog history;
•For the input without source: without instructions and evidence, and with dialog history.

We take the first turn (ending at the first [eot]) in the generated response as the model’s response to the conversation.

Figure 6 shows the attribution-diversity trade-off curves of different decoding algorithms over the PaLM-8B model. Each datapoint summarizes the attribution and diversity of all 2829 QReCC examples under a specific algorithm. For each example, we construct and send the input query to the PaLM-8B model to generate a response and the token ranks in all decoding steps. We grade attribution with E2E NLI on each response and take the mean over all responses (y-axis in Figure 6). We grade diversity by var-rank and self-BLEU4. For the former, we flatten all token ranks from all examples and calculate the variance (x-axis in Figure 6 left); for the latter, we follow the convention in (Holtzman et al., 2019) (x-axis in Figure 6 right). Please note that the x-axis is presented in log scale for var-rank. Since the greedy algorithm has all ranks equal to 0, the variance of the rank is 0 and its logarithm (x-axis) is at negative infinity, causing the artifact of the horizontal lines in Figure 6 left.

6.2 Results

First, let’s look at the three baseline experiments in Figure 6 left. As sanity checks, baseline top-k (green) intersects baseline T (blue) and baseline top-p (red) once each, and all three implicitly intersect at $x=-\infty$ (greedy algorithm). Furthermore, all three lines show the trade-off between diversity and attribution. Ideally, we prefer a decoding algorithm with its datapoint at the top-right corner, meaning that the algorithm has high diversity and high attribution. From this perspective, baseline top-p is better than baseline T, and baseline top-k is the worst.

To compare the baseline group with the guided group, we compare baseline T (blue) with guided T (cyan), and baseline top-p (red) with guided top-p (magenta), respectively. Evidently, guided T outperforms baseline T marginally, while guided top-p outperforms baseline top-p by a large margin. This is also confirmed by Figure 6 right, where the top-left corner is better.

Concrete examples of the generated texts are in Appendix A. We present several examples to illustrate that the algorithm works as intended. For illustration, we select one algorithm from guided top-p and two algorithms from baseline top-p so that they are aligned horizontally or vertically for comparison. We also color text tokens by temperature. These examples are cherry-picked. In practice, many examples in guided top-p err on either side of their baseline counterparts. That is, there are many examples where all responses are identical (not diverse), and many where the responses hallucinate.

We applied the same set of experiments to the PaLM-62B model in Figure 7. PaLM-62B is expected to be more performant than PaLM-8B, and this is confirmed by the fact that all curves shift upward. For example, with the greedy algorithm, attribution increases from 61% (PaLM-8B) to 67.5% (PaLM-62B). For the comparison of the trade-off curves among the baseline and guided groups, the trend is similar:

•Comparing the baseline experiments: baseline top-p $>$ baseline T $>$ baseline top-k;
•Comparing the baseline and guided experiments: guided top-p $\gg$ baseline top-p and guided T $>$ baseline T.

In summary, conversational QA is a perfect example to illustrate the advantage of KL-divergence guided temperature sampling. In a conversational QA setting, there are several previous turns that act as few-shot examples for the final QA task. One characteristic of QA tasks is that the information provided by the source is concentrated in a few response tokens, which leaves room to selectively change temperatures across decoding steps. Although the proposed approach cannot mitigate hallucinations at low temperatures and in the greedy algorithm, these errors can be mitigated by increasing the model size.

Figure 6: QReCC dataset over PaLM-8B model

Figure 7: QReCC dataset over PaLM-62B model

Figure 8: XLSum dataset over MT5 XL model

Figure 9: XLSum dataset over MT5 XL model with optional inputs. In all figures, the y-axis represents attribution with the automatic metric E2E NLI, and the x-axis represents diversity with the automatic metrics var-rank (left) and self-BLEU4 (right). Each figure compares the trade-off curves of $5$ experiments. In particular, we compare baseline T with guided T, and baseline top-p with guided top-p.

7 Summarization tasks

7.1 Procedure

We take XLSum (Hasan et al., 2021), a multilingual abstractive summarization dataset for 44 languages, as the finetuning and evaluation dataset. We first finetune the pretrained MT5-XL model (Xue et al., 2020) on the XLSum training split for $15000$ steps, with batch_size $16$, sequence length = {’inputs’: $1024$, ’targets’: $128$}, and all $44$ languages. We evaluate this finetuned model on $3000$ examples from the validation split in the $13$ languages supported by XNLI (for a valid attribution metric).

However, the notion of not having a source is not well-posed. Conventionally, the input to the model is the article to be summarized. While it is natural to define the whole input as the source, implying that not having source refers to an empty input, the LLM is never finetuned with empty inputs in the summarization tasks.

7.2 Results

To begin with, let’s naively treat all input text as the source. The results are shown in Figure 8. Each guided group is at least as performant as its corresponding baseline group. The datapoints in the guided group are shifted to the left, meaning that the majority of the KL-divergences of the guided group are in $[3.0, \infty)$. This is because the model was never trained with empty inputs; as a result, the PMF with empty inputs becomes very different from that with normal inputs, which in turn causes the KL-divergence to be always large.

To overcome the issue, we modified the summarization finetuning task to include both intact and empty inputs, so that the model summarizes the input article when it exists; otherwise, it generates/hallucinates a sensible short summary. In Figure 9 left, we notice that guided top-p (magenta) is more performant than baseline top-p (red), and similarly for the comparison between guided T (cyan) and baseline T (blue), although the gap is smaller. In other words, for the same diversity, guided top-p has higher attribution than baseline top-p; for the same attribution, guided top-p has higher diversity than baseline top-p. A similar conclusion can be drawn from Figure 9 right, where the diversity axis (x-axis) is replaced by self-BLEU4. Concrete examples of the generated texts comparing baseline top-p (top-p=0.4 and 0.7) and guided top-p ($\sigma=1.0$) are in Appendix B.

8 Discussion

The experiments confirm the hypothesis that there is a trade-off between attribution and diversity, and that our proposed algorithm mitigates this trade-off. However, the gain in conversational QA tasks is evidently more significant than that in the summarization tasks.

For many NLG tasks, the source only matters for a few tokens in the response, and the rest of the response tokens are just language that glues the information together. For example, in a QA task with the question "What is the weather today?" along with source from a weather forecast website, the response could be:

•The weather today is rainy.
•It is rainy today.
•The answer is rainy.

Notice that only the token rainy is relevant to the source. Because the relevance is concentrated in only a few tokens, there is an opportunity to selectively adjust temperatures.

The methodology is not perfect. One caveat is that the computation is doubled because there are two parallel decodings. Another caveat is that when the KL-divergence is small, the source may be truly irrelevant, or it may be relevant but not consequential, e.g. the model already knows the fact from its memory, so the source doesn’t provide additional information. Having a low KL-divergence does not necessarily mean the source is irrelevant, but having a high KL-divergence definitely means the source matters.

9 Conclusion

This paper proposed a decoding algorithm to improve LLM attribution when a source is present. The algorithm consists of two parallel decoding steps, one with the source in the input and one without. The KL-divergence between the two token distributions indicates how relevant the source is to the token to be sampled in the decoding step. We use this signal to adjust the sampling temperature, improving attribution when the source is relevant and improving diversity when it is irrelevant. Although the algorithm doubles the computation because of the parallel decodings, our experiments on conversational QA and summarization tasks show improvements over the conventional top-k and top-p algorithms, with a particularly large margin on conversational QA tasks.

Limitations

Compared with conventional temperature sampling, our proposed algorithm requires twice the computation because of the two parallel decodings. State-of-the-art LLMs (∼O(100B) parameters) are expensive to serve. Any algorithm that further increases the computation may hinder its real-world applicability.

On the other hand, the algorithm also requires the model to be large enough, for two reasons. First, the attribution of the algorithm is roughly upper bounded by that of greedy decoding with the same LLM, and greedy decoding improves as the model size increases. Second, the algorithm relies on the in-context learning of the LLM it is paired with. In particular, the two token PMFs, one with the source in the input and one without, should differ selectively according to the token's relevance to the source. In-context learning is an emergent ability that only appears once a model is large enough.

Ethics Statement

The proposed sampling algorithm must be paired with an LLM and a source provider, e.g. a retrieval system or a search engine. The integrated system may inherit the bias and privacy concerns of the selected LLM and source provider. Compared with temperature sampling, the responses generated by our proposed algorithm fall within the set of responses that temperature sampling can generate. Our proposed algorithm only eliminates (and does not add) responses that are likely hallucinating, when compared with its baseline.

References

Aharoni et al. (2022) Roee Aharoni, Shashi Narayan, Joshua Maynez, Jonathan Herzig, Elizabeth Clark, and Mirella Lapata. 2022. mface: Multilingual summarization with factual consistency evaluation. arXiv preprint arXiv:2212.10622.
Aksitov et al. (2023) Renat Aksitov, Chung-Ching Chang, David Reitter, Siamak Shakeri, and Yunhsuan Sung. 2023. Characterizing attribution and fluency tradeoffs for retrieval-augmented large language models. arXiv preprint arXiv:2302.05578.
Anantha et al. (2020) Raviteja Anantha, Svitlana Vakulenko, Zhucheng Tu, Shayne Longpre, Stephen Pulman, and Srinivas Chappidi. 2020. Open-domain question answering goes conversational via question rewriting. arXiv preprint arXiv:2010.04898.
Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR.
Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
Hasan et al. (2021) Tahmid Hasan, Abhik Bhattacharjee, Md Saiful Islam, Kazi Samin, Yuan-Fang Li, Yong-Bin Kang, M Sohel Rahman, and Rifat Shahriyar. 2021. Xl-sum: Large-scale multilingual abstractive summarization for 44 languages. arXiv preprint arXiv:2106.13822.
Holtzman et al. (2019) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
Honovich et al. (2022) Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, and Yossi Matias. 2022. True: Re-evaluating factual consistency evaluation. arXiv preprint arXiv:2204.04991.
Rashkin et al. (2023) Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. 2023. Measuring attribution in natural language generation models. to appear in Computational Linguistics.
Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
Xue et al. (2020) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.
Yu et al. (2022) Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. 2022. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789.

Appendix A Examples from QReCC

According to Figure 6, we choose guided top-p (σ=0.1), baseline top-p (=0.3), and baseline top-p (=0.7) for comparison. Responses are deduplicated and some source text is skipped to save space. For each example, we decode the same input 10 times. See Table 1, Table 2, and Table 3.

Query Fact: Bure was selected 113th overall in the sixth round Draft by the Vancouver Canucks, following his rookie season with CSKA Moscow. The pick was controversial, as the Canucks had chosen him seemingly a year ahead of his eligible draft season. At the age of 18, he was available to be chosen in the first three rounds of the draft, but to be selected any later, he would have needed to play at least two seasons—with a minimum of 11 games per season—for his elite-level Soviet club, the Central Red Army. [22] While most teams believed he was ineligible, the Canucks’ head scout at the time, Mike Penny, discovered Bure had played in additional exhibition and international games to make him an eligible late-round draft choice a year early. [23] [24] Jack Button, the Washington Capitals ’ director of player personnel, admitted "everybody would have taken him earlier. We assumed he was not eligible… you’ve got to give the Canucks credit for doing their homework." [25] Several other teams either had similar knowledge or had pursued Bure, but there was confusion as to the legitimacy of the extra games. The Detroit Red Wings had asked league vice president Gil Stein about Bure’s availability before their fifth-round pick, but were told he was not eligible. [22] They later decided to select him with their sixth-round pick, 116th overall, and settle his eligibility later. The Canucks selected Bure three picks ahead of Detroit’s turn. [26] Meanwhile, Winnipeg Jets general manager Mike Smith , claimed he made an offer to the Soviet Ice Hockey Federation that would involve three years of transfer payments before Bure would be allowed to join the Jets; however Smith did not have any plans to draft Bure in 1989 as he believed he was ineligible. [27]

[eot] 0 -1 0 What country was Pavel Bure born? [eot] 1 0 1 Pavel Bure was born in Moscow in 1971. [eot] 2 1 0 What team did Pavel Bure play for as an adult? [eot] 3 2 1

Greedy algorithm
The Vancouver Canucks. [x10]

Baseline top-p (=0.3)
Pavel Bure played for the Vancouver Canucks. [x4]
The Vancouver Canucks. [x6]

Baseline top-p (=0.7)
Pavel Bure played for the Vancouver Canucks from 1991 to 1994.
Vancouver Canucks. [x3]
Pavel Bure played for the Vancouver Canucks in 1991-92, and 1992-93.<0x0A><0x0A>Source: https://en.wikipedia.org/wiki/Pavel_Bure
Bure played for the Washington Capitals from 1993 to 1997.
The Florida Panthers. [x2]
Pavel Bure played for the Vancouver Canucks for 14 years, playing from 1991 to 2004.
The New York Rangers.

Guided top-p (σ=0.1)
The Vancouver Canucks.
Pavel Bure played in the National Hockey League for the Vancouver Canucks.
Pavel Bure played professional hockey with the Vancouver Canucks of the National Hockey League.
Pavel Bure played for the Vancouver Canucks. [x4]
Pavel Bure played for the Vancouver Canucks for 20 seasons.
Pavel Bure played 20 NHL seasons for the Vancouver Canucks (1989-1996) and Florida Panthers (1996-2002)
Pavel Bure played professional hockey in the National Hockey League (NHL) for the Vancouver Canucks, New York Rangers and Florida Panthers.<0x0A><0x0A>He scored 346 goals and 721 assists for 1,067 points, as well as 1,437 penalty minutes in 1,323 career games.

Table 1: Prediction examples of the QReCC conversational QA task.

Query Fact: 4.5.4 Ecuadorian 4.5.5 Colombian 4.5.6 Salvadoran 5 See also 6 References 7 External links Population [ edit ] Historical population Year Pop. ±% 1698 4,937 — 1712 5,840 +18.3% 1723 7,248 +24.1% 1737 10,664 +47.1% 1746 11,717 +9.9% 1756 13,046 +11.3% 1771 21,863 +67.6% 1790 33,131 +51.5% 1800 60,515 +82.7% 1810 96,373 +59.3% 1820 123,706 +28.4% 1830 202,589 +63.8% 1840 312,710 +54.4% 1850 515,547 +64.9% 1860 813,669 +57.8% 1870 942,292 +15.8% 1880 1,206,299 +28.0% 1890 1,515,301 +25.6% 1900 3,437,202 +126.8% 1910 4,766,883 +38.7% 1920 5,620,048 +17.9% 1930 6,930,446 +23.3% 1940 7,454,995 +7.6% 1950 7,891,957 +5.9% 1960 7,781,984 -1.4% 1970 7,894,862 +1.5% 1980 7,071,639 -10.4% 1990 7,322,564 +3.5% 2000 8,008,288 +9.4% 2010 8,175,133 +2.1% 2014 8,491,079 +3.9% 2018 8,398,748 [19] -1.1% 1880 & 1890 figures include part of the Bronx. Beginning with 1900, figures are for consolidated city of five boroughs. For the same area before 1900, see #Historical Population data , below. Sources: 1698–1771, [15] 1790–1990, [16] 2000 and 2010 Censuses, [17] and 2014 Census estimates [18] New York City is the most populous city in the United States, with an estimated 8,491,079 people living in the city, according to U.S. Census estimates dating July 1, 2014 [18] (up from 8,175,133 in 2010; 8.0 million in 2000; and 7.3 million in 1990). [17] This amounts to about 40% of New York State’s population and a similar percentage of the metropolitan regional population. New York’s two key demographic features are its population density and cultural diversity . The city’s population density of 26,403 people per square mile (10,194/km²), makes it the densest of any American municipality with a population above 100,000. [20] Manhattan ’s population density is 66,940 people per square mile (25,846/km²), highest of any county in the United States. [21] [22]

[eot] 0 -1 0 what percent of the us population lives in new york city? [eot] 1 0 1 2.6% of the US population lives in New York City [eot] 2 1 0 how many people live in New York City? [eot] 3 2 1

Greedy algorithm
8.491 million people live in New York City [x10]

Baseline top-p (=0.3)
8.491 million people live in New York City [x10]

Baseline top-p (=0.7)
8.491 million people live in New York City [x2]
8.49 million people live in New York City.
8,491,079 people live in New York City
more than 8.5 million people
more than the population of any other state.
8,175,133 of the 12.5 million people living in New York State live in New York City
8.4 million people live in New York City.
8,175,133 in 2010 and 8,491,079 in 2014. [17] New York City has the largest city-wide metropolitan population in the United States, at 20.3 million people. [18] The United States Census Bureau estimates that the New York City metropolitan area has a population of 20.9 million, making it the most populous metropolitan area in the United States. [19] The New York City metropolitan area is the fourth-largest in the United States by population, behind Los Angeles
8.491 million according to this source.<0x0A><0x0A>https://en.wikipedia.org/wiki/New_York_City

Guided top-p (σ=0.1)
8.491 million people live in New York City [x2]
8.491 million people [x4]
8,491,079 people live in New York City
8.491 million people live in NYC.
About 8.4 million people (8,491,079) live in the city, making it one of the most densely populated cities in the world and one of the most populous in the United States.
How to find out

Table 2: Prediction examples of the QReCC conversational QA task.

Query Fact: Iglesias starred alongside Antonio Banderas , Salma Hayek and Johnny Depp in the Robert Rodriguez film Once Upon a Time in Mexico , in which he played the well-spoken gun-wielding Lorenzo. [57] In 2007, he had a guest appearance in the TV comedy Two and a Half Men as a carpenter/handyman. He also guest starred as Gael, an Argentinean guitar playing/surfer/massage therapist love interest of Robin in season 3 of the CBS hit TV show How I Met Your Mother . [58] Iglesias also played the part of an evil Roman emperor in a Pepsi advertising in 2004. [59] He’s also starred in commercials for Tommy Hilfiger , Doritos and for Viceroy watches. Personal life [ edit ] Anna Kournikova In late 2001, Enrique Iglesias started a relationship with Russian tennis player Anna Kournikova . They reportedly split in October 2013 but have since reconciled. [60] The couple had twins, Nicholas and Lucy, on Saturday, 16 December 2017, at South Miami Hospital. [61] In 2003, Iglesias received surgery to remove a circular mole from the right side of his face, citing concerns that over time it could become cancerous. [62] Iglesias lives an active lifestyle and loves playing sports, surfing, and swimming in the ocean. In a 2014 interview, he stated: "I actually don’t have a specific routine. I mostly do push-ups and bench presses when I am at home. When I am on the road, I go to the gym to exercise. My shows are very active and provide a lot of cardio exercise." [63]

[eot] 0 -1 0 Where was Enrique Iglesias born? [eot] 1 0 1 Enrique Iglesias was born in Madrid, Spain. [eot] 2 1 0 Who were Enrique Iglesias’ parents? [eot] 3 2 1 Spanish singer Julio Iglesias and Filipina socialite and magazine journalist Isabel Preysler [eot] 4 3 0 Was Enrique Inglesias married? [eot] 5 4 1 In late 2001, Enrique Iglesias started a relationship with Russian tennis player Anna Kournikova. [eot] 6 5 0 Did Enrique Iglesias have any children? [eot] 7 6 1

Greedy algorithm
Yes, he has two children. [x8]
Yes, he has two children with Anna Kournikova. [x2]

Baseline top-p (=0.3)
Yes, he has two children. [x3]
Yes, he has two children with Anna Kournikova. [x5]
He has two children, Nicholas and Lucy.
He has two children with Anna Kournikova, Nicholas and Lucy.

Baseline top-p (=0.7)
Yes, he has two children with Anna Kournikova. [x2]
He has two children.
He has a daughter, Natalia and a son, Nicholas.
He has a daughter named Natalia.
He had twins with his wife Anna, Nicholas and Lucy.
Two sons: Nicholas (born 2003) and Lucy (born 2006) with his Russian tennis player girlfriend Anna Kournikova.
Yes, two. Nicholas and Lucy.
Two daughters, Maria Isabel and Sofia.
Two children: Nicholas and Lucy

Guided top-p (σ=0.1)
Yes. [x2]
Yes, he has two children with Anna Kournikova. [x2]
Yes, he had two children with Anna Kournikova. [x3]
Yes, he did, he had two children with Anna Kournikova.
Yes, he has two children, Nicholas and Lucy. [x2]

Table 3: Prediction examples of the QReCC conversational QA task.

Appendix B Examples from XLSum

According to Figure 9, we choose guided top-p (σ=1.0), baseline top-p (=0.4), and baseline top-p (=0.7) for comparison. For each example, we decode the same input 5 times. See Table 4, Table 5, and Table 6.

Query Farmer Robert Fidler 66, built the four-bedroom home near Redhill, Surrey, without planning permission. He was told on Monday he would be jailed for his "defiance" if the property was not demolished by June. Reigate and Banstead Borough Council said it would send in the bulldozers if Mr Fidler did not knock down the house. Mr Fidler would have to repay the council’s costs. "We don’t have choice - ultimately we have to ensure that these notices are complied with," said planning committee chairman Mark Brunt. "On 6 June we expect the site to be clear and the building to be gone." Mr Fidler told the High Court he had sold the house at Honeycrock Farm in Salfords to an Indian businessman and that the injunction ordering demolition was invalid. The house was built on greenbelt land between 1999 and 2006. The council first ordered its demolition in 2007. Mr Fidler claimed the council wanted to destroy his life and that its case was based on lies and deception. The council said there was another house on the farm where Mr Fidler and his family used to live. "I urge him to come and continue to talk to the council and look at realistic options for providing accommodation for his family on the site," said Mr Brunt. Planning law expert Martin Goodall said Mr Fidler had reached the end of the road and would have to demolish the house. "There are very strong policies against building isolated houses in the greenbelt or open countryside and local authorities cannot allow it," he said.

Greedy algorithm
A council has ordered demolition of a house in the greenbelt of Surrey. [x5]

Baseline top-p (=0.4)
A council has ordered demolition of a house on green belt land in Hampshire.
A farmer has told the High Court he would be jailed if he did not demolish his house.
A council has told a farmer to demolish his house on green belt land.
A farmer has told the High Court he would be jailed if he did not demolish his house on green belt land.
A farmer has taken legal action against a council which has ordered his house to be demolished.

Baseline top-p (=0.7)
A farmer in Hampshire has been told he would be jailed for his "defiance" if he did not demolish his house.
An Isle of Man farmer has been told he would be jailed if his green belt home was not demolished by a council.
A council has given the landlord notice to demolish his house after it ordered it to be demolished.
Residents of a farm in south London have been told they have to demolish a house that they believed was built without planning permission.
A man has begun a legal battle against a council because he had been told to demolish his house on green belt land.

Guided top-p (σ=1.0)
A council has ordered demolition of a house in the green belt of Surrey.
A council has been told to send bulldozers to demolish a house on green belt land.
A council has warned a farmer to demolish his house on green belt land.
A council has ordered demolition of a house that is on green belt land.
A council has told a farmer that he would be jailed if he did not demolish his house in a green belt.

Table 4: Prediction examples of the XLSum summarization task.

Query A total of 84 people died in the last 24 hours, Governor Andrew Cuomo said on Saturday, compared with 109 a day before. During the height of the outbreak in April, more than 1,000 people a day were losing their lives in worst-hit US state. "In my head, I was always looking to get under 100," Mr Cuomo said. "It doesn’t do good for any of those 84 families that are feeling the pain," he said at his daily briefing, but added that the drop was a sign of "real progress". Mr Cuomo announced on Friday that groups of up to 10 people could gather "for any lawful purpose" anywhere in the state, including New York City. But, he added: "If you don’t have to be with a group of 10 people don’t be with a group of 10 people." New York state was once the epicentre of the US coronavirus outbreak, with more than 28,000 deaths, according to Johns Hopkins University. The US has the biggest death toll from Covid-19 at 96,000. The UK is second with more than 36,000.

Greedy algorithm
New York state has fallen below 100 deaths a day in the coronavirus outbreak, according to the governor. [x5]

Baseline top-p (=0.4)
New York state has fallen short of the 1,000 people a day which the state was once the epicentre of the coronavirus outbreak.
New York state has fallen short of the number of people killed by coronavirus in the last 24 hours, a few days after the highest number of deaths.
New York state has fallen below 100 people a day in the last 24 hours during the coronavirus pandemic, the governor said.
A significant drop in deaths in New York state has been reported, a chief executive said.
New York state has been slowed down in the number of deaths from coronavirus during the last 24 hours, according to the governor.

Baseline top-p (=0.7)
New York State has had a significant drop in deaths from coronavirus in the last 24 hours, according to the state governor.
New York state has had a drop in deaths from coronavirus, with the state leadership saying it has a "real progress" to be under 100.
New York state has fallen to the lowest number of deaths from coronavirus- in the last 24 hours, according to the governor.
New York State’s death toll from Covid-19 has fallen below 100, the governor said.
A decrease in deaths from Covid-19 cases in New York State is a sign of "real progress" by a governor, according to the agency CDC.

Guided top-p (σ=1.0)
New York state has fallen below 100 deaths from coronavirus during the last 24 hours, a significant drop from the previous day, the governor said.
New York state has fallen short of the highest death toll during the coronavirus pandemic.
New York state has fallen below 100 deaths in the last 24 hours, a record low for the state’s governor.
New York state has fallen below 100 deaths from coronavirus in the last 24 hours, according to a governor.
New York state has fallen below 100 deaths per day from coronavirus since the height of the outbreak, the governor said.

Table 5: Prediction examples of the XLSum summarization task.

Query The ticket from the Euromillions draw on Tuesday, 3 December, worth £40,957,696.60, was bought somewhere in Dorset, the National Lottery said. The operator urged players to "check, double-check and triple-check" their tickets. The winning numbers were 18, 31, 32, 38 and 48 with 4 and 12 as lucky stars. National Lottery spokesman Patrick Lisoire said it was "not very common" to have such a large unclaimed prize. "People are predisposed to checking the tickets they’ve bought, but for whatever reason - the busy lives we lead, or the lead-up to Christmas - somebody hasn’t quite got round to checking their tickets." He said the exact location of where the winning ticket was bought was not being revealed to protect the winner’s identity. It is the seventh Euromillions jackpot won in the UK this year. The jackpot must be claimed within six months or the prize money, plus all interest generated, will go to help National Lottery-funded projects across the UK. Tickets for Euromillions are sold in nine countries - the UK, France, Spain, Austria, Belgium, Luxembourg, the Irish Republic, Portugal and Switzerland - with ticket-holders in all those countries trying to win a share of the same jackpot each week.

Greedy algorithm
A jackpot worth £40m has been unclaimed by players in the UK, the National Lottery has said. [x5]

Baseline top-p (=0.4)
A jackpot ticket has been unclaimed from a jackpot draw in the UK, the National Lottery has said.
A jackpot worth £40,957,696.60 has been unclaimed by players in the UK, the National Lottery has said.
A jackpot worth £4m has been unclaimed by players in the UK.
A jackpot worth £4.4m has been unclaimed by players in the UK.
A jackpot ticket has been unclaimed in the UK after a lottery ticket was bought in Dorset.

Baseline top-p (=0.7)
A jackpot in the UK has been unclaimed, according to the National Lottery.
A winning ticket from the UK Euromillions jackpot has been unclaimed, according to the National Lottery.
A jackpot from the UK’s largest lottery has been unclaimed, the operator has said.
The jackpot from the Euromillions jackpot draw has been unclaimed after a winning ticket was sold in the UK, the National Lottery has said.
The jackpot earned by a ticket holder has been claimed during the Christmas period.

Guided top-p (σ=1.0)
A jackpot worth £40,957,696.60 has been unclaimed in the UK, the National Lottery has said.
A jackpot worth £40m has been unclaimed in the UK, the National Lottery has said.
A jackpot worth £40,957,696.60 has been unclaimed after a ticket was bought in Dorset, the National Lottery said.
A jackpot worth £4.8m has been unclaimed after a ticket was bought in the UK, the National Lottery said.
A jackpot worth £400,000 has been unclaimed from a ticket bought in Dorset.

Table 6: Prediction examples of the XLSum summarization task.
