Title: SSP: Self-Supervised Post-training for Conversational Search
URL Source: https://arxiv.org/html/2307.00569
Markdown Content: Quan Tu^1, Shen Gao^2, Xiaolong Wu^4, Zhao Cao^4, Ji-Rong Wen^{1,3}, Rui Yan^{1,3}
^1 Gaoling School of Artificial Intelligence, Renmin University of China
^2 School of Computer Science and Technology, Shandong University
^3 Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education
^4 Huawei Poisson Lab
^1 {quantu,jrwen,ruiyan}@ruc.edu.cn, ^2 shengao@pku.edu.cn
^4 {wuxiaolong19, caozhao1}@huawei.com
Abstract
Conversational search has been regarded as the next-generation search paradigm. Constrained by data scarcity, most existing methods distill a well-trained ad-hoc retriever into the conversational retriever. However, these methods, which usually initialize parameters via query reformulation to discover contextualized dependencies, have trouble understanding dialogue structure information and struggle with contextual semantic vanishing. In this paper, we propose Self-Supervised Post-training (SSP), a new post-training paradigm with three self-supervised tasks that efficiently initializes the conversational search model to enhance dialogue structure and contextual semantic understanding. Furthermore, SSP can be plugged into most existing conversational models to boost their performance. To verify the effectiveness of our proposed method, we apply the conversational encoder post-trained by SSP to the conversational search task on two benchmark datasets: CAsT-19 and CAsT-20. Extensive experiments show that our SSP can boost the performance of several existing conversational search methods. Our source code is available at https://github.com/morecry/SSP.
1 Introduction
The past years have witnessed fast progress in ad-hoc search Dai and Callan (2020); Dai et al. (2018); Fujiwara et al. (2013); Gao et al. (2019). However, when confronted with more complicated information needs, traditional ad-hoc search is less competent. Recently, researchers proposed conversational search, a combination of the search engine and the conversational assistant Radlinski and Craswell (2017); Zhang et al. (2018); Kiesel et al. (2021); Trippas et al. (2020); Tu et al. (2022). Different from the keyword-based query in ad-hoc search, the multi-turn natural language utterance is the main interactive form in conversational search. This creates a challenge for developing conversational search systems: existing ad-hoc retrievers and datasets cannot be directly used to derive the conversational query understanding module.
Figure 1: Example of modeling the conversational structure in conversational search. The model should capture structure such as the topic shift at the 3rd utterance and the coreference between the last utterance and a previous one. This information helps the model understand the user's search intent accurately.
In the beginning, researchers reformulated a conversational query into a de-contextualized query, which was then used to perform ad-hoc retrieval Lin et al. (2020b); Mele et al. (2021); Lin et al. (2021b). Recently, conversational dense retrieval models Lin et al. (2021a); Mao et al. (2022) have been presented to directly encode the whole multi-turn conversational context as a vector representation and match it against candidate document representations. Since real-world conversational search corpora are hard to collect, a warm-up step is additionally employed to initialize the conversational representation ability Yu et al. (2021); Dai et al. (2022). These conversational dense retrieval methods have achieved significantly better performance than query reformulation methods and have been widely adopted in conversational search research Yu et al. (2021); Dai et al. (2022). However, these warm-up methods simply apply the same training objective to a large dataset from other domains to initialize the parameters of the conversational encoder, and can hardly capture the structural information of the conversation, which is essential for accurately understanding the user's search intent.
In this paper, we propose Self-Supervised Post-training (SSP) for the conversational search task, as shown in Figure 1. In SSP, we replace the commonly used warm-up step with a new post-training paradigm which contains three novel self-supervised tasks to learn how to capture the structure information and keep contextual semantics. To be more specific, the first self-supervised task is topic segmentation, which learns to decompose the dialogue structure into several segments based on the topic. To tackle the coreference problem, which is ubiquitous in multi-turn conversation modeling, we propose the coreference identification task, which helps the model identify the most likely referred terms in the context and simplifies the intricate dialogue structure. Since understanding and remembering the semantic information in the conversational context is vital for conversational context modeling, we propose the word reconstruction task, which prevents contextual semantic vanishing. To demonstrate the effectiveness of SSP, we equip several existing conversational search methods with SSP and conduct experiments on two benchmark datasets: CAsT-19 Dalton et al. (2020) and CAsT-20 Dalton et al. (2021). Experimental results demonstrate that SSP outperforms all the strong baselines on both datasets.
To sum up, our contributions can be summarized as follows:
- We propose a general and extensible post-training framework to better initialize the conversational context encoder in existing conversational search models.
- We propose three specific self-supervised tasks which help the model capture the conversational structure information and prevent the contextual semantics from vanishing.
- Experiments show that our SSP can boost the performance of strong conversational search methods on two benchmark datasets and achieves state-of-the-art performance.
2 Related Work
Conversational search has become a hot research topic in recent years. The TREC Conversational Assistant Track (CAsT) competition Dietz et al. (2017), which maintains the benchmark, has largely promoted the progress of conversational search. In the beginning, researchers simply viewed conversational search as a query reformulation problem. They suppose that if a context-dependent query can be rewritten into a de-contextualized query based on historical queries, a well-trained ad-hoc retriever can then be used directly to obtain retrieval results. Transformer++ Vakulenko et al. (2021) fine-tunes GPT-2 on the query reformulation dataset CANARD Elgohary et al. (2019) to rewrite queries. QueryRewriter Yu et al. (2020) exploits large amounts of ad-hoc search sessions to build a weakly-supervised query reformulation data generator; the automatically generated data is then used to fine-tune the language model. However, these methods underestimate the value of the context, which contains various latent search intentions and topic information.
After that, the conversational dense retriever was proposed. It directly encodes the full conversation, whose last query denotes the user's real search intention, into a dense representation. ConvDR Yu et al. (2021) forces the contextual representation to mimic the reformulated query representation based on a teacher-student framework, which partially alleviates the conversational search data scarcity problem. Further, COTED Mao et al. (2022) points out that not all queries in the context are useful and devises a curriculum denoising method to inhibit the influence of unnecessary contextual queries. These dense methods additionally perform warm-up on other-domain datasets to initialize the parameters based on their own objectives. However, their warm-up ignores the conversation structure information, which is crucial for capturing the relationships between utterances and understanding the user's search intention. In this respect, we devise a novel Self-Supervised Post-training (SSP) to replace the warm-up, as shown in Figure 2.
Figure 2: The comparison between the training procedure of conversational search with warm-up and the SSP paradigm.
3 Problem Formulation
We assume a multi-turn search conversation $Q=\{q_1,q_2,\dots,q_n\}$, where $q_i=\{x_{i,1},x_{i,2},\dots,x_{i,l_i}\}$ represents the $i$-th question in the conversation and $x_{i,j}$ is the $j$-th token in $q_i$. The last query $q_n$ expresses the user's real search intention.
We insert special tokens $\mathtt{[CLS]}$ and $\mathtt{[SEP]}$ into $Q$, yielding $\{\mathtt{[CLS]},q_1,\mathtt{[SEP]},q_2,\mathtt{[SEP]},\dots,\mathtt{[SEP]},q_n\}$ as the model input, where $\mathtt{[CLS]}$ is the start token and $\mathtt{[SEP]}$ is the separation token that splits the queries. After the concatenation of all queries is fed into the conversational encoder (a transformer-based model), we obtain the last layer's output hidden states $E$. $E_{\mathtt{[CLS]}}$ and $E_{\mathtt{[SEP]}}$ are the representations corresponding to $\mathtt{[CLS]}$ and $\mathtt{[SEP]}$ and will be used in the self-supervised tasks. Our goal is to learn a better contextual representation $E_{\mathtt{[CLS]}}$ in order to accurately retrieve documents from the corpus for the last query $q_n$.
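The input format above can be sketched as follows. This is a minimal illustration of the token-sequence layout, not the authors' preprocessing code; real systems would use a subword tokenizer.

```python
def build_input(queries):
    """Concatenate multi-turn queries into one sequence:
    [CLS] q1 [SEP] q2 [SEP] ... [SEP] qn.

    queries: list of token lists, one per conversation turn.
    """
    tokens = ["[CLS]"]
    for i, q in enumerate(queries):
        if i > 0:
            tokens.append("[SEP]")  # separator between consecutive queries
        tokens.extend(q)
    return tokens

# Example: a two-turn conversation.
print(build_input([["who", "is", "he"], ["where", "born"]]))
```

Each `[SEP]` position later yields one $E_{\mathtt{[SEP]}}$ vector used by the per-utterance predictors.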
4 Self-Supervised Post-training
Figure 3: Overview of SSP. It consists of three self-supervised tasks to conduct post-training of conversational encoder: (1) Topic Segmentation predicts which utterances are the randomly sampled perturbation utterances from other conversation sessions; (2) Coreference Identification predicts which utterance in the conversational context is related to the last utterance; (3) Word Reconstruction uses the conversational context vector representation to reconstruct the Bag-of-Word vector of conversational context.
4.1 Overview
In this section, we propose our Self-Supervised Post-training, abbreviated as SSP. An overview of SSP is shown in Figure 3; it consists of three self-supervised tasks:
- Topic Segmentation Task aims to find the topic-shifting point in the utterances. It helps the model capture the topic structure in the conversational context.
- Coreference Identification Task aims to identify the correlation structure between two referring utterances, which helps the conversational encoder understand coreference relationships and produce better query representations.
- Word Reconstruction Task aims to reconstruct the bag-of-words (BOW) vector of the conversational context from the conversational vector representation. It helps the model avoid contextual semantic vanishing during conversation encoding.
After jointly training the conversational encoder using these three self-supervised tasks, we fine-tune the encoder to the conversational search downstream task using the existing conversational search methods.
4.2 Topic Segmentation Task
When the user interacts with the conversational search system, the focused topic may vary from time to time. Taking the example in Figure 1, the search intention of the user changes according to the retrieval results of previous turns. This causes the topic of the conversation to shift. Since the conversation topic may shift at every utterance, to fully understand a user query, the conversational system should know the current topic of this query and treat the utterances of the current topic as the more salient context. If the conversational encoder cannot identify the boundary of the current topic, it may focus on unrelated utterances and incorporate noisy information into the query representation.
Thus we propose the topic segmentation task to identify the topic boundary of the conversation, which helps the model focus on more related context when encoding the query. We first randomly sample a noise conversational session with several utterances from the training corpus and then concatenate this sampled noise session at the beginning of the raw conversational context. Given the raw search conversation $Q=\{q_1,q_2,\dots,q_n\}$ and the noise conversation $Q'=\{q'_1,q'_2,\dots,q'_m\}$, we truncate the first $k$ queries of $Q'$, where $k$ is sampled from the reciprocal probability distribution $p$, which prevents overly long noise sessions from distorting the raw context:
$$p_k=\frac{1/k}{\sum_{i=1}^{m} 1/i},\quad k=1,2,\dots,m.$$
After concatenating the sampled noise session before the raw context and separating each query with $\mathtt{[SEP]}$, we obtain the perturbed conversation $\check{Q}=\{\mathtt{[CLS]},q'_1,\mathtt{[SEP]},\dots,q'_k,\mathtt{[SEP]},q_1,\mathtt{[SEP]},\dots,q_n\}$ and the ground-truth topic labels $y^t=\{1,\dots,1,0,\dots,0\}$, where queries from the external conversation are labeled 1 and queries from the raw conversation are labeled 0.
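The construction above can be sketched as follows. This is an illustrative sketch under our reading of the paper, not the released implementation; queries are treated as opaque strings.

```python
import random

def sample_prefix_length(m, rng=random):
    """Sample k in 1..m from the reciprocal distribution p_k proportional to 1/k."""
    weights = [1.0 / k for k in range(1, m + 1)]  # random.choices normalizes
    return rng.choices(range(1, m + 1), weights=weights)[0]

def perturb(raw_session, noise_session, k):
    """Prepend the first k noise queries to the raw conversation.

    Returns the perturbed query list and the topic labels y^t
    (1 = query from the noise session, 0 = query from the raw session).
    """
    prefix = noise_session[:k]
    labels = [1] * len(prefix) + [0] * len(raw_session)
    return prefix + raw_session, labels
```

Because $p_k \propto 1/k$, short noise prefixes are sampled more often, so the raw context usually dominates the perturbed input.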
Next, we feed the perturbed conversation $\check{Q}$ into the conversational encoder and obtain its vector representations $\check{E}=\{E_{\mathtt{[CLS]}},E'_1,E_{\mathtt{[SEP]}},\dots,E'_k,E_{\mathtt{[SEP]}},E_1,E_{\mathtt{[SEP]}},\dots,E_n\}$. Finally, each $E_{\mathtt{[SEP]}}$ is sent to the topic predictor (a linear layer) to decide whether the corresponding utterance comes from the sampled noise conversation $Q'$ or not. Binary cross-entropy is used to compute the topic segmentation loss $\mathcal{L}_{TS}$:
$$p(y^t_i=1\mid\check{Q})=\text{Sigmoid}(W_t E_{\mathtt{[SEP]}}+b_t),$$
$$\mathcal{L}_{TS}=-y^t_i\log p(y^t_i=1\mid\check{Q})-(1-y^t_i)\log\left(1-p(y^t_i=1\mid\check{Q})\right),$$
where $W_t\in\mathbf{R}^{h\times 1}$, $b_t\in\mathbf{R}$, and $h$ is the hidden size of the model.
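Numerically, the per-utterance prediction and loss look like the following. This is a toy sketch with hypothetical weights, standing in for the linear topic predictor applied to each $[\mathtt{SEP}]$ vector.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def topic_probability(e_sep, w_t, b_t):
    """Linear layer W_t (shape h x 1) plus bias, squashed by a sigmoid,
    giving p(y_i^t = 1 | perturbed conversation) for one [SEP] vector."""
    return sigmoid(sum(e * w for e, w in zip(e_sep, w_t)) + b_t)

def bce(p, y):
    """Binary cross-entropy for one utterance; L_TS sums this over all
    [SEP] positions."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

With a zero logit the probability is 0.5 and the loss equals $\log 2$ regardless of the label, the usual chance-level baseline for a binary objective.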
4.3 Coreference Identification Task
In conversational search, a common problem is coreference: a pronoun in a query usually refers to a term in its previous queries. Most existing methods do not explicitly train the model to tackle this problem. Here, we devise an auxiliary self-supervised task that trains the model to predict which utterance the last utterance refers to through the coreference relationship. To determine which utterance in the conversational context has a coreference relationship with the last utterance, we use the query reformulation corpus. We compare the last query in $Q$ with the reformulated query $q^*_n$ using set operations to find the reformulation terms $r$ that have been omitted in $Q$:
$$r=\mathcal{S}(\text{tokenize}(q^*_n))-\mathcal{S}(\text{tokenize}(q_n)),$$
where $\mathcal{S}$ is a set operation that converts a sentence into a non-repeating word set, so the reformulation terms $r$ are obtained as the difference between the two sets. Then $r$ is used to locate the referred query, scanning from back to front until the first query containing $r$ is found. We mark the position of the referred query in the label $y^c=\{0,0,\dots,1,\dots,0\}$, whose $i$-th value is 1 only if the $i$-th query is the referred query. Similar to the topic segmentation task (introduced in §4.2), we send $E_{\mathtt{[SEP]}}$ into a coreference predictor to predict the referred query and use binary cross-entropy as the loss function of this task:
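The set-difference heuristic and the back-to-front search can be sketched as follows. Simple lower-cased whitespace splitting stands in for the real tokenizer, and matching "any reformulation term" is our assumption about how containment is checked; both are illustrative choices, not the authors' exact procedure.

```python
def reformulation_terms(rewritten, original):
    """r = S(tokenize(q_n^*)) - S(tokenize(q_n)): terms added by the rewrite."""
    return set(rewritten.lower().split()) - set(original.lower().split())

def referred_query_label(context, terms):
    """Scan the context back to front; mark the first query containing a
    reformulation term as the referred one (one-hot label y^c)."""
    labels = [0] * len(context)
    for i in range(len(context) - 1, -1, -1):
        if terms & set(context[i].lower().split()):
            labels[i] = 1
            break
    return labels

# Example: "he" in the last query resolves to "Obama" from the first turn.
terms = reformulation_terms("where was Obama born", "where was he born")
print(referred_query_label(["who is Obama", "what did he do"], terms))
```

If no query contains any reformulation term, the label stays all zeros, which a training pipeline would need to filter or handle.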
$$p(y^c_i=1\mid Q)=\text{Sigmoid}(W_c E_{\mathtt{[SEP]}}+b_c),$$
$$\mathcal{L}_{CI}=-y^c_i\log p(y^c_i=1\mid Q)-(1-y^c_i)\log\left(1-p(y^c_i=1\mid Q)\right),$$
where $W_c\in\mathbf{R}^{h\times 1}$ and $b_c\in\mathbf{R}$ are trainable parameters. With the coreference identification task, the conversational encoder pays more attention to the most likely referred query in the context when it interprets the last query.
4.4 Word Reconstruction Task
A one-stage conversational retriever encodes the whole query context into a single dense vector. In the previous sections, we used self-supervised tasks to focus on the utterances of the current topic and on the utterance most related through coreference. However, other utterances may also provide useful information for understanding the current search intent. Thus, the conversational encoder should not only gather information from the related utterances but also keep the information from the whole conversational context.
To avoid information vanishing in the final conversational vector representation, we propose a simple but effective reconstruction task that helps the conversational encoder keep the overall semantic information. In this task, we train the model to reconstruct the bag-of-words (BOW) vector of the whole conversation from the representation of $\mathtt{[CLS]}$ produced by the conversational encoder. Specifically, all the words appearing in the context are converted into a BOW vector $y^w$,
$$y^w=\text{BOW}(\mathcal{S}(\text{tokenize}(Q))),$$
where the length of $y^w$ is the vocabulary size and $y^w_i=1$ only if the $i$-th word in the vocabulary appears in the context, otherwise $y^w_i=0$. We use a linear layer after the last layer of the model to process $E_{\mathtt{[CLS]}}$ and optimize the WR loss:
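Building the binary BOW target can be sketched as follows, with a tiny hypothetical vocabulary and whitespace splitting standing in for the real tokenizer.

```python
def bow_vector(context_queries, vocab):
    """Binary bag-of-words target y^w: entry i is 1 iff the i-th vocabulary
    word appears anywhere in the conversational context."""
    words = set()
    for q in context_queries:
        words.update(q.lower().split())  # S(tokenize(Q)): de-duplicated words
    return [1 if w in words else 0 for w in vocab]

# Toy context and vocabulary.
print(bow_vector(["tell me more", "more about lions"], ["lions", "me", "tigers"]))
```

In practice $|V|$ is the encoder's full vocabulary, so $y^w$ is a very sparse vector the linear head must reproduce from the single $[\mathtt{CLS}]$ vector.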
$$\hat{y}^w=\text{Sigmoid}(W_w E_{\mathtt{[CLS]}}+b_w),\quad \mathcal{L}_{WR}=\left\|\hat{y}^w-y^w\right\|_2,$$
where $W_w\in\mathbf{R}^{h\times|V|}$, $b_w\in\mathbf{R}^{|V|}$, $|V|$ is the vocabulary size, and $\|\cdot\|_2$ denotes the Euclidean distance.
4.5 Optimization
Inspired by previous studies Yu et al. (2021); Mao et al. (2022), we also employ a knowledge distillation objective in SSP to accelerate the learning process. Specifically, a pre-trained ad-hoc search encoder TEnc takes the de-contextualized query as input and produces its vector representation. We use TEnc as the teacher model and employ a knowledge distillation loss to train our conversational encoder to mimic the vector representation produced by TEnc. We formulate the knowledge distillation loss $\mathcal{L}_{KD}$ as follows:
$$E^*_{\mathtt{[CLS]}}=\text{TEnc}(\{\mathtt{[CLS]},q^*_n\})_{\mathtt{[CLS]}},\quad \mathcal{L}_{KD}=\left\|E_{\mathtt{[CLS]}}-E^*_{\mathtt{[CLS]}}\right\|_2,$$
where $q^*_n$ is the manually rewritten version of $q_n$ and $(\cdot)_{\mathtt{[CLS]}}$ means taking only the $\mathtt{[CLS]}$ representation from TEnc's last layer output. We make the conversation representation $E_{\mathtt{[CLS]}}$ approximate the reformulated query representation $E^*_{\mathtt{[CLS]}}$ produced by TEnc, distilling its powerful retrieval ability.
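The distillation objective reduces to a Euclidean distance between two vectors, sketched here with toy vectors rather than actual encoder outputs.

```python
import math

def kd_loss(student_cls, teacher_cls):
    """L_KD: Euclidean distance between the student's conversational [CLS]
    vector and the teacher's representation of the rewritten query."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(student_cls, teacher_cls)))

# Identical vectors give zero loss; a (3, 4) offset gives distance 5.
print(kd_loss([1.0, 2.0], [1.0, 2.0]), kd_loss([0.0, 3.0], [4.0, 0.0]))
```

During training only the student encoder receives gradients; the teacher TEnc stays frozen.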
Finally, we combine the training objectives of all self-supervised tasks and optimize all parameters of the conversational encoder:
$$\mathcal{L}_{\text{final}} = \mathcal{L}_{KD} + \alpha\mathcal{L}_{TS} + \beta\mathcal{L}_{CI} + \gamma\mathcal{L}_{WR},$$
where $\mathcal{L}_{\text{final}}$ is the final training objective for SSP, and $\alpha$, $\beta$, and $\gamma$ are hyper-parameters that trade off the self-supervised tasks.
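As a concrete illustration, the distillation term and the combined objective can be sketched in plain Python over embedding vectors; this is a minimal sketch, with the default weights taken from the CAsT-19 settings reported in the implementation details, not the actual training code:

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two embedding vectors (lists of floats)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kd_loss(conv_cls, rewrite_cls):
    # || E_[CLS] - E*_[CLS] ||_2 : pull the conversation embedding toward
    # the embedding of the manually rewritten query produced by the teacher.
    return l2_distance(conv_cls, rewrite_cls)

def final_loss(l_kd, l_ts, l_ci, l_wr, alpha=1e-2, beta=1e-3, gamma=1e-2):
    # L_final = L_KD + alpha * L_TS + beta * L_CI + gamma * L_WR
    return l_kd + alpha * l_ts + beta * l_ci + gamma * l_wr
```

In practice the distances and task losses are computed batch-wise with autograd; the weighting of the three auxiliary tasks is the only SSP-specific hyper-parameter choice.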
5 Experimental Setting
5.1 Datasets
Table 1: Statistics of the test datasets for fine-tuning.

| Statistics | CAsT-19 | CAsT-20 |
| --- | --- | --- |
| Conversations | 50 (20) | 25 (25) |
| Queries | 479 (173) | 216 (208) |
| Avg. Query Tokens | 6.1 | 6.8 |
| Avg. Queries / Conversation | 9.6 | 8.6 |
| Documents | 38M | 38M |
For fine-tuning the conversational encoder on the conversational search task, we choose two few-shot datasets to evaluate our proposed model based on K-fold cross-validation.
CAsT-19 Dalton et al. (2020) is the acronym of the TREC Conversational Assistance Track (CAsT) 2019 benchmark dataset. It is built by human annotators who are required to mimic real dialogues under specified topics, and it contains frequent coreferences, abbreviations, and omissions. In this work, we pay attention to query de-contextualization, but only the test set provides manual oracle de-contextualized queries. Since the queries in the TREC CAsT dataset are used in the conversational search fine-tuning phase, they would cause a data leaking problem. For a fair comparison, we filter the TREC CAsT queries out of QReCC. The statistics of the filtered QReCC dataset are shown in Table 5.
CAsT-20 Dalton et al. (2021) refers to the TREC CAsT 2020 benchmark. Its most obvious modification compared with CAsT-19, where a query only refers to its previous queries, is that a coreference can also appear in the response (a summarized answer of the gold passage). Both the manual response and the automatic response (generated by the neural rewriter of Yu et al. (2020)) are provided in CAsT-20. It contains 216 queries in 25 dialogues, all with de-contextualized queries, and most of the queries have relevance judgments. Additionally, CAsT-20 shares its corpus with CAsT-19. Detailed statistics are shown in Table 1.
5.2 Baselines
Following Mao et al. (2022), we split the baselines into two categories: sparse retrieval methods and dense retrieval methods. Sparse retrieval methods rewrite the contextualized query into a context-independent query and use an ad-hoc sparse retriever to obtain the results. Dense retrieval methods use an ad-hoc dense retriever or directly encode the conversational queries with a conversational dense retriever.
- **Raw** simply uses the last context-independent query in the dense or sparse retriever to retrieve documents.
- **Transformer++** Vakulenko et al. (2021) is a query rewriting method that inherits from GPT-2 Radford et al. (2019) and fine-tunes on the CANARD dataset Elgohary et al. (2019). It then employs the ad-hoc retriever to search with the rewritten query.
- **QueryRewriter** Yu et al. (2020) is a data augmentation method that first generates query reformulation data from large amounts of ad-hoc search sessions using rules and self-supervised learning. The automatically generated data is then used to train the query rewriter.
- **QuReTeC** Voskarides et al. (2020) treats query reformulation as a binary term classification problem: it decides whether each term appearing in the dialogue history should be added to the current-turn query.
- **ContQE** Lin et al. (2021a) employs a well-trained ad-hoc search encoder, TCT-ColBERT Lin et al. (2020a). It uses mean-pooling to obtain the contextual embedding and fine-tunes on pseudo-relevance labels.
- **ConvDR** Yu et al. (2021) develops a few-shot learning method to train the conversational dense retriever. It takes ANCE Xiong et al. (2020) as the teacher model for the conversational student model. Combining the distillation loss and the ranking loss, it achieves strong performance on the few-shot datasets.
- **COTED** Mao et al. (2022) further introduces curriculum denoising to inhibit the unhelpful turns in the context. An additional two-step multi-task learning scheme improves over ConvDR.
- **T5(WikiD+WebD)** Dai et al. (2022) trains on two large automatically generated conversational search datasets, WikiDialog (11.4M dialogues) and WebDialog (8.4M dialogues), starting from a T5-large encoder checkpoint, and is further warmed up on the QReCC dataset. Although it does not fine-tune on CAsT-19 (50 dialogues) or CAsT-20 (25 dialogues), its extremely time-consuming training procedure brings its performance up to a strong level.
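The per-term decision that QuReTeC makes can be sketched as a small filtering step; this is a minimal sketch in which `keep_term` is a hypothetical stand-in for the trained binary term classifier, not QuReTeC's actual model:

```python
def expand_query(current_query: str, history_terms, keep_term):
    """QuReTeC-style expansion: append each dialogue-history term
    the classifier accepts, skipping terms already in the query.
    `keep_term` is a placeholder for the trained binary classifier."""
    present = set(current_query.split())
    added = [t for t in history_terms if keep_term(t) and t not in present]
    return current_query + (" " + " ".join(added) if added else "")
```

The expanded query is then passed unchanged to the ad-hoc retriever, which is what makes this family of methods "sparse plus rewriting" rather than a learned conversational encoder.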
5.3 Evaluation Metrics
Following previous work on conversational search, we evaluate all models with Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain@3 (NDCG@3). MRR takes the reciprocal rank of the first positive sample as the score of a query and averages over all samples; it is a simple yet effective metric for ranking tasks. NDCG@3 weights positive samples by their graded relevance and normalizes the discounted gains of the top 3 results. The statistical significance between two runs is tested with a two-tailed paired t-test and is denoted by † for significance (p ≤ 0.05) and ‡ for strong significance (p ≤ 0.01).
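For reference, the two metrics admit minimal implementations (binarized relevance for MRR, graded relevance for NDCG@3); this sketch follows the standard definitions rather than the official evaluation scripts:

```python
import math

def mrr(ranked_relevances):
    """Reciprocal rank of the first positive passage; 0 if none is positive.
    Input is the binarized relevance of each retrieved passage, in rank order."""
    for rank, rel in enumerate(ranked_relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def ndcg_at_3(ranked_gains, ideal_gains):
    """NDCG@3 with graded relevance; `ideal_gains` holds the relevance
    grades of all judged passages, used to build the ideal ranking."""
    def dcg(gains):
        return sum(g / math.log2(r + 1) for r, g in enumerate(gains[:3], start=1))
    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0
```

Both scores are averaged over all queries in the test set when reported.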
5.4 Implementation Details
Most settings in this work follow ConvDR Yu et al. (2021). We employ the ad-hoc retriever ANCE Xiong et al. (2020) as the teacher module to calculate the knowledge distillation loss. Following previous conversational search work, for CAsT-19 we concatenate the historical queries and the current query as the model input, and for CAsT-20 we additionally take the historical responses into account. The leading words of the conversational context are truncated if the concatenation exceeds the maximum length, which is 256 for CAsT-19 and 512 for CAsT-20. We implement the experiments with PyTorch and the Transformers library on an NVIDIA A40 GPU. The Adam optimizer is used with a learning rate of 2e-5 and a batch size of 64 for CAsT-19 and 32 for CAsT-20. Our model is post-trained for 2 epochs and then fine-tuned on the conversational search corpus. The self-supervised task weights α, β, and γ are set to 1e-2, 1e-3, 1e-2 for CAsT-19 and 1e-1, 2e-3, 2e-2 for CAsT-20. We use faiss Johnson et al. (2019) to index the passages, whose representations are generated by ANCE and kept fixed. Following the official TREC Conversational Assistance evaluation setting, we treat relevance scale ≥ 2 as positive for CAsT-19 and relevance scale ≥ 1 for CAsT-20, and obtain our results with the official evaluation scripts.
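Since the passage embeddings are fixed, retrieval at inference time is a maximum-inner-product search over the ANCE index; this brute-force numpy sketch computes the same result a flat faiss inner-product index would, with faiss only accelerating the scan:

```python
import numpy as np

def retrieve(query_emb, passage_embs, k=3):
    """Return indices and scores of the top-k passages by inner product.
    `passage_embs` has shape (num_passages, dim); it is produced once by
    the frozen ANCE encoder, so only the query encoder is ever trained."""
    scores = passage_embs @ query_emb      # (num_passages,)
    topk = np.argsort(-scores)[:k]         # highest scores first
    return topk, scores[topk]
```

This is why SSP only needs to post-train and fine-tune the conversational query encoder: the document side of the index never changes.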
6 Evaluation Result
6.1 Overall Performance
Table 2: Conversational search performance comparison. ⋆ denotes our implementation. † (‡) indicates (strong) significant improvement over ConvDR with p ≤ 0.05 (p ≤ 0.01).
| Search | Method | CAsT-19 MRR | CAsT-19 NDCG@3 | CAsT-20 MRR | CAsT-20 NDCG@3 |
| --- | --- | --- | --- | --- | --- |
| Sparse | Raw | 0.322 | 0.134 | 0.160 | 0.101 |
| Sparse | Transformer++ | 0.557 | 0.267 | 0.162 | 0.100 |
| Sparse | QueryRewriter | 0.581 | 0.277 | 0.250 | 0.159 |
| Sparse | QuReTeC | 0.605 | 0.338 | 0.262 | 0.171 |
| Dense | Raw | 0.420 | 0.247 | 0.234 | 0.150 |
| Dense | Transformer++ | 0.696 | 0.441 | 0.296 | 0.185 |
| Dense | QueryRewriter | 0.665 | 0.409 | 0.375 | 0.255 |
| Dense | QuReTeC | 0.709 | 0.443 | 0.430 | 0.287 |
| Dense | ContQE | - | 0.499 | - | 0.312 |
| Dense | T5(WikiD+WebD) | 0.741 | - | 0.513 | - |
| Dense | COTED | 0.769 | 0.478 | 0.491 | 0.342 |
| Dense | COTED⋆ | 0.758 | 0.475 | 0.481 | 0.321 |
| Dense | COTED-SSP | 0.760 | 0.478 | 0.501 | 0.351 |
| Dense | ConvDR | 0.740 | 0.466 | 0.501 | 0.340 |
| Dense | ConvDR-SSP | 0.780† | 0.480 | 0.526‡ | 0.365‡ |
We compare our model with all baselines in Table 2. The sparse methods generally achieve less satisfying performance than the dense conversational methods, which demonstrates that the dense methods better understand the search intent of users. Our model performs consistently better than the other sparse and dense conversational search models on both datasets, with improvements over COTED of 1.4% in MRR and 0.4% in NDCG@3 on CAsT-19, and of 7.1% and 6.7% respectively on CAsT-20. This demonstrates that our proposed self-supervised tasks provide a more useful training signal for the conversational encoder module than the simple parameter warm-up used in previous methods.
In Table 2, ContQE outperforms ConvDR-SSP on CAsT-19 in terms of NDCG@3. A possible reason, as illustrated by Mao et al. (2022), is that ContQE introduces a stronger query encoder, TCT-ColBERT Lin et al. (2020a), and trains its conversational encoder with a multi-stage method. In contrast to the complexity of the multi-stage method, our SSP boosts the performance of existing conversational search models in an end-to-end manner, which is easier to train and deploy in real-world applications. We leave adapting the stronger encoder TCT-ColBERT to the post-training paradigm to future work.
To verify the generalization ability of SSP, we equip two strong conversational search methods (COTED and ConvDR) with our proposed Self-Supervised Post-training, which provides a better conversational context encoder. Comparing COTED with COTED-SSP and ConvDR with ConvDR-SSP, we find that our proposed post-training paradigm adapts to different conversational search models and boosts their performance, which demonstrates the effectiveness and generalization ability of SSP.
6.2 Ablation Study
Table 3: Comparison between ablation models.
| Method | CAsT-19 MRR | CAsT-19 NDCG@3 | CAsT-20 MRR | CAsT-20 NDCG@3 |
| --- | --- | --- | --- | --- |
| ConvDR-SSP | 0.780 | 0.480 | 0.526 | 0.365 |
| w/o. TS | 0.753 | 0.473 | 0.513 | 0.355 |
| w/o. CI | 0.749 | 0.472 | 0.515 | 0.351 |
| w/o. WR | 0.757 | 0.476 | 0.512 | 0.357 |
We remove each self-supervised task to analyze the effectiveness of each component; TS is the acronym for topic segmentation, CI denotes coreference identification, and WR denotes word reconstruction. The performance of the ablation models is shown in Table 3. All of the ablation models perform worse than the full model ConvDR-SSP, which demonstrates the contribution of each self-supervised task in SSP.
We ablate the topic segmentation task in ConvDR-SSP w/o. TS and observe a decline in search performance. The topic segmentation task helps the model identify topic boundaries in long sessions and pay more attention to the utterances of the related topics. This raises retrieval performance by 3.6% and 2.5% in terms of MRR on the CAsT-19 and CAsT-20 datasets respectively. In ConvDR-SSP w/o. CI, we remove the coreference identification self-supervised task, and the performance of this ablation model drops dramatically, which demonstrates that it plays the most important role in SSP. ConvDR-SSP achieves 4.1% and 1.7% MRR increments over ConvDR-SSP w/o. CI on the CAsT-19 and CAsT-20 datasets. We also remove the word reconstruction task, yielding ConvDR-SSP w/o. WR, and the dropped score shows that it is effective for keeping the contextual semantics in the context representation. All of our self-supervised tasks, which provide extra supervision signals to understand the dialogue structure and prevent semantic vanishing, help ConvDR-SSP achieve the best performance.
6.3 Robustness of Topic Segmentation
Figure 4: Robustness evaluation by adding different numbers of off-topic utterances. We randomly sample irrelevant utterances from other search sessions and evaluate the results of ConvDR and ConvDR-SSP.
Table 4: Retrieved examples of ConvDR-SSP and ConvDR. We present the historical queries, the current query, the manual reformulation query (Ref), and the first passage on which the two methods disagree.
**CAsT Topic-31**
Queries: What is throat cancer? Is it treatable? Tell me about lung cancer. What are its symptoms? Can it spread to the throat? What causes throat cancer? What is the first sign of it? Is it the same as esophageal cancer? (current)
Ref: Is throat cancer the same as esophageal cancer?
ConvDR: "There are two main types of esophageal cancer: squamous cell cancer and adenocarcinoma of the esophagus. Squamous cell cancer occurs most commonly in African Americans as well as people who smoke cigarettes…"
ConvDR-SSP: "In fact, some people diagnosed with throat cancer are diagnosed with esophageal, lung, or bladder cancer at the same time. This is typically because cancers often have the same risk factors, or because cancer that begins in one part of the body can spread throughout the body…"

**CAsT Topic-58**
Queries: What is a real-time database? How does it differ from traditional ones? What are the advantages of real-time processing? What are examples of important ones? (current)
Ref: What are examples of important real-time databases?
ConvDR: "Examples of what the database describes."
ConvDR-SSP: "A real-time database is a database system which uses real-time processing to handle workloads whose state is constantly changing. This differs from traditional databases containing persistent data. For example…"

**CAsT Topic-59**
Queries: Which weekend sports have the most injuries? What are the most common types of injuries? What is the ACL? What is an injury for it? Tell me about the RICE method. Is there disagreement about it? What is arnica used for? What are some ways to avoid injury? (current)
Ref: What are some ways to avoid sports injuries?
ConvDR: "To help recover from minor injuries, overexertion or surgery, Arnica is a must for every medicine cabinet. Whether you are an active baby boomer…"
ConvDR-SSP: "Injury Prevention Basics. It's always better to prevent an injury than to recovery from one, so learning and following basic injury prevention advice is step one. The best way to avoid injuries is to be prepared for your sport, both physically and mentally. Don't succumb to the weekend warrior syndrome…"
To verify the effectiveness of the topic segmentation in our method, we conduct an experiment that concatenates different numbers of randomly sampled utterances to the beginning of the current conversation session, using ConvDR as the baseline. Figure 4 shows the search performance of our SSP and ConvDR with different numbers of randomly sampled noise utterances as input. Our SSP is more robust as more randomly sampled utterances are concatenated: while the performance of ConvDR drops dramatically, ConvDR-SSP drops only slightly at first and then remains stable. The reason is that our model can identify the topic segmentation boundary and reduce the impact of unrelated utterances when encoding the current conversational query. This demonstrates that topic segmentation helps the model focus on the utterances of relevant topics.
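The noise-injection setup described above can be sketched as follows; `add_offtopic_noise` is a hypothetical helper illustrating the procedure, not the paper's actual evaluation code:

```python
import random

def add_offtopic_noise(session, other_sessions, n_noise, seed=0):
    """Prepend n_noise utterances sampled (without replacement) from
    unrelated sessions to the current session, mimicking the robustness
    test: the model must ignore the injected off-topic prefix."""
    rng = random.Random(seed)
    pool = [u for s in other_sessions for u in s]
    noise = rng.sample(pool, n_noise)
    return noise + list(session)
```

Sweeping `n_noise` from 0 upward and re-running retrieval reproduces the kind of curve shown in Figure 4: a topic-aware encoder should degrade slowly as the off-topic prefix grows.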