Title: Preserving Multilingual Quality While Tuning Query Encoder on English Only
URL Source: https://arxiv.org/html/2407.00923
Published Time: Tue, 08 Apr 2025 00:45:29 GMT
Markdown Content: Oleg Vasilyev, Randy Sawaya, John Bohannon
Primer Technologies Inc.
San Francisco, California
oleg,randy.sawaya,john@primer.ai
Abstract
A query encoder of a dual passage retrieval system can be tuned for specific types of queries or domains, while the precomputed and stored document representations are kept intact. Switching from one query encoder to another when needed is easily feasible, unlike overhauling the embeddings of a whole knowledge base. In this work we raise a question: Can the generic, original qualities of the encoder be preserved, or at least not degraded too much, when it is tuned on a narrow domain? We conducted experiments on a high quality multilingual embedding model: tuning it on a single English-only dataset, we observe that the tuning not only preserves the multilingual qualities, but even improves them. The embedding qualities on distinctly different data are also improved, or at least preserved. Drawing on our observations, we suggest a more general hypothesis: Tuning with an intentionally low learning rate can preserve or improve a system’s properties acquired in training, but not specifically targeted by tuning. We call this adiabatic tuning and provide tentative explanations.
1 Introduction
Advances in neural NLP methods have resulted in high quality dense vector text representations Reimers and Gurevych (2019); Cer et al. (2018); Conneau et al. (2017). Such representations are often used at the initial stages of an information retrieval system, selecting the most relevant documents, ranked relative to the query Xiong et al. (2020); Zhan et al. (2020, 2021); Ren et al. (2021b). A dual encoder is successfully used to train the representations Karpukhin et al. (2020); Ren et al. (2021a); Qu et al. (2021); Hofstätter et al. (2021); Ni et al. (2022); Dong et al. (2022). A dual encoder dense passage retrieval system is efficient for two main reasons: (1) it allows using the simple inner product of query and document representations, and (2) it allows modifying the query representation for a task or domain, while keeping the stored and precomputed (query-invariant) document representations intact.
If the representation was pretrained in a multilingual setting, tuning on English-only samples may be expected to degrade the multilingual qualities, and there may not be enough cross-lingual samples for tuning on a specific domain or type of queries. A multilingual query generator may be employed to overcome a shortage of cross-lingual data Ren et al. (2022); Zhuang et al. (2023), but in this work we follow an arguably simpler strategy. In order to understand the effect of English-only tuning on the multilingual qualities of a representation, and to assess a possible degradation, we consider a simple setup: a state-of-the-art multilingual embedding model is taken as the starting point, and fine-tuned on English-only samples as the query part of a dual encoder.
We assume that our observations of the degradation or preservation of the multilingual qualities may be generalized to other pretrained system qualities that are not directly targeted in tuning. In order to obtain preliminary confirmation of this hypothesis, we also observe the effect of tuning on the embedding quality for queries and text chunks of very different styles, the likes of which could be present in the training of the original encoder, but certainly not targeted in tuning.
Our contribution:
- 1. We show that fine-tuning a query encoder on an English-only dataset may not only preserve the multilingual qualities of query-document embedding matching, but even improve them.
- 2. We hypothesize that a tuning regime with an intentionally low learning rate (far below what is necessary to avoid overfitting) preserves or improves the properties acquired in training, but not targeted by tuning. We call this adiabatic tuning and suggest supporting observations and conjectural explanations.
- 3. We contribute a dataset with graded difficulty, based on ARXIV titles and abstracts.
Although high-resource languages can be used for cross-lingual transfer Lin et al. (2019), our setting does not have such a goal: the tuning is set to improve the query part of a dual encoder on a certain dataset, with no driving mechanism for preserving or improving the other qualities of the system.
Our starting point is one of the best (for its lean size) multilingual embedding models, which differs from starting with a multilingual language model and then aligning the generated embeddings for different languages Wang et al. (2022).
2 Setup
2.1 Models
In what follows, we use the state-of-the-art multilingual model intfloat/multilingual-e5-small (https://huggingface.co/intfloat/multilingual-e5-small) Wang et al. (2024b), which will be referred to here as E5. For most of the evaluations, we also consider results using sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) Reimers and Gurevych (2019), referred to as L12. Finally, we confirm some observations with the monolingual intfloat/e5-small-v2 (https://huggingface.co/intfloat/e5-small-v2) Wang et al. (2024a), referred to as E5e. All these models provide embeddings of a practical small size of 384.
2.2 Datasets
We use MSMARCO Nguyen et al. (2018) Triplets (https://huggingface.co/datasets/sentence-transformers/embedding-training-data/blob/main/msmarco-triplets.jsonl.gz) for tuning and evaluation. For evaluating the qualities not targeted by tuning, we use the ARXIV dataset with negatives (https://huggingface.co/datasets/primer-ai/arxiv-negatives), which we made from arxiv (version 173) (https://huggingface.co/datasets/arxiv-community/arxiv_dataset, https://www.kaggle.com/datasets/Cornell-University/arxiv), and the test subset of the XNLI multilingual dataset (https://huggingface.co/datasets/facebook/xnli) Conneau et al. (2018). We also use HOTPOTQA (https://hotpotqa.github.io/) Yang et al. (2018) and SQUAD (https://huggingface.co/datasets/rajpurkar/squad_v2) Rajpurkar et al. (2018, 2016) for confirming some observations (Appendices C, D).
Our test subset of MSMARCO contains 357642 evaluation triplets, made from 7000 samples: all the positives and negatives are used (Appendix A).
Of ARXIV we use titles and abstracts. We made two flavors of evaluation arxiv triplets: (1) arxiv-title, where the title plays the role of the query (anchor) and the corresponding abstract is a positive passage, and (2) arxiv-first, where the first sentence of the abstract is used as the query and the rest of it is used as a positive (Appendix B). We also use narrow versions of arxiv-first in Appendix K.
2.3 Tuning and evaluations
Unless otherwise specified, we freeze the text encoder and fine-tune only the query encoder (fully or partially unfrozen) by contrastive learning on MSMARCO (or on narrow ARXIV subsets, Appendix K), with a learning rate of 5e-8, a batch size of 14 and the triplet margin loss with margin 0.1. Other details are in Appendix E. In our experiments we considered different settings of freezing, batch size, learning rate, triplet loss margin, stopping criterion, weight decay, scheduling versions and optimizers.
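The triplet margin loss used in this tuning can be sketched in plain Python (a minimal illustration over raw embedding vectors; in practice the loss is computed over batches by the training framework, and the function names here are ours):

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=0.1):
    """Penalize a triplet when the positive is not closer to the
    anchor than the negative by at least `margin`."""
    return max(0.0, l2_distance(anchor, positive)
                    - l2_distance(anchor, negative) + margin)
```

For example, with a positive at distance 1.0 and a negative at distance 1.05 from the anchor, the loss is 0.05; once the negative is pushed beyond distance 1.1, the loss vanishes.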
In most of our evaluations, we compare the similarity (or distance) between the anchor (query) and the positive vs the negative. If the positive does not turn out to be closer than the negative to the anchor, we count this as an error. We thus characterize the performance of the encoder on a query by the number of errors divided by the total number of positive-negative pairs. We call this the positive-negative discrepancy (PND). The measure is easy to interpret, and its range (from 0 to 1) is the same and equally fair for any number of positives and negatives, as long as both exist in a selection for a query. On multiple queries we take the averaged PND. We confirm some results also using mean reciprocal rank (MRR), mean average precision (MAP) and precision at top 1 (P@1). The improvement of performance is measured as the relative change of a measure M (PND, MRR or other):
I = s (M̃ − M) / M        (1)
where M is the measure for the original encoder, and M̃ is for the encoder after the tuning. The sign s = −1 for PND, because it decreases when performance improves, and s = 1 for the other measures.
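PND and the improvement measure of Eq. (1) can be sketched as follows (the function names are ours):

```python
def pnd(pos_sims, neg_sims):
    """Positive-negative discrepancy for one query: the fraction of
    positive-negative pairs where the positive is NOT more similar
    to the query than the negative."""
    errors = sum(1 for p in pos_sims for n in neg_sims if p <= n)
    return errors / (len(pos_sims) * len(neg_sims))

def improvement(m_orig, m_tuned, sign=1):
    """Relative improvement I = s * (M_tuned - M_orig) / M_orig,
    with sign = -1 for PND (lower is better) and +1 otherwise."""
    return sign * (m_tuned - m_orig) / m_orig
```

For instance, two positives with similarities 0.9 and 0.5 against two negatives with 0.6 and 0.4 give one erroneous pair out of four, i.e. PND = 0.25; reducing PND to 0.2 is a 20% improvement.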
For evaluating on XNLI we use its pairs of sentences, each sentence given in 15 languages (Appendix F). One sentence is used as a query, the other as a passage. All pairs are human-labeled as entailment, neutral or contradiction. Hence, the sentences of an entailment pair should be closer to each other than the sentences of any neutral or contradiction pair. Whenever this does not happen, we count it as an error for PND. In Appendix G we verify that the number of errors the original encoder makes on our datasets is large enough to observe how tuning would affect them.
3 Observations
Table 1: Evaluations of the E5 query model tuned on MSMARCO as described in Section 2.3. The rows are ordered by increasing freezing (at tuning): from no freezing (top row) to freezing everything up to the last transformer block B11. The emb.base model has only the first three layers of the embedding block frozen (tokens, positions, token types). The emb model has the full embedding block frozen. For the other notation: B0 is the full first transformer block; B0-5 are the first 6 blocks; the extensions a, i, od (for B0) denote the block layers attention, intermediate and output.dense. The columns c% and d% show the PND improvement (in percent) relative to the original model, assessed by cosine (c) or distance (d), grayed if not significant (Appendix I). The columns c+/- and d+/- show counts of language pairs with PND significantly improved (+) or worsened (−).
3.1 Tuning partially frozen query model
In Table 1 we show results of tuning the dual encoder, with the text encoder frozen and the query model free or partially frozen. Here and throughout the paper we use the easiest version of ARXIV (see Appendix H on performance at other levels). Freezing the embedding block appears to be the best option for preserving the multilingual qualities, and henceforth it is used unless specified otherwise. In Table 2 we confirm the improvement on six other datasets (Appendices A, C, D), and show some other measures.
Table 2: Improvements for E5 tuned with a frozen embedding block and learning rate 5e-8.
The multilingual qualities are not only preserved, but mostly even improved, especially on cosine similarity. The PND improvement is shown for each language pair separately in Figure 1. The results for the L12 model are similar (Appendix J). In Appendix K we also confirm our observations with E5 tuned on specific categories of ARXIV.
Figure 1: Improvement of E5 on XNLI assessed by cosine. Query is on the Y axis; text is on the X axis.
3.2 Learning rate and adiabatic tuning
Figure 2: Evaluations on (a) XNLI and (b) the English-only datasets (MSMARCO and ARXIV) of the E5 query encoder tuned with a frozen embedding block, batch size 14, margin 0.1 using different learning rates. Values that did not pass the two-tailed test are shown with open markers.
Increasing the tuning learning rate delivers more gains on MSMARCO, while eventually reducing gains on XNLI and even ARXIV. The improvement of PND on MSMARCO and ARXIV is shown in Figure 2(b); the number of language pairs improved and degraded is in Figure 2(a). Appendix L contains the corresponding plots (Figure 11) for the fully tuned E5 dual encoder, and for the L12 and E5e models. It is interesting that the E5e model, despite not being multilingual, still improves its rudimentary multilingual qualities more than it degrades them. The effects of other tuning parameters are described in Appendix M. For example, the square-root batch size scaling rule works better than linear scaling.
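The square-root scaling rule mentioned above can be sketched as (a common heuristic; the base values below are illustrative, not prescriptions from the paper):

```python
import math

def sqrt_scaled_lr(base_lr, base_batch_size, batch_size):
    """Square-root batch-size scaling: the learning rate grows with
    the square root of the batch-size ratio, not linearly with it."""
    return base_lr * math.sqrt(batch_size / base_batch_size)
```

For example, going from batch size 14 to 56 doubles the learning rate under this rule, instead of quadrupling it as linear scaling would.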
If we consider XNLI and ARXIV as indicators of how well a model keeps its learned skills while improving on narrow goals (e.g. MSMARCO), then our observations suggest there may be a slow tuning regime in which the model preserves or even improves the existing skills that are at least somewhat related to the new goal. We call this adiabatic tuning, in analogy to the slow process in quantum mechanics (a system starting in an eigenstate is kept in the same, evolving eigenstate). For E5 the learning rates between 2e-8 and 6e-8 may be considered the best.
Our tentative explanation of adiabatic tuning is as follows. At low tuning learning rates, the system (the encoder weights) remains in the ’minimum’ region found during pretraining. This ’minimum’ region is probably a wide well with uneven ground; the pretraining happened to terminate at some point inside the well. During tuning, the pretraining weight-space of the twin encoder becomes just another surface in a family of surfaces, because of the added dimensions (the difference between the weights of the two encoders). We assume that, due to continuity, the ’minimum’ region, even if reshaped, remains a well as the query encoder weights drift away from the weights of the text encoder. Within this well, improvements of all qualities related to the former pretraining loss may still be correlated. But if, at a high learning rate, the model is strongly modified at some iteration (i.e. by backpropagation on a particular batch), then it may move away from the well.
3.3 Extending adiabatic tuning range
Figure 3: Evaluations of the E5 query encoder tuned with a frozen embedding block and all layers output.dense.weight frozen, with batch size 14 and margin 0.1, using different learning rates, on (a) XNLI and (b) the English-only datasets (MSMARCO and ARXIV). Values that did not pass the two-tailed test are shown with open markers.
From the evaluation results in Figure 2 we may consider learning rates below 7e-8 (but above 1e-8) as safely suitable for adiabatic tuning. But we know this only because we evaluated the tuned models on the out-of-tuning domains ARXIV and XNLI.
Is there any way to know the upper boundary without extensive evaluation data? Could there be an empirical recommendation not to exceed a certain learning rate? Can we increase the learning rate range of adiabatic tuning?
In attempting to answer these questions, we considered the largest changes in the layers at different learning rates. One suspect layer, by simple crude measures, is output.dense.weight. In Appendix M.3, Tables 14 and 15, we show the most changing layers and the blocks to which they belong. Our motivation here is based on a simple and crude criterion; more detailed research and understanding may reveal better ways to extend the adiabatic tuning regime.
The gains from tuning with the layer output.dense.weight frozen (in each transformer block) are shown in Figure 3. In comparison to the default tuning (Figure 2), we can see that the adiabatic regime indeed extends from a learning rate of about 6e-8 (as in Figure 2) to about 1.3e-7. Thus, freezing output.dense.weight did help to somewhat extend the adiabatic tuning regime. However, this did not improve the gains, and a further increase of the learning rate results in worse deterioration for the version with the frozen output.dense.weight layer, as can be seen for XNLI starting from the rate 1.4e-7.
Another way of trying to stay longer in the original ’minimum’ region during tuning could be to reduce the inertia of the optimizer. We present a simple attempt in Appendix M.8, but the results are mixed.
4 Conclusion
We considered tuning the query part of a dual encoder, starting from a high quality multilingual embedding model and using English-only samples in the tuning. We found that the multilingual qualities are quite stable in many tuning scenarios, and can be not only preserved but even improved. We explain this by speculating that most of the transformer, except the embedding block, depends only weakly on the particular languages. We think of this as a particular case of a general pattern: tuning a certain model quality, if done carefully enough (adiabatic tuning), can also retain or even improve the related qualities not targeted by tuning. This allows a resource-light adjustment of multilingual embeddings for a specific query type or domain, even a narrow one (Appendix K).
Limitations
Our considerations here are limited to starting with a single high quality multilingual embedding model, and tuning it (on English-only samples) as a query encoder. While this setup is good for our understanding and convenient for adjusting an existing model, it would be natural to follow this up by considering a pre-trained multilingual dual encoder which is already asymmetric from the start.
For our illustration we used the state-of-the-art multilingual model intfloat/multilingual-e5-small and, for comparison, repeated the same observations for the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model. We also repeated some of the observations on the monolingual model intfloat/e5-small-v2; the tuning improved its rudimentary multilingual properties as well. Still, to gain a better understanding of the observed behaviors, it would be interesting to investigate more multilingual models.
We considered tuning the query encoder on English-only samples, and found that such tuning can “pull up” the quality of other languages too. Choosing another language for tuning would be interesting both for understanding and as a practical scenario.
We used MSMARCO triplets for tuning; we also verified some observations for models tuned on ARXIV-based subsets limited to a category (math, physics or cs, Appendix K). For evaluation we used a set-aside part of the MSMARCO triplets, ARXIV in two variations, and XNLI. The motivation was that the MSMARCO evaluation part must show improvement (after tuning), ARXIV must verify the robustness of the improvement on a very different kind of texts (jargon-heavy), and XNLI must reveal the effect of the English-only driven improvement on the multilingual qualities. We also confirmed the tuning gains on SQUAD and HotpotQA (both of which are quite different from MSMARCO). That said, the evaluations can be extended to even more datasets.
More research could be helpful in understanding and identifying the range of adiabatic tuning.
References
- Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. arXiv, arXiv:1803.11175.
- Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
- Conneau et al. (2018) Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
- Dong et al. (2022) Zhe Dong, Jianmo Ni, Dan Bikel, Enrique Alfonseca, Yuan Wang, Chen Qu, and Imed Zitouni. 2022. Exploring dual encoder architectures for question answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9414–9419, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Goyal et al. (2018) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2018. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv, arXiv:1706.02677.
- Hoffer et al. (2018) Elad Hoffer, Itay Hubara, and Daniel Soudry. 2018. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv, arXiv:1705.08741.
- Hofstätter et al. (2021) Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. 2021. Efficiently teaching an effective dense retriever with balanced topic aware sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, page 113–122, New York, NY, USA. Association for Computing Machinery.
- Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
- Krizhevsky (2014) Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. arXiv, arXiv:1404.5997.
- Lin et al. (2019) Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135, Florence, Italy. Association for Computational Linguistics.
- Nguyen et al. (2018) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2018. MS MARCO: A human generated MAchine Reading COmprehension dataset. arXiv, arXiv:1611.09268.
- Ni et al. (2022) Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. 2022. Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9844–9855, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Qu et al. (2021) Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2021. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5835–5847, Online. Association for Computational Linguistics.
- Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- Ren et al. (2022) Houxing Ren, Linjun Shou, Ning Wu, Ming Gong, and Daxin Jiang. 2022. Empowering dual-encoder with query generator for cross-lingual dense retrieval. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3107–3121, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Ren et al. (2021a) Ruiyang Ren, Shangwen Lv, Yingqi Qu, Jing Liu, Wayne Xin Zhao, QiaoQiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. 2021a. PAIR: Leveraging passage-centric similarity relation for improving dense passage retrieval. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2173–2183, Online. Association for Computational Linguistics.
- Ren et al. (2021b) Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, QiaoQiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. 2021b. RocketQAv2: A joint training method for dense passage retrieval and passage re-ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2825–2835, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Wang et al. (2024a) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2024a. Text embeddings by weakly-supervised contrastive pre-training. arXiv, arXiv:2212.03533.
- Wang et al. (2024b) Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024b. Multilingual E5 text embeddings: A technical report. arXiv, arXiv:2402.05672.
- Wang et al. (2022) Yau-Shian Wang, Ashley Wu, and Graham Neubig. 2022. English contrastive learning can learn universal cross-lingual sentence embeddings. arXiv, arXiv:2211.06127.
- Xiong et al. (2020) Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv, arXiv:2007.00808.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
- Zhan et al. (2021) Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2021. Optimizing dense retrieval model training with hard negatives. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, page 1503–1512, New York, NY, USA. Association for Computing Machinery.
- Zhan et al. (2020) Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma. 2020. RepBERT: Contextualized text embeddings for first-stage retrieval. arXiv, arXiv:2006.15498.
- Zhuang et al. (2023) Shengyao Zhuang, Linjun Shou, and Guido Zuccon. 2023. Augmenting passage representations with query generation for enhanced cross-lingual dense retrieval. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, page 1827–1832, New York, NY, USA. Association for Computing Machinery.
Appendix A Usage of MSMARCO Triplets
The MSMARCO dataset consists of 499184 samples, each sample being a tuple (query, positives, negatives). The “positives” are correct answers to the query, and the “negatives” are semantically similar but incorrect answers. For most samples there is only one positive, but many negatives. For tuning we simply select the very first positive and the very first negative. Thus, each sample gives one triplet (anchor, positive, negative) for contrastive learning, where the query is taken as the anchor.
We keep the first 487983 samples (or 34856 batches if each batch is 14 triplets) for tuning, leaving the next 4200 samples (300 batches) for validation, and the last 7000 samples for evaluation. During evaluation we create all possible triplets from the 7000 samples, using all positives and negatives; this makes 357642 evaluation triplets.
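The two ways of turning a sample into triplets, for tuning and for evaluation, can be sketched as follows (the dict keys are our assumptions about the sample layout):

```python
def tuning_triplet(sample):
    """For tuning: one triplet per sample, from the first positive
    and the first negative."""
    return (sample["query"], sample["positives"][0], sample["negatives"][0])

def evaluation_triplets(sample):
    """For evaluation: all possible triplets, pairing every positive
    with every negative of the sample."""
    return [(sample["query"], p, n)
            for p in sample["positives"]
            for n in sample["negatives"]]
```

A sample with 2 positives and 3 negatives thus contributes one triplet to tuning but 6 triplets to evaluation.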
Almost half of the MSMARCO samples have the maximal number of negatives (65), and for the evaluation shown in Table 2 we use a more difficult version, ’MSMARCO 65 negatives’, with all samples having fewer than 65 negatives filtered out.
Appendix B ARXIV Dataset for Triplets
B.1 Dataset arxiv-negatives
Of ARXIV we use titles and abstracts. In order to have a representative subset of a manageable size for our evaluations, we select all samples that have at least one category with a maximum size of 10K samples. For example, the arxiv category bayes-an is the smallest (size 16) in our snapshot (version 173), meaning that there were only 16 arxiv preprints in this category.
We made two flavors of evaluation arxiv triplets from this arxiv subset. In the first version, the anchor is the title, the positive is the corresponding abstract, and the negative is another random abstract. In the second version the anchor is the first sentence of the ’positive’ abstract, the positive is the rest of the abstract, and the negative is a similar piece (first sentence excluded) of the ’negative’ abstract.
We make use of triplets created from arxiv because this provides our evaluation with a very different kind of text (compared to MSMARCO), and thus allows us to judge the robustness of the improvement. For convenience and reproducibility of creating triplets of different levels of difficulty, we made the dataset arxiv-negatives (https://huggingface.co/datasets/primer-ai/arxiv-negatives).
The dataset consists of 253140 samples; each sample is a tuple of two elements:
- 1. Metadata of an ARXIV paper, including its Id, title, abstract and categories.
- 2. A list of 21 Ids of other ARXIV papers. The first 20 Ids are of the papers ’closest’ to the above paper, sorted from the most to the least similar; the 21st Id is of a randomly selected paper (not coinciding with the Id of the above paper).
Thus, we have 21 versions of picking negatives for triplets, from the most difficult to the easiest (the last one, with random selection).
For example, to create triplets of difficulty 14, for each paper given by the first tuple element we pick the paper corresponding to the 14th Id in the second tuple element. From the first paper we can create the query and positive, and from the second paper, the negative. Throughout this work we used two flavors:
- 1.‘Title’: The title of the first paper acts as the query and its abstract as the positive; the negative is then the abstract of the second paper.
- 2.‘First’: The query is the first sentence of the abstract of the first paper; the positive is the rest of the abstract; the negative is the abstract of the second paper, with its first sentence deleted.
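The two flavors above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the field names (`title`, `abstract`) and the `papers_by_id` lookup are assumptions about how the dataset tuples are held in memory.

```python
def make_triplet(sample, papers_by_id, difficulty, flavor="title"):
    """Build one (anchor, positive, negative) triplet from an arxiv-negatives sample.

    sample: (paper_metadata, list_of_21_ids); difficulty: 1 (hardest) .. 21 (random).
    Field names are illustrative, not the dataset's actual schema.
    """
    paper, neighbor_ids = sample
    negative_paper = papers_by_id[neighbor_ids[difficulty - 1]]
    if flavor == "title":
        # 'Title' flavor: title -> own abstract vs. other paper's abstract.
        return paper["title"], paper["abstract"], negative_paper["abstract"]
    # 'First' flavor would instead split abstracts on the first sentence.
    raise NotImplementedError(flavor)
```

For the ‘First’ flavor the same function would split each abstract at its first sentence, which requires a sentence splitter and is omitted here.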
B.2 How is it created?
The above dataset is created from the mirror of arxiv (version 173) arxiv-metadata-oai-snapshot.jsonl through the following steps:
- 1. Identified all arxiv categories containing at most 10K papers (i.e. arxiv preprints).
- 2.Selected all papers that have at least one of the categories identified above. This is the subset of arxiv to deal with: manageably small, yet diverse.
- 3. For each paper: (1) Sort its categories by size, from smaller to larger. (2) Find all other papers that have the closest match by categories (the closest match is the longest consecutive list of matched categories, starting from the first one). (3) Of the papers found, select the 20 closest by Jensen-Shannon distance between the paragraphs, and sort them by that distance. If there were fewer than 20 papers, fill to 20 by repeating the last one. (4) Add a randomly selected paper as the 21st.
Of the total 253140 samples, in 213156 samples (84.2%) all of the first 20 negatives are different (which means that at least 20 papers happen to have the same closest match by categories).
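Step 3 above ranks candidate negatives by Jensen-Shannon distance. A minimal sketch of such a distance over word-count distributions is given below; the exact tokenization and distribution used by the authors are not specified, so whitespace word counts here are an assumption.

```python
import math
from collections import Counter

def js_distance(text_a, text_b):
    """Jensen-Shannon distance (base-2) between the word distributions of two texts.

    Symmetric and bounded in [0, 1]; a sketch of the similarity in step 3,
    assuming simple whitespace tokenization.
    """
    pa, pb = Counter(text_a.split()), Counter(text_b.split())
    na, nb = sum(pa.values()), sum(pb.values())
    divergence = 0.0
    for word in set(pa) | set(pb):
        p, q = pa[word] / na, pb[word] / nb
        m = 0.5 * (p + q)  # mixture distribution
        if p:
            divergence += 0.5 * p * math.log2(p / m)
        if q:
            divergence += 0.5 * q * math.log2(q / m)
    return math.sqrt(divergence)  # distance = sqrt of the JS divergence
```

Sorting candidates by this value, ascending, yields the "closest first" ordering used for the 20 hard negatives.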
Appendix C SQUAD
For using the SQUAD dataset, we identified (for each query) the sentences of the given paragraph containing an answer to the query as positives, and the rest of the sentences as negatives. We kept only samples having at least 1 positive and 1 negative. On average there are 1.3 positives and 4.2 negatives per query. For the evaluation shown in Table 2 we combined the train, validation and test subsets. The results are also given for a version called ‘SQUAD min 5’, in which we filtered out queries that had fewer than 5 candidate sentences.
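The labeling and filtering just described can be sketched as follows. This is an illustration, assuming the paragraph has already been split into sentences and that "containing an answer" is a simple substring check:

```python
def split_squad_sample(sentences, answer):
    """Label paragraph sentences as positives (contain the answer) or negatives.

    Returns (positives, negatives), or None if the sample lacks at least
    one of each and should be dropped. Substring matching is an assumption.
    """
    positives = [s for s in sentences if answer in s]
    negatives = [s for s in sentences if answer not in s]
    if positives and negatives:
        return positives, negatives
    return None  # filtered out, as in Appendix C
```

The ‘SQUAD min 5’ variant would additionally require `len(sentences) >= 5` before keeping the sample.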
Appendix D HotpotQA
For using HotpotQA, we combined its train and dev subsets. For each query (‘question’), both subsets contain on average 9.95 passages, of which 2 are always positives. For the evaluation shown in Table 2 we filtered out queries that had fewer than 10 passages, and split the dataset into ‘easy’, ‘medium’ and ‘hard’ subsets according to the HotpotQA difficulty labels of the samples.
Appendix E Tuning
Unless specified otherwise, we tune a dual encoder by contrastive learning in the following simple regime:
- 1.The text encoder is fully frozen; the frozen parts of the query encoder are specified.
- 2. The batch size is 14, the learning rate is 5e-8 and the contrastive learning margin is 0.1. The loss is the triplet margin loss.
- 3. There are 1000 batches per epoch, i.e. 14000 samples per epoch.
- 4. Stopping occurs after 10 consecutive non-improvement epochs. The improvement is measured on the validation subset after each epoch. The model is considered improved if (on the validation subset) both the loss and the count of errors have decreased.
- 5.The AdamW optimizer is used.
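Two pieces of this regime are easy to state precisely: the per-triplet loss (item 2) and the stopping rule (item 4). Below is a framework-free sketch of both; the function names and the `(val_loss, val_errors)` history format are our own illustration, not the authors' code.

```python
def triplet_margin_loss(d_pos, d_neg, margin=0.1):
    """Triplet margin loss on precomputed anchor-positive and anchor-negative
    distances: zero once the positive is closer than the negative by `margin`."""
    return max(0.0, d_pos - d_neg + margin)

def should_stop(history, patience=10):
    """Early-stopping rule of item 4: stop after `patience` consecutive idle
    epochs, where an epoch improves only if BOTH the validation loss and the
    validation error count decreased. `history` is a list of
    (val_loss, val_errors) tuples, one per epoch."""
    best_loss, best_errors = history[0]
    idle = 0
    for loss, errors in history[1:]:
        if loss < best_loss and errors < best_errors:
            best_loss, best_errors = loss, errors
            idle = 0
        else:
            idle += 1
        if idle >= patience:
            return True
    return False
```

Requiring both metrics to decrease makes the criterion stricter than loss-only validation, which helps avoid stopping on noisy loss improvements.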
Changing this default regime is considered in Appendixes L and M.
Appendix F XNLI
The XNLI dataset consists of pairs of sentences human-labeled as entailment, neutral or contradiction. The test subset (which we use) contains 1670 pairs for each of these labels, and each sentence is presented in 15 languages: [’ar’, ’bg’, ’de’, ’el’, ’en’, ’es’, ’fr’, ’hi’, ’ru’, ’sw’, ’th’, ’tr’, ’ur’, ’vi’, ’zh’]. We use 225 versions of the pairs, because each sentence of a pair can be in any of the 15 languages. At evaluation, the first sentence serves as the query (its embedding is computed by the query model) and the second as the text. We expect the sentences of an entailment pair to be closer to each other than the sentences of any neutral pair, or of any contradiction pair. Whenever this does not happen, we count it as an error.
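The error counting just described compares every entailment pair against every neutral (or contradiction) pair. A minimal sketch over precomputed within-pair similarities; whether a tie counts as an error is not specified in the text, so treating ties as errors here is an assumption:

```python
def count_xnli_errors(entail_sims, other_sims):
    """Count errors: one per (entailment, other) combination where the
    entailment pair's sentences are not strictly closer (similarity not
    higher) than the other pair's sentences. Ties counted as errors
    is our assumption."""
    return sum(1 for e in entail_sims for o in other_sims if e <= o)
```

With 1670 pairs per label this yields 1670x1670 = 2788900 comparisons per language pair, matching the totals quoted in Appendix G.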
Appendix G Performance of Untuned Query Encoder
To establish a baseline before any fine-tuning, and to ensure our evaluation is not too easy, we measure the errors of the original E5 model on the data described in Section 2.3 and show the results in Table 3. We also measure the errors of L12 and of E5e, a more recent monolingual (English) model.
Table 3: The count of errors for the original untuned models E5, L12 and E5e, on the datasets noted in the first column: MM - MSMARCO test 7000 samples (357642 triplets, see Section 2.2 and Appendix A); ARX-F - arxiv-first, the arxiv subset with the abstract’s first sentence as the anchor; ARX-T - arxiv-title, the arxiv subset with the title as the anchor; XNLI - XNLI test subset, providing 1670x1670=2788900 comparisons of entailment pairs vs neutral pairs (and the same number of entailment pairs vs contradiction pairs). For XNLI the errors are averaged over 225 (15x15) language-language versions, and shown as a percent of Ntotal. The evaluation is done using cosine similarity or euclidean distance similarity (cos or dist in the second column).
The count of errors on the triplets (MSMARCO, ARXIV) is straightforward: it is an error when a positive is not closer than a negative to the anchor of the triplet. On XNLI we sum the error count over all language-language pairs and divide the sum by the number (225=15x15) of such pairs. This averaged error is shown as a percentage of the total (2788900) comparisons; each comparison here is either a comparison of an entailment-labeled sample with a neutral-labeled sample (entail-neutral in the table) or with a contradiction-labeled sample (entail-contr in the table). An error was counted whenever the sentences of an entailment sample happened to be farther from each other than the sentences of a neutral (or contradiction) sample. PND is shown separately for each pair of languages in Figures 4 and 5 for the cosine similarity measure. The distance measure gives results that are visually almost indistinguishable.
Figure 4: PND of embedding models on XNLI entailment-neutral comparisons assessed by cosine.
Figure 5: PND of embedding models on XNLI entailment-contradiction comparisons assessed by cosine.
The number of errors in Table 3 and in Figures 4 and 5 is reasonable enough to consider how tuning would affect them. The smallest counts are the counts of positive-negative discrepancies of E5 and E5e on ARX-T (apparently, a title makes an easier ‘query’ than the first sentence of an abstract). These counts are 309 and 420 for the cosine similarity (row ARX-T PND (cos)), and 453 and 580 for the distance similarity (row ARX-T PND (dist)).
Notice that L12 has far worse PND on English data (MSMARCO and ARXIV). The English-only model E5e, as expected, performs worse than the multilingual models E5 and L12 on multilingual XNLI, but its PND is still far below 50%, because there is much similarity between some of the languages.
Appendix H Gains on ARXIV for Different Levels of Difficulty
Figure 6: Errors and improvements on the arxiv-negatives dataset at different levels of difficulty. The “easiest” dataset is the random selection of negatives used in evaluations throughout this work. In (a), we show the fraction of errors made by the original E5 model (for comparison, see Table 3). In (b), we show the improvement after tuning the query encoder on MSMARCO with ‘default’ settings, i.e. learning rate 5e-8, batch size 14, margin 0.1 and frozen embedding block. Values that did not pass the two-tailed test (Appendix I) are shown with open markers.
Throughout the paper we used the easiest version of triplets in the arxiv-negatives dataset, the version that uses randomly selected negatives. Here, in Figure 6, we show for comparison the fraction of errors made by the original untuned E5 embeddings at the other levels of difficulty, and also the corresponding improvements (by Equation 1) after tuning the query encoder on MSMARCO with a frozen embedding block and our default settings (Section 2.3). The statistical significance of the improvements in Figure 6 is estimated as explained in Appendix I.
Triplets with intentionally close negatives are much harder, but Figure 6 still shows that performance on ARXIV was mostly improved. We used the easiest triplet version for our evaluations throughout the paper because it indicated the trends in the improvements more distinctly.
Appendix I Significance Test
In Table 1, Figure 2 and throughout the paper we use the two-proportion Z-test, pooled for H_0: p_0 = p_1. We compare the original number of errors n_0 and the improved number n_1, given the total N (the totals can be seen in Table 3); the total is the same for the original and improved versions. We deem the difference significant if |Z| > Z_c, where
Z = (p_1 - p_0) / √(2P(1-P)/N)   (2)
with p_0 = n_0/N, p_1 = n_1/N and P = (n_0 + n_1)/(2N). We used Z_c = 1.96, the critical value corresponding to probability 0.975.
Notice that in our examples the values of N are typically very large. The improvements we report, according to Equation 1, are relative rather than absolute values.
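The significance test above is straightforward to compute; the following sketch implements the standard pooled two-proportion Z statistic with equal totals, consistent with the definitions of p_0, p_1 and P in this appendix:

```python
import math

def two_proportion_z(n0, n1, N):
    """Pooled two-proportion Z statistic for error counts n0 (original)
    and n1 (tuned), both out of the same total N."""
    p0, p1 = n0 / N, n1 / N
    P = (n0 + n1) / (2 * N)  # pooled proportion under H0: p0 = p1
    return (p1 - p0) / math.sqrt(2 * P * (1 - P) / N)

def is_significant(n0, n1, N, z_crit=1.96):
    """Two-tailed test at the 0.05 level (|Z| > 1.96)."""
    return abs(two_proportion_z(n0, n1, N)) > z_crit
```

With N in the hundreds of thousands, even sub-percent changes in the error rate typically pass this test.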
Appendix J Encoder L12 with Frozen Layers
Table 4 shows the results of tuning with some of the L12 layers frozen. It is analogous to Table 1 for E5. And, as with E5, freezing everything except the embedding resulted in negligible changes to the query encoder (not shown in the table).
The changes in cross-lingual qualities corresponding to the third row (emb, frozen embedding block) of Table 4 are shown in comparison with the E5 and E5e embeddings in Figures 7 and 8. Note that E5e is not a multilingual embedding model. Having a worse start as a multilingual embedding model, E5e also gets much weaker improvements in its multilingual qualities; this is consistent with our understanding of adiabatic tuning (Section 3.2).
Table 4: Evaluations of the L12 query model tuned on MSMARCO as described in Section 2.3. The notation is as in Table 1.
Figure 7: Improvement of E5, L12 and E5e on XNLI entailment-neutral comparisons assessed by cosine.
Figure 8: Improvement of E5, L12 and E5e on XNLI entailment-contradiction comparisons assessed by cosine.
Appendix K Narrow-Domain Query Encoder
So far we have observed that tuning the query encoder on data of a certain style (the MSMARCO dataset) can preserve (or even improve) the encoder qualities not targeted by the tuning task, especially when tuning with a frozen embedding layer and a low learning rate. Here we provide observations using more specialized datasets, based on arxiv-first (described in Section 2.2 and Appendix B):
- 1.ARXIV-math: uses only documents with at least one category which has the prefix "math."
- 2.ARXIV-physics: As above, but with "physics." as the prefix
- 3.ARXIV-cs: As above, but with "cs." as the prefix
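The three subsets above amount to a simple filter on category prefixes. A minimal sketch, with the `categories` field name assumed:

```python
def has_category_prefix(paper, prefix):
    """True if the paper has at least one category starting with `prefix`
    (e.g. "math.", "physics.", "cs."), as used to build the narrow subsets."""
    return any(cat.startswith(prefix) for cat in paper["categories"])

def narrow_subset(papers, prefix):
    """Select the narrow-domain subset, e.g. ARXIV-math for prefix "math."."""
    return [p for p in papers if has_category_prefix(p, prefix)]
```

Note that the evaluation described below excludes the tuning category from the evaluation data, which would be the complementary filter.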
E5 tuned on these narrow datasets using our ‘default’ regime (Section 2.3) with a frozen embedding block mostly improves PND (the positive-negative discrepancy fraction), as shown in Table 5. The improvements of these narrow-tuned encoders on individual language pairs, assessed by cosine, are shown in Figures 9 and 10.
Table 5: Evaluations of the E5 query encoder tuned on ARXIV-math, ARXIV-physics or ARXIV-cs with a frozen embedding block, batch size 14, margin 0.1 and learning rate 5e-8. When evaluating on ARXIV (columns arxiv-first and arxiv-title), the samples with the model’s category (first column) are excluded from the evaluation data.
Figure 9: Improvement of narrow-tuned encoders on XNLI entailment-neutral comparisons assessed by cosine.
Figure 10: Improvement of narrow-tuned encoders on XNLI entailment-contradiction comparisons assessed by cosine.
Appendix L Learning Rate
In Figure 2 we showed how the improvements of the E5 model depend on the learning rate. Here, in Figure 11, we compare similar data for L12 and E5e, as well as for a particular instance of E5 in which both the query and text encoders are tuned (as two independent encoders with the same starting point), with the embedding block frozen in both encoders. The data confirm that while higher learning rates are not yet overtuning and still give higher gains on the test subset (of MSMARCO), it is the lower learning rates that better preserve, and even improve, the pretrained qualities that are not the goal of tuning.
Figure 11: Improvement of various models and tuning configurations on the English-only datasets (MSMARCO and ARXIV) in the left column and XNLI in the right column. Values that did not pass the two-tailed test (Appendix I) are shown with open markers. (a) Evaluations of the E5-full dual encoder after both encoders were tuned with a frozen embedding block, batch size 14 and margin 0.1. (b) Evaluations of the L12 query encoder tuned with a frozen embedding block, batch size 14 and margin 0.1. (c) Evaluations of the E5e query encoder tuned with a frozen embedding block, batch size 14 and margin 0.1.
Appendix M Tuning Regime
M.1 Learning rate and batch size
M.1.1 Scaling rule
Table 6: Evaluations of the E5 query encoder tuned with a frozen embedding block, margin 0.1 and 14000 samples per epoch. Linear scaling rule of learning rate with batch size. Values that did not pass the two-tailed test are shown in gray.
Table 7: Evaluations of the E5 query encoder tuned with a frozen embedding block, margin 0.1 and 14000 samples per epoch. Square root scaling rule of learning rate with batch size. Values that did not pass the two-tailed test are shown in gray.
The learning rate is usually set with consideration of the batch size; it can be proportional to the batch size (linear scaling rule) or to the square root of the batch size (square root scaling rule) Krizhevsky (2014); Goyal et al. (2018); Hoffer et al. (2018). We show the evaluation results for these scaling rules in Tables 6 and 7. While there are no essential wins in scaling the batch size and learning rate up or down, the square root rule seems more reasonable, keeping the evaluation results approximately the same as the batch size increases.
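Both scaling rules can be stated in one line each; the sketch below uses our default pair (batch 14, learning rate 5e-8) as the reference point:

```python
def scaled_lr(base_lr, base_batch, batch, rule="sqrt"):
    """Learning rate under the linear or square-root scaling rule,
    relative to a reference (base_batch, base_lr) pair."""
    ratio = batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")
```

For example, going from batch 14 to batch 56 (a 4x increase), the linear rule gives 2e-7 while the square root rule gives 1e-7.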
Regardless of the overall behavior when scaling the batch size and learning rate together, we have to verify that our default batch size of 14 is a good fit for our default learning rate of 5e-8. For this reason, a simple change of batch size, without altering the learning rate, is considered in Appendix M.1.2; Tables 8 and 9 show that our ‘default’ batch size is reasonable. The corresponding data for L12 are in Appendix M.1.3.
M.1.2 Encoder E5 and the batch size
Table 8: Evaluations of the E5 query encoder tuned with a frozen embedding block, learning rate 5e-8, margin 0.1 and different batch sizes (first column); 14000 samples per epoch. Values that did not pass the two-tailed test are shown in gray.
Table 9: Evaluations of the E5 query encoder tuned with a frozen embedding block, learning rate 5e-8, margin 0.1 and different batch sizes (first column); 1000 batches per epoch. Values that did not pass the two-tailed test are shown in gray.
In Table 8 we show results for batch sizes 7, 14, 28, 56 and 112, while keeping the number of samples per epoch the same (14000). The row with batch 14 coincides with the values for learning rate 5e-8 in Figure 2, and with the row for the frozen embedding block in Table 1. The results for all batch sizes are similar. Tuning with the higher batch size of 112 is a bit ‘safer’ for languages, not degrading any language pair when evaluated by the cosine measure, and degrading only one language pair (for entailment vs. contradiction) when evaluated by the distance measure. This comes at the price of lower gains on MSMARCO and ARXIV.
Table 10: Evaluations of the L12 query encoder tuned with a frozen embedding block, learning rate 5e-8, margin 0.1 and different batch sizes (first column); 14000 samples per epoch. Values that did not pass the two-tailed test are shown in gray.
Table 11: Evaluations of the L12 query encoder tuned with a frozen embedding block, learning rate 5e-8, margin 0.1 and different batch sizes (first column); 1000 batches per epoch. Values that did not pass the two-tailed test are shown in gray.
Table 9 shows what happens if the number of batches per epoch (1000) is kept the same, rather than the number of samples. In this setting the larger batch size of 112 leads to less frequent validation (on the MSMARCO validation subset) during tuning and, effectively, to later and less reasonable stopping. This results in higher gains on the MSMARCO test subset, but in far worse results on ARXIV and XNLI.
M.1.3 Encoder L12 and the batch size
The dependency of tuning L12 on the batch size is shown in Table 10 (14000 samples per epoch) and in Table 11 (1000 batches per epoch). The observations are somewhat similar to those for E5 (Appendix M.1.2), except that L12 generally does not perform as well as E5, and a batch size of 7 turns out to be bad for L12.
M.2 Weight decay
Table 12: Evaluations of the E5 query encoder tuned with a frozen embedding block, learning rate 1e-7, batch size 56, margin 0.1 and a range of weight decay (first column). Values that did not pass the two-tailed test are shown in gray.
Table 13: Evaluations of the E5 query encoder tuned with a frozen embedding block, learning rate 5e-8, batch size 14, margin 0.1 and a range of weight decay (first column). Values that did not pass the two-tailed test are shown in gray.
Weight decay may restrict the growth of model weights, but it does not improve the evaluation results. We show some representative results in Tables 12 and 13. While restricting gains on the tuning goal, weight decay does not help preserve the other qualities: the results on XNLI and ARXIV are no better than without weight decay. If there is any recipe for further improving the gains both on the tuning goal and on the related qualities, it has to be a less crude intervention in the tuning.
Since weight decay may be more effective at higher learning rates, the parameters for Table 12 are chosen with a higher learning rate and batch size than our ‘default’ choice, which is used in Table 13. The learning rates and batch sizes of these tables are related by the square root scaling rule (see Section M.1.1).
M.3 Candidate layers for freezing
In Section 3.3 we showed how the adiabatic tuning range is extended when the layer output.dense.weight is frozen (in all blocks). The reason for suspecting that this layer is the most responsible for breaking out of the original ‘minimum’ region is that its maximal weight becomes the highest among all layers as the learning rate approaches the end of the adiabatic range: see Table 14. The maximal relative change of the weights is also achieved by the layer output.dense.weight: see Table 15.
It is a crude adjustment, and freezing this layer in all blocks is probably overkill, but it did help us extend the adiabatic range (Section 3.3).
Table 14: The two ‘most changed’ layers at each learning rate. The ‘change’ is defined as the maximal weight of the layer, if it was changed by the tuning. The prefix ‘encoder.layer’ is removed from the layer names.
Table 15: The two ‘most changed’ layers at each learning rate. The ‘change’ is defined as (W_t - W_o)/(W_t + W_o), where W_t is the maximal weight of the layer in the tuned query encoder and W_o is the maximal weight of the layer in the original (untuned) encoder. The prefix ‘encoder.layer’ is removed from the layer names.
M.4 Margin of triple loss
When using the triplet loss for contrastive learning, the margin is an important parameter that can significantly affect model training. In Figure 12 we show the dependency of the evaluation results on the margin used during tuning. We keep our default tuning parameters (Section 2.3) but change the margin. The results are not unexpected: a margin up to 0.15 is reasonable, while at higher margins the disturbance of the cross-lingual evaluation, and eventually of the English data evaluation, becomes too strong.
Figure 12: Evaluations of the E5 query encoder tuned with a frozen embedding block, learning rate 5e-8 and batch size 14, using different triplet loss margins, on (a) XNLI and (b) the English-only datasets (MSMARCO and ARXIV). Values that did not pass the two-tailed test are shown with open markers.
The corresponding data for L12 are given in Figure 13. They show that a margin of 0.1 works distinctly best for L12. Altogether, L12 appears more sensitive than E5 to the tuning parameters if the goal is to preserve performance on multilingual XNLI data and on out-of-domain ARXIV data. Arguably, a margin of approximately 0.1 is best for both L12 and E5.
Figure 13: Evaluations of the L12 query encoder tuned with a frozen embedding block, learning rate 5e-8 and batch size 14, using different triplet loss margins, on (a) XNLI and (b) the English-only datasets (MSMARCO and ARXIV). Values that did not pass the two-tailed test are shown with open markers.
M.5 Stopping criterion
In Table 16 we show how the improvement depends on the stopping criterion. Stopping after 5 or 10 non-improvement epochs gives similar results. Stopping after 15 non-improvement epochs continues the trend of increased gain on English data, but with deterioration on a few language pairs.
Table 16: Evaluations of the E5 query encoder tuned with a frozen embedding block, learning rate 5e-8, batch size 14 and triplet loss margin 0.1, stopped after different numbers of idle epochs (first column). An epoch is idle if no improvement is made. Values that did not pass the two-tailed test are shown in gray.
M.6 Execution time
There is no essential difference between the execution times for E5 and L12. The tuning time depends on how soon stopping happens. At the settings of interest (Sections 2.3, 3.1 and 3.2), tuning on an A100 GPU takes about one hour. For example, tuning 10 times at the default settings (Section 2.3, Appendix E) for rates between 1e-8 and 1e-7 takes 9 hours. At higher rates, stopping occurs earlier; tuning 10 times for rates between 1.1e-7 and 2e-7 takes less than 5 hours. Table 1 (with freezing different parts of the encoder) was obtained in 6 hours.
Evaluation of an encoder on all datasets we used (MSMARCO, ARXIV-first, ARXIV-title and XNLI) takes about 1.2-1.3 hours.
M.7 Effects of learning rate scheduler and weight decay
| B | Sch | D | msmarco c% | msmarco d% | arxiv-first c% | arxiv-first d% | arxiv-title c% | arxiv-title d% | ent-neutr c +/- | ent-neutr d +/- | ent-contr c +/- | ent-contr d +/- |
|---|-----|---|---|---|---|---|---|---|---|---|---|---|
| 100 | Q | - | 1.83 | 2.80 | 0.99 | 1.23 | -0.26 | 0.96 | 200/1 | 172/3 | 69/71 | 43/110 |
| 100 | E_0.98 | 1e-6 | -0.41 | -1.07 | -0.29 | -0.83 | -0.26 | -0.96 | 0/63 | 0/39 | 0/20 | 1/8 |
| 64 | - | - | 1.58 | 2.28 | 0.78 | 1.07 | 0.78 | 1.20 | 178/1 | 156/3 | 72/50 | 48/87 |
| 64 | Q | - | 0.05 | 0.19 | -0.34 | -0.59 | -0.26 | 0.00 | 0/0 | 0/0 | 0/0 | 0/1 |
| 64 | E_0.98 | 1e-6 | 0.13 | 0.17 | -0.23 | -0.32 | -0.26 | 0.00 | 0/0 | 0/0 | 0/0 | 0/1 |
| 64 | E_0.95 | 1e-6 | -0.75 | -1.47 | -0.62 | -0.78 | -0.52 | -0.48 | 0/67 | 0/48 | 0/66 | 1/42 |
| 64 | E_0.95 | 1e-5 | -0.75 | -1.47 | -0.62 | -0.78 | -0.52 | -0.48 | 0/67 | 0/48 | 0/66 | 1/42 |
| 64 | - | 1e-4 | 1.58 | 2.28 | 0.78 | 1.07 | 0.78 | 1.20 | 178/1 | 156/3 | 72/50 | 48/87 |
| 64 | L | 1e-4 | 0.30 | 0.55 | -0.31 | -0.11 | -0.78 | -0.48 | 0/0 | 0/0 | 0/4 | 0/9 |
| 32 | Q | 1e-4 | 0.12 | -0.17 | -0.29 | -0.43 | 0.26 | 0.00 | 0/0 | 0/0 | 0/0 | 0/0 |
| 32 | L | 1e-4 | -0.05 | -0.21 | -0.18 | -0.19 | 0.78 | 0.00 | 0/0 | 0/0 | 0/0 | 0/0 |
| 32 | E_0.98 | 1e-4 | 0.27 | 0.46 | -0.21 | -0.16 | 0.52 | 0.24 | 26/0 | 39/0 | 0/0 | 0/0 |
| 16 | L | - | 0.00 | -0.15 | -0.10 | -0.56 | 0.78 | 0.24 | 0/0 | 0/0 | 4/0 | 10/0 |
| 16 | E_0.95 | - | -0.70 | -1.21 | -0.65 | -0.78 | 0.26 | -0.96 | 0/53 | 0/32 | 7/6 | 20/0 |
| 16 | E_0.95 | 1e-4 | -0.70 | -1.21 | -0.65 | -0.78 | 0.26 | -0.96 | 0/53 | 0/32 | 7/6 | 20/0 |
| 8 | Q | 1e-6 | -0.81 | -1.39 | -0.57 | -1.15 | -1.30 | -2.39 | 0/167 | 0/123 | 0/121 | 2/81 |
| 8 | E_0.95 | 1e-6 | -2.98 | -4.31 | -2.34 | -2.73 | -2.60 | -6.22 | 0/220 | 0/208 | 2/187 | 15/128 |
| 8 | - | 1e-5 | -0.81 | -1.43 | -0.78 | -1.18 | -1.30 | -2.87 | 0/165 | 0/126 | 0/115 | 3/68 |
| 8 | Q | 1e-4 | -0.81 | -1.39 | -0.57 | -1.15 | -1.30 | -2.39 | 0/167 | 0/123 | 0/121 | 2/81 |
| 8 | E_0.95 | 1e-4 | -2.98 | -4.31 | -2.34 | -2.73 | -2.60 | -6.22 | 0/220 | 0/208 | 2/187 | 15/128 |
Table 17: Percentage improvement over the fine-tuned E5 model with a frozen embedding block, tuned using a batch size of 14, learning rate 5e-8 and a margin of 0.1. The blue colors indicate the top improvements whereas the red colors indicate the worst degradations. Three parameters are randomly varied: the batch size (denoted “B”), the learning rate scheduler (denoted “Sch”) and the weight decay (denoted “D”). The learning rate schedulers are defined in Table 19, with an initial learning rate of 5e-8. c% and d% refer to measuring the similarity of the text pairs using cosine similarity or euclidean distance, respectively. For XNLI, (+) indicates the number of language pairs that improved while (-) indicates those that worsened, out of a total of 225 language pairs. Note that only the statistically significant (determined by a Z-test) language pairs are retained, hence the improved/worsened counts do not all sum to 225. Additionally, ent-neutr refers to entailment-entailment similarities compared with entailment-neutral similarities, whereas ent-contr refers to comparisons against entailment-contradiction similarities.
| B | Sch | D | msmarco c% | msmarco d% | arxiv-first c% | arxiv-first d% | arxiv-title c% | arxiv-title d% | ent-neutr c +/- | ent-neutr d +/- | ent-contr c +/- | ent-contr d +/- |
|---|-----|---|---|---|---|---|---|---|---|---|---|---|
| 100 | Q | - | -1.12 | -1.17 | -0.82 | -0.14 | -0.52 | -1.71 | 0/138 | 3/117 | 15/65 | 61/40 |
| 100 | E_0.98 | 1e-6 | -0.31 | -0.18 | -0.39 | -0.05 | -0.52 | -1.22 | 0/0 | 0/0 | 0/11 | 0/7 |
| 64 | - | - | -1.20 | -1.05 | -0.79 | -0.11 | -0.26 | -0.73 | 0/57 | 5/34 | 1/80 | 9/47 |
| 64 | Q | - | -0.72 | -0.89 | -0.66 | -0.14 | -0.26 | -1.71 | 0/65 | 4/43 | 1/62 | 21/33 |
| 64 | E_0.98 | 1e-6 | -0.96 | -1.13 | -0.66 | -0.49 | -0.52 | -1.22 | 0/80 | 6/56 | 3/65 | 29/35 |
| 64 | E_0.95 | 1e-6 | -1.78 | -2.36 | -0.97 | -1.35 | 0.78 | -1.46 | 1/160 | 9/131 | 22/103 | 73/68 |
| 64 | E_0.95 | 1e-5 | -1.78 | -2.36 | -0.97 | -1.35 | 0.78 | -1.46 | 1/160 | 9/131 | 22/103 | 73/68 |
| 64 | - | 1e-4 | -1.20 | -1.05 | -0.79 | -0.11 | -0.26 | -0.73 | 0/57 | 5/34 | 1/80 | 9/47 |
| 64 | L | 1e-4 | -0.76 | -0.80 | -0.58 | -0.16 | -0.26 | -1.46 | 0/61 | 3/32 | 0/58 | 14/32 |
| 32 | Q | 1e-4 | -2.12 | -3.35 | -1.47 | -1.60 | 0.00 | -2.68 | 2/190 | 8/162 | 41/107 | 93/69 |
| 32 | L | 1e-4 | -2.05 | -3.36 | -1.45 | -1.33 | 0.00 | -2.68 | 2/189 | 8/158 | 41/102 | 94/68 |
| 32 | E_0.98 | 1e-4 | -2.12 | -3.54 | -1.42 | -1.84 | 0.00 | -2.68 | 2/192 | 8/163 | 41/110 | 93/71 |
| 16 | L | - | -0.02 | 0.13 | -0.58 | -0.65 | -1.04 | -1.22 | 0/0 | 0/0 | 11/0 | 43/0 |
| 16 | E_0.95 | - | -2.52 | -3.32 | -1.32 | -1.38 | -0.26 | -2.20 | 3/186 | 8/164 | 49/86 | 101/55 |
| 16 | E_0.95 | 1e-4 | -2.52 | -3.32 | -1.32 | -1.38 | -0.26 | -2.20 | 3/186 | 8/164 | 49/86 | 101/55 |
| 8 | Q | 1e-6 | -2.05 | -3.29 | -1.11 | -1.38 | -0.26 | -3.66 | 0/212 | 3/182 | 21/140 | 70/105 |
| 8 | E_0.95 | 1e-6 | -2.44 | -3.81 | -1.55 | -1.78 | -0.52 | -3.41 | 0/213 | 4/182 | 24/135 | 76/97 |
| 8 | - | 1e-5 | -2.03 | -3.19 | -0.74 | -1.57 | -0.26 | -3.41 | 0/209 | 3/182 | 20/139 | 71/103 |
| 8 | Q | 1e-4 | -2.05 | -3.29 | -1.11 | -1.38 | -0.26 | -3.66 | 0/212 | 3/182 | 21/140 | 70/105 |
| 8 | E_0.95 | 1e-4 | -2.44 | -3.81 | -1.55 | -1.78 | -0.52 | -3.41 | 0/213 | 4/182 | 24/135 | 76/97 |
Table 18: Percentage improvement over the fine-tuned E5 model with a frozen embedding block, tuned using a batch size of 14, a learning rate of 10⁻⁷ and a margin of 0.1. Blue indicates the top improvements, red the worst degradations. Three parameters are randomly varied: the batch size (denoted "B"), the learning rate scheduler (denoted "Sch") and the weight decay (denoted "D"). The learning rate schedulers are defined in Table 19, with an initial learning rate of 10⁻⁷. c% and d% refer to measuring the similarity of the text pairs using the cosine similarity or the Euclidean distance, respectively. For XNLI, (+) indicates the number of language pairs that improved and (−) the number that worsened, out of a total of 225 language pairs. Only the statistically significant changes (determined by a Z-test) are retained, hence the improved/worsened counts do not necessarily sum to 225. Additionally, "ent-neutr" refers to entailment-entailment similarities compared with entailment-neutral similarities, whereas "ent-contr" refers to comparisons against entailment-contradiction similarities.
Table 19: The definitions of the various learning rate schedulers used in Table 18, where t is the current training step, T is the total number of training steps, and α₀ is the initial learning rate.
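As a rough guide to the scheduler labels used throughout (L for linear, Q for quadratic, E_γ for exponential), the following are the standard textbook forms of such schedules, written in the notation of Table 19; they are given here for orientation and are assumptions, not necessarily the exact definitions of Table 19:

```latex
\begin{align*}
  \alpha_L(t)          &= \alpha_0 \Bigl(1 - \tfrac{t}{T}\Bigr)     && \text{linear decay} \\
  \alpha_Q(t)          &= \alpha_0 \Bigl(1 - \tfrac{t}{T}\Bigr)^{2} && \text{quadratic decay} \\
  \alpha_{E_\gamma}(t) &= \alpha_0 \, \gamma^{\,t}                  && \text{exponential decay, } \gamma \in \{0.95,\, 0.98\}
\end{align*}
```

All three start at α₀ at t = 0; the exponential schedules with γ close to 1 decay the slowest early on but, unlike the polynomial ones, never reach zero at t = T.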
Using the fine-tuned E5 model with the frozen embedding block, tuned using a batch size of 14 and a margin of 0.1, we randomly vary the batch size, learning rate scheduler and weight decay in order to assess their impact on the model's final performance. In Table 17 we present the change in performance across these different configurations for a learning rate of 5×10⁻⁸, which is our 'default' learning rate (Section 2.3). In Table 18 we do the same for a learning rate of 10⁻⁷. Table 19 lists the different schedulers we considered. Values in blue indicate the top improvements, whereas values in red indicate the worst degradations.
Across these parameters, on average, the batch size appears to have the most significant impact, with performance generally worsening as the batch size is decreased. Within each batch size group, using an exponential learning rate scheduler (E_0.95 or E_0.98) is generally worse than using any of the other schedulers or no scheduler at all. A specific exception occurs at a batch size of 100, where the exponential scheduler outperforms the quadratic one when the learning rate is set to 10⁻⁷. Across all the configurations considered, the most impactful appears to be the one shown in the first row of Table 17, which yields good improvement on MSMARCO and ARXIV-first while simultaneously improving on XNLI ent-neutr.
M.8 Varying the optimizer and learning rate
Table 20 shows the effects of choosing a different optimizer with a small and a large learning rate. In addition to Adamax, we tried Adadelta and Stochastic Gradient Descent (SGD), neither of which changed the model weights significantly enough to affect the overall performance; they are therefore not presented. For higher learning rates, SGD without momentum did elicit a change, as shown in Fig. 14, but the trend in performance is similar to that of Fig. 2, with higher resolution near the transition point between improved and degraded multilingual performance. At around 9×10⁻⁷, we see a sharp increase in the number of degraded language pairs while the model maintains constant improvement on MSMARCO. With a high enough learning rate, it seems that the gradients are able to overcome a barrier in the loss landscape that confined the weights to a region in which multilingual characteristics were preserved.
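For reference, the SGD update with and without momentum can be sketched on a single scalar weight (a generic textbook form, not the paper's training code):

```python
def sgd_step(w, g, v, lr, momentum=0.0):
    """One SGD update on a scalar weight w with gradient g.

    With momentum=0.0 this is plain SGD (the variant used for Fig. 14);
    otherwise v accumulates an exponentially weighted gradient history.
    """
    v = momentum * v + g   # velocity; equals g when momentum is 0
    w = w - lr * v         # gradient step
    return w, v

# Plain SGD: each update depends only on the current gradient.
w, v = sgd_step(0.0, 1.0, 0.0, lr=0.1, momentum=0.0)   # w -> -0.1
# With momentum, past gradients compound and the step grows over time.
w2, v2 = sgd_step(0.0, 1.0, 0.0, lr=0.1, momentum=0.9)
w2, v2 = sgd_step(w2, 1.0, v2, lr=0.1, momentum=0.9)   # v2 = 1.9, step ~1.9x larger
```

Without momentum, a constant-magnitude gradient produces constant-magnitude steps, which is consistent with the sharper transition observed near 9×10⁻⁷: once the per-step displacement is large enough to cross a barrier in the loss landscape, degradation sets in abruptly.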
Table 20: Percentage improvement over the untuned E5 model. O, M and LR represent the choice of optimizer, whether or not momentum was used, and the learning rate, respectively. All the models here are tuned with a batch size of 14, a margin of 0.1, and a frozen embedding block. Adamax with no momentum corresponds to choosing β₁ = β₂ = 0 for the optimizer parameters.
Figure 14: Evaluations on (a) XNLI and (b) the English-only datasets (MSMARCO and ARXIV) of the E5 query encoder tuned with a frozen embedding block, batch size 14, and margin 0.1, at different learning rates. Here we tune using SGD without momentum. Values that did not pass the two-tailed test are shown with open markers.
As the table shows, the default version of Adamax (with momentum) has a nearly negligible effect on the model when used with a small learning rate, suggesting that this configuration forces the model weights to change very slowly. When momentum is switched off, the model weights change enough to improve the overall performance in both English and other languages. Continuing down to the bottom row, raising the learning rate makes the model weights change more significantly, which brings less improvement in the model's multilingual capacity (still an improvement nonetheless) while maintaining the same improvement on English. Overall, going from the first row to the last row (for Adamax), we transition from a point in model weight space where performance on all languages can be enhanced or preserved to a point better suited for the English-only task defined in tuning.
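To make the β₁ = β₂ = 0 setting concrete: with both moment coefficients zeroed, the Adamax update reduces to a sign-like step of size lr·g/|g|, independent of the gradient's magnitude. A minimal single-parameter sketch of the standard Adamax rule (Kingma & Ba, 2015), for illustration only; the experiments use the library implementation:

```python
def adamax_step(w, g, state, lr, b1=0.9, b2=0.999, eps=1e-8):
    """One Adamax update on a scalar weight w with gradient g.

    state = (m, u, t): first moment, infinity-norm moment, step count.
    """
    m, u, t = state
    t += 1
    m = b1 * m + (1.0 - b1) * g   # exponentially weighted first moment
    u = max(b2 * u, abs(g))       # exponentially weighted infinity norm
    w = w - (lr / (1.0 - b1 ** t)) * m / (u + eps)
    return w, (m, u, t)

# With b1 = b2 = 0 ("no momentum"), m = g and u = |g|, so each step is
# approximately -lr * sign(g), regardless of the gradient's magnitude.
w, s = adamax_step(1.0, 0.5, (0.0, 0.0, 0), lr=0.1, b1=0.0, b2=0.0)
# w is approximately 1.0 - 0.1 = 0.9
```

This also suggests why the no-momentum variant moves the weights more at a small learning rate: its effective step size never shrinks with small gradients, whereas the momentum-averaged update does.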