Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Paper • 1908.10084 • Published • 12
This is a sentence-transformers model finetuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Pravallika6/cross-domain-embeddings")
# Run inference
sentences = [
'Are there any frameworks that adapt to different types of image segmentation tasks?',
'[ABSTRACT]\nWe propose a novel attention gate (AG) model for medical imaging that automatically learns to focus on target structures of varying shapes and sizes. Models\ntrained with AGs implicitly learn to suppress irrelevant regions in an input image\nwhile highlighting salient features useful for a specific task. This enables us to\neliminate the necessity of using explicit external tissue/organ localisation modules\nof cascaded convolutional neural networks (CNNs). AGs can be easily integrated\ninto standard CNN architectures such as the U-Net model with minimal computational overhead while increasing the model sensitivity and prediction accuracy.\nThe proposed Attention U-Net architecture is evaluated on two large CT abdominal\ndatasets for multi-class image segmentation. Experimental results show that AGs\nconsistently improve the prediction performance of U-Net across different datasets\nand training sizes while preserving computational efficiency. The source code for\nthe proposed architecture is publicly available.\n\n[1 Introduction]\nAutomated medical image segmentation has been extensively studied in the image analysis community\ndue to the fact that manual, dense labelling of large amounts of medical images is a tedious and\nerror-prone task. Accurate and reliable solutions are desired to increase clinical work flow efficiency\nand support decision making through fast and automatic extraction of quantitative measurements.\nWith the advent of convolutional neural networks (CNNs), near-radiologist level performance can\nbe achieved in automated medical image analysis tasks including cardiac MR segmentation [3] and\ncancerous lung nodule detection [17]. High representation power, fast inference, and filter sharing\nproperties have made CNNs the de facto standard for image segmentation. Fully convolutional\nnetworks (FCNs) [ 18] and the U-Net [ 24] are two commonly used architectures. Despite their\ngood representational power, these architectures rely on multi-stage cascaded CNNs when the target\norgans show large inter-patient variation in terms of shape and size. Cascaded frameworks extract a\nregion of interest (ROI) and make dense predictions on that particular ROI. The application areas\ninclude cardiac MRI [ 14], cardiac CT [ 23], abdominal CT [ 26, 27] segmentation, and lung CT\nnodule detection [17]. However, this approach leads to excessive and redundant use of computational\nresources and model parameters; for instance, similar low-level features are repeatedly extracted by\nall models within the cascade. To address this general problem, we propose a simple and yet effective\nsolution, namely attention gates(AGs). CNN models with AGs can be trained from scratch in a\nstandard way similar to the training of a FCN model, and AGs automatically learn to focus on target\n1st Conference on Medical Imaging with Deep Learning (MIDL 2018), Amsterdam, The Netherlands.\narXiv:1804.03999v3 [cs.CV] 20 May 2018\nstructures without additional supervision. At test time, these gates generate soft region proposals\nimplicitly on-the-fly and highlight salient features useful for a specific task. Moreover, they do not\nintroduce significant computational overhead and do not require a large number of model parameters\nas in the case of multi-model frameworks. In return, the proposed AGs improve model sensitivity and\naccuracy for dense label predictions by suppressing feature activations in irrelevant regions. In this\nway, the necessity of using an external organ localisation model can be eliminated while maintaining\nthe high prediction accuracy. Similar attention mechanisms have been proposed for natural image\nclassification [11] and captioning [1] to perform adaptive feature pooling, where model predictions\nare conditioned only on a subset of selected image regions. In this paper, we generalise this design\nand propose image-grid based gating that allows attention coefficients to be specific to local regions.\nMoreover, our approach can be used for attention-based dense predictions.\nWe demonstrate the implementation of AG in a standard U-Net architecture (Attention U-Net) and\napply it to medical images. We choose the challenging CT pancreas segmentation problem to provide\nexperimental evidence for our proposed contributions. This problem constitutes a difficult task due to\nlow tissue contrast and large variability in organ shape and size. We evaluate our implementation on\ntwo commonly used benchmarks: TCIA Pancreas CT-82 [25] and multi-class abdominal CT-150.\nThe results show that AGs consistenly improve prediction accuracy across different datasets and\ntraining sizes while achieving state-of-the-art performance without requiring multiple CNN models.\n\n[2 Methodology]\nFully Convolutional Network (FCN):Convolutional neural networks (CNNs) outperform traditional approaches in medical image analysis on public benchmark datasets [14, 17] while being an\norder of magnitude faster than, e.g., graph-cut and multi-atlas segmentation techniques [34]. This\nis mainly attributed to the fact that (I) domain specific image features are learnt using stochastic\ngradient descent (SGD) optimisation, (II) learnt kernels are shared across all pixels, and (III) image\nconvolution operations exploit the structural information in medical images well. In particular, fully\nconvolutional networks (FCN) [ 18] such as U-Net [ 24], DeepMedic [ 13] and holistically nested\nnetworks [16, 35] have been shown to achieve robust and accurate performance in various tasks\nincluding cardiac MR [3], brain tumours [12] and abdominal CT [26, 27] image segmentation tasks.\nConvolutional layers progressively extract higher dimensional image representations ( xl) by processing local information layer by layer. Eventually, this separates pixels in a high dimensional\nspace according to their semantics. Through this sequential process, model predictions are conditioned on information collected from a large receptive field. Hence, feature-map xl is obtained\nat the output of layer lby sequentially applying a linear transformation followed by a non-linear\nactivation function. It is often chosen as rectified linear unit: σ1 ( xl\ni,c) = max(0 ,xl\ni,c) where iand\nc denote spatial and channel dimensions respectively. Feature activations can be formulated as:\nxl\nc = σ1\n(∑\nc′∈Fl\nxl−1\nc′ ∗kc′,c\n)\nwhere ∗denotes the convolution operation, and the spatial subscript\n(i) is omitted in the formulation for notational clarity. The function f( xl; Φ l) = x(l+1) applied in\nconvolution layer lis characterised by trainable kernel parameters Φ l. The parameters are learnt by\n\nAttention GateLorem ipsum dolor sit amet,consectetur adipisicing elit, seddo eiusmod tempor incididunt utlabore et dolore magna aliqua.\n x x x \n x x x \nReLU \nx \n \nSigmoid Resampler\nx x \nFigure 2: Schematic of the proposed additive attention gate (AG). Input features ( xl) are scaled\nwith attention coefficients (α) computed in AG. Spatial regions are selected by analysing both the\nactivations and contextual information provided by the gating signal (g) which is collected from a\ncoarser scale. Grid resampling of attention coefficients is done using trilinear interpolation.\nminimising a training objective, e.g. cross-entropy loss, using stochastic gradient descent (SGD).\nIn this paper, we build our attention model on top of a standard U-Net architecture. U-Nets are\ncommonly used for image segmentation tasks because of their good performance and efficient use\nof GPU memory. The latter advantage is mainly linked to extraction of image features at multiple\nimage scales. Coarse feature-maps capture contextual information and highlight the category and\nlocation of foreground objects. Feature-maps extracted at multiple scales are later merged through\nskip connections to combine coarse- and fine-level dense predictions as shown in Figure 1.\nAttention Gates for Image Analysis:To capture a sufficiently large receptive field and thus, semantic contextual information, the feature-map grid is gradually downsampled in standard CNN\narchitectures. In this way, features on the coarse spatial grid level model location and relationship\nbetween tissues at global scale. However, it remains difficult to reduce false-positive predictions for\nsmall objects that show large shape variability. In order to improve the accuracy, current segmentation\nframeworks [14, 26, 27] rely on additional preceding object localisation models to simplify the task\ninto separate localisation and subsequent segmentation steps. Here, we demonstrate that the same\nobjective can be achieved by integrating attention gates (AGs) in a standard CNN model. This does\nnot require the training of multiple models and a large number of extra model parameters. In contrast\nto the localisation model in multi-stage CNNs, AGs progressively suppress feature responses in\nirrelevant background regions without the requirement to crop a ROI between networks.\nAttention coefficients, αi ∈[0,1], identify salient image regions and prune feature responses to\npreserve only the activations relevant to the specific task as shown in Figure 3a. The output of AGs is\nthe element-wise multiplication of input feature-maps and attention coefficients: ˆxl\ni,c = xl\ni,c ·αl\ni. In\na default setting, a single scalar attention value is computed for each pixel vector xl\ni ∈RFl where\nFl corresponds to the number of feature-maps in layer l. In case of multiple semantic classes, we\npropose to learn multi-dimensional attention coefficients. This is inspired by [ 29], where multidimensional attention coefficients are used to learn sentence embeddings. Thus, each AG learns to\nfocus on a subset of target structures. As shown in Figure 2, a gating vector gi ∈RFg is used for\neach pixel ito determine focus regions. The gating vector contains contextual information to prune\nlower-level feature responses as suggested in [32], which uses AGs for natural image classification.\nWe use additive attention [2] to obtain the gating coefficient. Although this is computationally more\nexpensive, it has experimentally shown to achieve higher accuracy than multiplicative attention [19].\nAdditive attention is formulated as follows:\nql\natt = ψT (\nσ1 ( WT\nx xl\ni + WT\ng gi + bg)\n)\n+ bψ (1)\nαl\ni = σ2( ql\natt(xl\ni, gi; Θatt) ), (2)\nwhere σ2(xi,c) = 1\n1+exp(−xi,c) correspond to sigmoid activation function. AG is characterised\nby a set of parameters Θatt containing: linear transformations Wx ∈RFl×Fint, Wg ∈RFg×Fint,\nψ ∈RFint×1 and bias terms bψ ∈R , bg ∈RFint. The linear transformations are computed using\nchannel-wise 1x1x1 convolutions for the input tensors. In other contexts [33], this is referred to as\nvector concatenation-based attention, where the concatenated features xl and gare linearly mapped\nto a RFint dimensional intermediate space. In image captioning [1] and classification [11] tasks, the\n\nFigure 3(a): From left to right (a-e, f-j): Axial and sagittal views of a\n3D abdominal CT scan, attention coefficients, feature activations of\na skip connection before and after gating. Similarly, (k-n) visualise\nthe gating on a coarse scale skip connection. The filtered feature\nactivations (d-e, i-j) are collected from multiple AGs, where a subset\nof organs is selected by each gate. Activations shown in (d-e, i-j)\nconsistently correspond to specific structures across different scans.\nFigure 3(b): The ground-truth pancreas\nsegmentation (a) is highlighted in blue\n(b). Similarly, U-Net model prediction\n(c) and the predictions obtained with Attention U-Net (d) are shown. The missed\ndense predictions by U-Net are highlighted with red arrows.\nsoftmax activation function is used to normalise the attention coefficients (σ2); however, sequential\nuse of softmax yields sparser activations at the output. For this reason, we choose a sigmoid activation\nfunction. This results experimentally in better training convergence for the AG parameters. In contrast\nto [11] we propose a grid-attention technique. In this case, gating signal is not a global single vector\nfor all image pixels but a grid signal conditioned to image spatial information. More importantly, the\ngating signal for each skip connection aggregates information from multiple imaging scales, as shown\nin Figure 1, which increases the grid-resolution of the query signal and achieve better performance.\nLastly, we would like to note that AG parameters can be trained with the standard back-propagation\nupdates without a need for sampling based update methods used in hard-attention [21].\nAttention Gates in U-Net Model:The proposed AGs are incorporated into the standard U-Net\narchitecture to highlight salient features that are passed through the skip connections, see Figure\n1. Information extracted from coarse scale is used in gating to disambiguate irrelevant and noisy\nresponses in skip connections. This is performed right before the concatenation operation to merge\nonly relevant activations. Additionally, AGs filter the neuron activations during the forward pass as\nwell as during the backward pass. Gradients originating from background regions are down weighted\nduring the backward pass. This allows model parameters in shallower layers to be updated mostly\nbased on spatial regions that are relevant to a given task. The update rule for convolution parameters\nin layer l−1 can be formulated as follows:\n∂(ˆxl\ni)\n∂(Φl−1) = ∂\n(\nαl\nif(xl−1\ni ; Φl−1)\n)\n∂(Φl−1) = αl\ni\n∂(f(xl−1\ni ; Φl−1))\n∂(Φl−1) + ∂(αl\ni)\n∂(Φl−1) xl\ni (3)\nThe first gradient term on the right-hand side is scaled with αl\ni. In case of multi-dimensional AGs, αl\ni\ncorresponds to a vector at each grid scale. In each sub-AG, complementary information is extracted\nand fused to define the output of skip connection. To reduce the number of trainable parameters\nand computational complexity of AGs, the linear transformations are performed without any spatial\nsupport (1x1x1 convolutions) and input feature-maps are downsampled to the resolution of gating\nsignal, similar to non-local blocks [ 33]. The corresponding linear transformations decouple the\nfeature-maps and map them to lower dimensional space for the gating operation. As suggested in\n[11], low-level feature-maps, i.e. the first skip connections, are not used in the gating function since\nthey do not represent the input data in a high dimensional space. We use deep-supervision [16] to\nforce the intermediate feature-maps to be semantically discriminative at each image scale. This helps\nto ensure that attention units, at different scales, have an ability to influence the responses to a large\nrange of image foreground content. We therefore prevent dense predictions from being reconstructed\nfrom small subsets of skip connections.\n\n[4 Discussion And Conclusion]\nIn this paper, we presented a novel attention gate model applied to medical image segmentation. Our\napproach eliminates the necessity of applying an external object localisation model. The proposed\napproach is generic and modular as such it can be easily applied to image classification and regression\nproblems as in the examples of natural image analysis and machine translation. Experimental\nresults demonstrate that the proposed AGs are highly beneficial for tissue/organ identification and\nlocalisation. This is particularly true for variable small size organs such as the pancreas, and similar\nbehaviour is expected for global classification tasks.\nTraining behaviour of the AGs can benefit from transfer learning and multi-stage training schemes.\nFor instance, pre-trained U-Net weights can be used to initialise the attention network, and gates can\nbe trained accordingly in the fine-tuning stage. Similarly, there is a vast body of literature in machine\nlearning exploring different gating architectures. For example, highway networks [7] make use of\nresidual connections around the gate block to allow better gradient backpropagation and slightly\nsofter attention mechanisms. Although our experiments with residual connections have not provided\nany significant performance improvement, future research will focus on this aspect to obtain a better\ntraining behaviour. Lastly, we note that with the advent of improved GPU computation power and\nmemory, larger capacity 3D models can be trained with larger batch sizes without the need for image\ndownsampling. In this way, we would not need to utilise ad-hoc post-processing techniques to further\nimprove the state-of-the-art results. Similarly, the performance of Attention U-Net can be further\nenhanced by utilising fine resolution input batches without additional heuristics. Lastly, we would\nlike to thank to Salim Arslan and Dan Busbridge for their helpful comments on this work.',
'[Preamble]\nImproving language models by retrieving\nfrom trillions of tokens\nSebastian Borgeaudy, Arthur Menschy, Jordan Hoffmanny, Trevor Cai, Eliza Rutherford, Katie Millican,\nGeorge van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas,\nAurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones,\nAlbin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero,\nKaren Simonyan, Jack W. Raez, Erich Elsenzand Laurent Sifrey,z\nAll authors from DeepMind,yEqual contributions,zEqual senior authorship\nWe enhance auto-regressive language models by conditioning on document chunks retrieved from a\nlarge corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our\nRetrieval-Enhanced Transformer (R/e.sc/t.sc/r.sc/o.sc) obtains comparable performance to GPT-3 and Jurassic-1\non the Pile, despite using 25\x02fewer parameters. After fine-tuning,R/e.sc/t.sc/r.sc/o.scperformance translates to\ndownstream knowledge-intensive tasks such as question answering.R/e.sc/t.sc/r.sc/o.sccombines a frozenB/e.sc/r.sc/t.sc\nretriever, adifferentiableencoderandachunkedcross-attentionmechanismtopredicttokensbasedon\nan order of magnitude more data than what is typically consumed during training. We typically train\nR/e.sc/t.sc/r.sc/o.scfrom scratch, yet can also rapidlyR/e.sc/t.sc/r.sc/o.scfit pre-trained transformers with retrieval and still\nachieve good performance. Our work opens up new avenues for improving language models through\nexplicit memory at unprecedented scale.\n\n[1. Introduction]\nLanguage modelling (LM) is an unsupervised task that consists of modelling the probability of text,\nusually by factorising it into conditional next-token predictions𝑝¹𝑥1\x94\x93\x93\x93\x94𝑥 𝑛º= Î\n𝑖 𝑝¹𝑥𝑖j𝑥\x9d𝑖º. Neural\nnetworks have proven to be powerful language models, first in the form of recurrent architectures\n(Graves, 2013; Jozefowicz et al., 2016; Mikolov et al., 2010) and more recently in the form of\nTransformers (Vaswani et al., 2017), that use attention to contextualise the past. Large performance\nimprovementshavecomefromincreasingtheamountofdata, trainingcompute, ormodelparameters.\nTransformers have been scaled from100 million parameter models in seminal work to over hundred\nbillion parameters (Brown et al., 2020; Radford et al., 2019) in the last two years which has led to\nmodels that do very well on a wide array of tasks in a zero or few-shot formulation. Increasing model\nsize predictably improves performance on a wide range of downstream tasks (Kaplan et al., 2020).\nThe benefits of increasing the number of parameters come from two factors: additional computations\nat training and inference time, and increased memorization of the training data.\nIn this work, we endeavor to decouple these, by exploring efficient means of augmenting language\nmodels with a massive-scale memory without significantly increasing computations. Specifically, we\nsuggest retrieval from a large text database as a complementary path to scaling language models.\nInstead of increasing the size of the model and training on more data, we equip models with the\nability to directly access a large database to perform predictions—a semi-parametric approach. At\na high level, our Retrieval Transformer (R/e.sc/t.sc/r.sc/o.sc) model splits the input sequence into chunks and\nretrieves text similar to the previous chunk to improve the predictions in the current chunk. Existing\nretrieval for language modelling work only considers small transformers (100 millions parameters)\nand databases of limited size (up to billions of tokens) (Guu et al., 2020; Khandelwal et al., 2020;\nLewisetal.,2020;Yogatamaetal.,2021). Toourknowledge, ourworkisthefirsttoshowthebenefits\nof scaling the retrieval database to trillions of tokens for large parametric language models. Our main\nCorresponding authors: {sborgeaud|amensch|jordanhoffmann|sifre}@deepmind.com\narXiv:2112.04426v3 [cs.CL] 7 Feb 2022\nImproving language models by retrieving from trillions of tokens\n200 400 800 1600 7500\nNumber of Non-Embedding Params (M)\n0.7\n0.8\n0.9\n1.0C4 Eval bits-per-byte\n172M 425M 1.5B 7.5B Baseline RETRO [OFF] RETRO [ON]\n0 1 10 100 1000 10000\nRetrieval dataset (B Tokens)\n0.7\n0.8\n0.9\n1.0\n0 1 3 5 10 30 50 100\nNumber of neighbors\n0.7\n0.8\n0.9\n1.0\nFigure 1jScaling ofR/e.sc/t.sc/r.sc/o.sc. The performance gain of our retrieval models remains constant with\nmodel scale (left), and is comparable to multiplying the parameteric model size by\x1810\x02. The gain\nincreases with the size of the retrieval database (middle) and the number of retrieved neighbours\n(right) on the C4 validation set, when using up to 40 neighbours. Past this, performance begins to\ndegrade, perhaps due to the reduced quality. At evaluationR/e.sc/t.sc/r.sc/o.sccan be used without retrieval\ndata (R/e.sc/t.sc/r.sc/o.sc[OFF]), bringing limited performance degradation compared to baseline transformers.\ncontributions are the following.\n• We introduceR/e.sc/t.sc/r.sc/o.sc, a retrieval-enhanced autoregressive language model (§2.2). We use a\nchunked cross-attention module to incorporate the retrieved text (§2.4), with time complexity\nlinear in the amount of retrieved data. We show that retrieving based on a pre-trained frozen\nB/e.sc/r.sc/t.scmodel (§2.3) works at scale, removing the need for training and updating a retriever\nnetwork.\n• We show that our method scales well with model size and database size (Fig. 1):R/e.sc/t.sc/r.sc/o.sc\nprovides a constant gain for models ranging from 150M to 7B parameters, andR/e.sc/t.sc/r.sc/o.sccan be\nimproved at evaluation time by increasing the database size and the number of retrieved neighbours. Our largest model obtains state-of-the-art results on a range of downstream evaluation\ndatasets including Wikitext103 (Merity et al., 2017) and the Pile (Gao et al., 2020) (§4). We\nshow thatR/e.sc/t.sc/r.sc/o.sccan be fine-tuned to achieve competitive performance on downstream tasks\nsuch as question answering (§4.3).\n• We propose an evaluation aware of proximity of test documents with the training set (§2.6),\naddressing the problem of test set leakage (Lee et al., 2021). This is relevant for all language\nmodels,andespeciallyforretrieval-enhancedmodelssincetheyhavedirectaccesstothetraining\ndataset during evaluation. Using this methodology, we show that the performance ofR/e.sc/t.sc/r.sc/o.sc\ncomes from both explicit neighbour copying and general knowledge extraction (§4.4).\n\n[2. Method]\nWedesignourretrieval-enhancedarchitecturetobecapableofretrievingfromadatabasewithtrillions\nof tokens. For this purpose, we retrieve at the level of contiguous tokenchunks instead of individual\ntokens which reduces storage and computation requirements by a large linear factor. Our method first\nconstructs a key-value database, where values store raw chunks of text tokens and keys are frozen\nB/e.sc/r.sc/t.scembedddings (Devlin et al., 2019). We use a frozen model to avoid having to periodically\nre-compute embeddings over the entire database during training. Each training sequence is then split\ninto chunks, which are augmented with their𝑘-nearest neighbour retrieved from the database. An\nencoder-decoder architecture integrates retrieval chunks into the model’s predictions. We summarize\nthe R/e.sc/t.sc/r.sc/o.scarchitecture in Fig. 2, and detail it in this section. We end the section by introducing\n\nImproving language models by retrieving from trillions of tokens\nCCA FFW\nTransformer \nEncoderRetrieval\ndataset\nFrozen kNN Retriever\nK V\nRETRO block (x L) \nNeighbours\nInput \ntokens\nChunked cross-attention (CCA)\nBERT\nBERT\nCondition\nAttending chunks\nEncoded neighbours\nCA\nCA\nATTN QEMB READ\nAttend\nEncoded neighbours\nC1\nC2\nC3\nH1\nH2\nH3\nH\nH1\n+\nH2\n+\nE1\n E2\nE1\nE2\nCA(H1\n+, E1)\nCA(H2\n+, E2)\nCCA(H, E)\nX\nFigure 2jR/e.sc/t.sc/r.sc/o.scarchitecture. Left: simplified version where a sequence of length𝑛= 12 is split\ninto𝑙 = 3 chunksofsize 𝑚 = 4. Foreachchunk, weretrieve 𝑘 = 2 neighboursof 𝑟 = 5 tokenseach. The\nretrieval pathway is shown on top.Right: Details of the interactions in theC/c.sc/a.scoperator. Causality is\nmaintained as neighbours of the first chunk only affect the last token of the first chunk and tokens\nfrom the second chunk.\na new methodology to evaluate language models when an evaluation set is partially present in the\ntraining set.\n\n[2. The Model Receives The Corresponding]\nvalues R/e.sc/t.sc¹𝐶º, ¹»𝑁1\x94𝐹1¼\x94\x93\x93\x93\x94 »𝑁𝑘\x94𝐹𝑘¼º. Both neighbour chunks and their continuations provide\nmeaningful improvements, as illustrated in our ablation study (Appendix D). We use a length64 for\nboth 𝑁𝑗 and 𝐹𝑗, thusR/e.sc/t.sc¹𝐶ºhas a shape of𝑘\x02𝑟 with 𝑟 = 128. To avoid retrieving the chunk𝐶𝑢¸1\nin the retrieval setR/e.sc/t.sc¹𝐶𝑢º, which would break causality during training, we filter out neighbours\noriginating from the same document as the training sequence𝑋.\nFor a database of𝑇 elements, we can query the approximate nearest neighbours inO¹log𝑇ºtime.\nWe use the SCaNN library (Guo et al., 2020) to achieve this. This means that we can query our\n2 trillion token database in10 ms whilst evaluating or sampling from the model; this expense is\namortized over a chunk length. Performing retrieval on-the-fly is too slow to keep up with the training\ncalculations—we leverage the frozen aspect of the embedding operatorB/e.sc/r.sc/t.scto precompute all\napproximate nearest neighbours and save the results as part of the data. In Fig. 9 in the Appendix, we\nshow results where we only retrieve neighbours within Wikipedia. We find that neighbours tend to\ncome from 2-3 links away from a given article whereas random articles are more than 5 links apart.\nTable1 jMassiveText. Thelastcolumnindicatesthesamplingweightduringtraining. Themultilingual\nsubsets include documents in 10 languages. The full breakdown is given in §A.1.\nSource Token count (M) Documents (M) Multilingual Sampling frequency\nWeb 977,563 1,208 Yes 55%\nBooks 3,423,740 20 No 25%\nNews 236,918 398 No 10%\nWikipedia 13,288 23 Yes 5%\nGitHub 374,952 143 No 5%\n\nImproving language models by retrieving from trillions of tokens\n2.4. R/e.sc/t.sc/r.sc/o.scmodel architecture\nOur model relies on an encoder-decoder transformer architecture, integrating the retrieved data\nthrough a cross-attention mechanism as introduced in Vaswani et al. (2017). First, the retrieved\ntokens R/e.sc/t.sc¹𝐶ºare fed into an encoder Transformer, which computes the encoded neighbours set𝐸.\nDenoting the intermediate activations by𝐻, our transformer decoder then interleavesR/e.sc/t.sc/r.sc/o.sc-blocks\nR/e.sc/t.sc/r.sc/o.sc¹𝐻\x94𝐸 ºand standard Transformer blocksLM ¹𝐻º(the hyperparameter𝑃 \x12»1\x94𝐿¼determines at\nwhich layers we use aR/e.sc/t.sc/r.sc/o.sc-block). These blocks are built from three different residual operators\nwith signatureℝ𝑛\x02𝑑 !ℝ𝑛\x02𝑑: a fully-connected layerF/f.sc/w.sc, the standard sequence-level self-attention\nlayer A/t.sc/t.sc/n.sc, and a chunked cross-attention layerC/c.sc/a.sc¹\x01\x94𝐸ºthat incorporates information from the\nretrieval encoder:\nR/e.sc/t.sc/r.sc/o.sc¹𝐻\x94𝐸 º, F/f.sc/w.sc¹C/c.sc/a.sc¹A/t.sc/t.sc/n.sc¹𝐻º\x94𝐸ºº\x94 and L/m.sc¹𝐻º, F/f.sc/w.sc¹A/t.sc/t.sc/n.sc¹𝐻ºº (2)\nSince F/f.sc/w.sc, A/t.sc/t.sc/n.scand C/c.sc/a.scare all autoregressive operators whose output at position𝑖 only\ndepends on ¹ℎ𝑗º𝑗6𝑖, any succession ofR/e.sc/t.sc/r.sc/o.scand /l.sc/m.sclayers, followed by a token classification\nhead defines an autoregressive log-likelihood(1). An overview of the model architecture is given in\nAlgorithm 1 and in Fig. 2. We next describe the retrieval encoder and the chunked cross-attention\nlayer in more detail, and explain how to sample fromR/e.sc/t.sc/r.sc/o.sc.\nEncodingretrievalneighbours. Foreachchunk 𝐶𝑢,the 𝑘retrievalneighbours R/e.sc/t.sc¹𝐶𝑢ºarefedinto\na bi-directional transformerE/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc, yielding the outputs𝐸𝑗\n𝑢 , E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º𝑗\x94𝐻𝑢º2 ℝ𝑟\x02𝑑0\n,\nwhere 𝑗 2 »1\x94𝑘¼indexes each neighbour. The retrieval encoder is a non-causal transformer. It\nis conditioned on𝐻𝑢, the activations of chunk𝐶𝑢, through cross-attention layers; this allows the\nrepresentations of the retrieval encoder to be modulated by the retrieving chunk in a differentiable\nway. More precisely, the encoding of the𝑗th neighbour of the𝑢th chunk, R/e.sc/t.sc¹𝐶𝑢º𝑗, depends on the\nattended activation 𝐻𝑢 , ¹ℎ¹𝑢\x001º𝑚¸𝑖º𝑖2»1\x94𝑚¼ 2ℝ𝑚\x02𝑑 of chunk𝐶𝑢 at layermin¹𝑃º. All neighbours for\nall chunks are encoded in parallel, yielding a full encoded set𝐸 , ¹𝐸𝑗\n𝑢º𝑢2»1\x94𝑙¼\x94𝑗2»1\x94𝑘¼ 2ℝ𝑙\x02𝑘\x02𝑟\x02𝑑0\n. We\ndenote 𝐸𝑢 2ℝ𝑘\x02𝑟\x02𝑑0\nas the encoded neighbours for chunk𝑢 2»1\x94𝑙¼.\nChunked cross-attention. To perform theC/c.sc/a.scoperation, we first split a given intermediate activation 𝐻 2ℝ𝑛\x02𝑑 into 𝑙\x001 attending chunks\n\x10\n𝐻¸\n𝑢 , ¹ℎ𝑢𝑚¸𝑖\x001º𝑖2»1\x94𝑚¼ 2ℝ𝑚\x02𝑑\n\x11\n𝑢2»1\x94𝑙\x001¼\n, as depicted on the\nright of Fig. 2.𝐻¸\n𝑢 holds the intermediary embeddings of the last token in chunk𝐶𝑢 and of the first\n𝑚\x001 tokens in𝐶𝑢¸1 2. We compute the cross-attention between𝐻¸\n𝑢 and 𝐸𝑢—the encoded retrieval\nset obtained from chunk𝐶𝑢. Attention is computed across time and across neighbours simultaneously,\nas we merge the neighbour and time dimensions of𝐸𝑢 before applying cross-attention. Since there\nis a notion of alignment between data chunks and retrieval neighbours, we use relative positional\nencodings as described in §B.1.2.\nWe concatenate the𝑙\x001 outputs of the per-chunk cross-attentions (each of shape𝑚\x02𝑑) across\ntime, and properly pad the result; we thus form the output activationC/c.sc/a.sc¹𝐻\x94𝐸 º2 ℝ𝑛\x02𝑑. Formally,\nfor each chunk𝐶𝑢 and for each token𝑖 2»1\x94𝑚¼we set\nC/c.sc/a.sc¹𝐻\x94𝐸 º𝑢𝑚¸𝑖\x001 , C/a.sc¹ℎ𝑢𝑚¸𝑖\x001\x94𝐸𝑢º\x94 (3)\n2The last token of chunk𝐶𝑢 is the first to be able to access the retrieved content𝐸𝑢 while maintaining autoregressivity\nin (1). Hence, there is a one token overlap between chunk𝐶𝑢 =\n\x10\n𝑥¹𝑢\x001º𝑚¸𝑖\n\x11\n𝑖2»1\x94𝑚¼\nand the corresponding attending chunk\n𝐶¸\n𝑢 , ¹𝑥𝑢𝑚¸𝑖\x001º𝑖2»1\x94𝑚¼.\n\nImproving language models by retrieving from trillions of tokens\nAlgorithm 1: Overview ofR/e.sc/t.sc/r.sc/o.scmodel architecture.\nHyperparam: 𝑃 and 𝑃enc, indices of layers with cross-attention in the decoder and encoder\nrespectively\nHyperparam: 𝐿and 𝐿enc, number of decoder layers and number of encoder layers.\nInput: 𝑋 2𝕍𝑛: sequence of tokens.¹R/e.sc/t.sc¹𝐶𝑢ºº16𝑢6𝑙: the retrieved neighbours\nOutput: 𝑂 2ℝ𝑛\x02j𝕍j: the output logits\ndef E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º16𝑢6𝑙\x94𝐻 º:\n¹𝐻𝑢º𝑢2»1\x94𝑙¼ S/p.sc/l.sc/i.sc/t.sc¹𝐻º\nfor 𝑗 2»1\x94𝑘¼\x94𝑢 2»1\x94𝑙¼do // Encoder shared across neighbours and chunks\n𝐸𝑗\n𝑢 = E/m.sc/b.scenc¹R/e.sc/t.sc¹𝐶𝑢º𝑗º // May be shared with the decoder E M B\nfor 𝑝02»1\x94𝐿enc¼do\n𝐸𝑗\n𝑢 A/t.sc/t.sc/n.scenc¹𝐸𝑗\n𝑢º // Bi-directional attention\nif 𝑝02𝑃enc then\n𝐸𝑗\n𝑢 C/a.scenc¹𝐸𝑗\n𝑢\x94𝐻𝑢º\n𝐸𝑗\n𝑢 F/f.sc/w.scenc¹𝐸𝑗\n𝑢º\nreturn 𝐸\n𝐻 E/m.sc/b.sc¹𝑋º\nfor 𝑝 2»1\x94𝐿¼do\n𝐻 A/t.sc/t.sc/n.sc¹𝐻º // Causal attention\nif 𝑝= min¹𝑃ºthen\n// The neighbour E N C O D E Ris conditioned with the decoder activations of\nthe last layer before the first cross-attention\n𝐸 = E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º16𝑢6𝑙\x94𝐻 º\nif 𝑝 2𝑃 then\n𝐻 C/c.sc/a.sc¹𝐻\x94𝐸 º\n𝐻 F/f.sc/w.sc¹𝐻º\n𝑂 R/e.sc/a.sc/d.sc¹𝐻º\nwhere C/a.scis the cross-attention residual operator over time-concatenated encoded neighbours. We\nrecall that this operator is defined in its simplest version by three parameter matrices𝐾 2ℝ𝑑\x02𝑐\x94𝑄 2\nℝ𝑑\x02𝑐 and𝑉 2ℝ𝑑\x02𝑑. For allℎ 2ℝ𝑑 and𝑌 2ℝ𝑇\x02𝑑, we define\nC/a.sc¹ℎ\x94𝑌º, softmax¹𝑌𝐾𝑄𝑇ℎº𝑌𝑉\x94 (4)\nwhere the softmax is performed on the second dimension and all products are matrix products. We\nuse multi-head cross-attention, and add positional encodings to the softmax(see §B.1.2).\nThe first𝑚\x001 tokens cannot attend to any neighbour of a previous chunk; at these positions, we\ndefine C/c.sc/a.scas the identity, settingC/c.sc/a.sc¹𝐻\x94𝐸 º𝑗 , ℎ𝑗 for all tokens𝑗 2»1\x94𝑚 \x001¼. Finally, the last token\nℎ𝑙𝑚 attends to the last retrieval set𝐸𝑙 and we setℎ𝑙𝑚 , C/a.sc¹ℎ𝑙𝑚\x94𝐸𝑙º(not shown in Fig. 2). Listing 1\ncontains a simplified implementation ofC/c.sc/a.sc. Note that chunked cross-attention is autoregressive:\nthe output ofC/c.sc/a.scat position𝑖depends on the sequence from tokens from0 to 𝑖that is input toC/c.sc/a.sc.\nWith R/e.sc/t.sc/r.sc/o.scmodels, even though eachC/c.sc/a.sccross-attention attends only to the neighbours of\nthe preceding chunkR/e.sc/t.sc¹𝐶𝑢\x001º, the dependencies over previous neighbours are propagated via the\nself-attentionoperations. Theactivationsofthe 𝑖th tokeninthe 𝑢th chunkthereforepotentiallydepend\nupon the set ofallprevious neighboursR/e.sc/t.sc¹𝐶𝑢0º𝑢0\x9d𝑢, without incurring the quadratic cost of cross\nattending to that set.\n\nImproving language models by retrieving from trillions of tokens\nSampling. Whensampling,attheendofachunk 𝐶𝑢,weuseSCaNNtoretrieveneighbours R/e.sc/t.sc¹𝐶𝑢º,\nbased on the embeddingB/e.sc/r.sc/t.sc¹𝐶𝑢º. The encoded neighbours𝐸𝑢 = E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢ººare then\nused to condition the generation of the next chunk𝐶𝑢¸1, which we do incrementally: overall the\ncost of sampling is thus quadratic in the size of the sampled sequence, as when sampling from\nregular Transformers; the added cost of retrieval is linear in the number of chunks𝑙, and is negligible\ncompared to the token sampling cost in practice.\n\n[2.5. Baseline Transformer Architecture]\nWe use a transformer (Vaswani et al., 2017) similar to the one described in (Radford et al., 2019),\nwith some minimal changes: we replace LayerNorm with RMSNorm (Zhang and Sennrich, 2019) and\nuse relative position encodings (Dai et al., 2019). As baselines, we train retrieval-free transformers\nwith 132M, 368M, 1.3B and 7.0B parameters (embedding matrices are excluded from parameter\ncounts). The hyperparameters we used are detailed in Table 2. All retrieval models use the same\nsize encoder for the retrieval data, with𝑑0= 896 and 2 layers, which roughly adds19𝑀 parameters.\nThe encoder uses relative positional encodings. The retrieval models contain oneR/e.sc/t.sc/r.sc/o.sc-block every\n3 blocks, starting from layer 6. For our smallest model,C/c.sc/a.scis applied in layers 6, 9 and 12 of the\nmain pathway and also once for query conditioning in the encoder, which adds an additional12𝑀\nparameters. The relative number of extra parameters reduces as we increase the baseline model size.\nAll models are implemented using JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020).\n\n[5. Conclusion]\nWe present Retrieval-Enhanced Transformers (R/e.sc/t.sc/r.sc/o.sc), a method for modelling arbitrary text sequences whilst retrieving from databases with trillions of tokens—scaling the data available to models\nby an order of magnitude compared to what is typically consumed during training.R/e.sc/t.sc/r.sc/o.scmodels\n\nImproving language models by retrieving from trillions of tokens\ngains do not diminish for models with up to at least 7B parameters, and correspond to non-retrieval\nmodels with 10\x02more parameters on certain datasets. On Wikitext103 and the Pile,R/e.sc/t.sc/r.sc/o.scoutperforms previous models trained on large scale datasets. We also show thatR/e.sc/t.sc/r.sc/o.scis competitive on\nretrieval-intensive downstream tasks such as question answering.\nR/e.sc/t.sc/r.sc/o.scmodels are flexible and can be used without retrieval at evaluation and still achieve\ncomparable performance to baseline models. Conversely, baseline models can be rapidly fine-tuned\ninto R/e.sc/t.sc/r.sc/o.scmodelstoobtainnearlythesameperformanceasiftrainedfromscratch. Carefulanalysis\nshows that only a modest fraction of the gains obtained byR/e.sc/t.sc/r.sc/o.scare due to test set leakage. In\ngeneral, we caution for such leakage in large-scale language datasets and suggest further work in\nbetter understanding the role of test set leakage in the performance of large-scale language models.\nOverall, our work demonstrates at an unprecedented scale that semi-parametric approaches can\nprovide an orthogonal, more efficient approach than raw parameter scaling as we seek to build more\npowerful language models.\nAcknowledgements\nWe would like to thank Nikolai Grigorev, Marc’aurelio Ranzato, Cyprien de Masson d’Autume, Po-Sen\nHuang, JohannesWelbl, LisaAnneHendricks, EthanPerez, JeffStanway, EricNoland, GregoryWayne,\nJohn Jumper, Julian Schrittwieser, Lorrayne Bennett, Devang Agrawal, Dani Yogatama, Susannah\nYoung, Nando de Freitas, Demis Hassabis, and Koray Kavukcuoglu for their help, advice and reviews.\nAdditionally, we would like to thank Zonglin Li, David Simcha, and the ScaNN developers for their\nhelp.\n\nImproving language models by retrieving from trillions of tokens\nTable 6jSample - Beavers are interesting animals. TheR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]sample quickly diverges to other\nanimalswhilethe R/e.sc/t.sc/r.sc/o.sc[O/n.sc]sampletendstostayfocusedonthebeavertopicduetoneighbourconditioning.\nPrompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]Prompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/n.sc]»𝑁1𝑢\x94𝐹1𝑢¼colored by LCP with𝐶𝑢¸1 »𝑁2𝑢\x94𝐹2𝑢¼colored by LCP with𝐶𝑢¸1\ncolored by LCP withR/e.sc/t.sc¹𝐶𝑢\x001º\nLCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5\nBeavers are interesting animals that Beavers are interesting animals that .Beaversbuildtheir lodges in pon naw them into smaller sectionsand d\nlive near rivers. They build live near rivers. They build ds they have createdin wooded areas rag them intothe water.Engineers\n.Like many thingsin nature, there ofthe PondBeaversare interesting\nis a connection between creaturesi animals because they changethe hab\nn the wild.Beaverponds causetree itat in which theylive.Beaversdo\ns todrown, butthe dead trees attra this by blockingup streams to creat\nct the great blue heron, which often e ponds. Then theybuildtheir homes\nreturn year after year. Over time, , calledlodges, inthese ponds. Bea\na beaverpond can attract more than vers’ bodies make them well-suited f\n50 nestsin a colony, calleda rooke or underwater building Special muscl\nry.An example of this canbe found es close offtheir noses, ears, and\nin the large pond off Bradford Road throats to keep the water out. Beave\nat Carter Fields nearthe Boxford l rs’ broadtails act like rudders for\nine.Chris Leahy, an expert withth steering. Their two very large, ora\ne Massachusetts Audubon Society who nge frontteeth are usedto gnaw dow\nwrote n trees. They begin buildingtheir d\nam\ndams to create ponds. Frogs are am their houses called beaverdams in , then they mean that youare veryb ar-like tail, andtwo protrudingtee\nphibians, so they can live in both l the riverbeds. They alsoliveon lan usy. Beaversswim easily in streams, th that are strongenoughto gnaw do\nand and water. They have great camou d.Beaversuse their strong teeth an picking uprocks and sticks to buil wn trees. Thebeaveruses trees, bra\nflage to hide from predators. The G d strong jawsto cut downtrees and d their dams. They gnaw at trees wit nches, andmudto builddams across\nolden Retriever, or Golden, is a ver branches to buildtheir homes. They h their big frontteeth to cut them riversand streams. These dams creat\ny big dog. They are very strong, and also usetheir teeth andjawsto che down. Then they useparts of the tre e a deep poolofwater in whichthe\nhave a thick coat to help them live w up food. Beaversuse their big, fl es tobuildtheir houses.Beavers ar beaverbuilds its home. A beaverhom\nin at tails toswim. They use e clever builders. They know exactly e is calleda lodge. A babybeavero\nwhat they need to buildtheir beave r “kit” remains inthe family lodge\nr dams. They use mud fromthe stream until the age oftwo.Beaverfur, kn\nto maketheir dams stay together. T own as pelt, was once highly popular\nhey use their tails to pat downthe asa trim for hatsand coats. How m\nmud.Beaversput a snug roomat the ight the popularity of beaverfur co\ntop of theirdams fortheir babies. ntributedto thecolonizationofNew\nThey store their food underwater. Be York?wwwWw\naverseat the bark fromthe\ncold areas. A kangaroo is an anima their sharp teeth to gnaw atrocks , thenthey mean that you are very b w attrees with their big frontteet\nl that is very large. It can jump ve and cut trees. Theyuse their flat t usy. Beaversswim easily in streams, h to cutthem down. Then they use pa\nry high. They can run very fast. A ails to steer whenthey areswimming picking up rocks and sticks to buil rts of the trees tobuildtheir hous\nmeerkat is a small animal that lives .In the winter, beavershaveto bui d their dams. They gnaw attrees wit es.Beavers are clever builders. The\nin Africa. They live in colonies of ld a house calleda beaverlodge. Th h their big frontteeth to cutthem y know exactly whatthey need to bui\nup to a hundred of them. They can c ey build ahouse that is very strong down. Then they use parts of the tre ld their beaver dams. They use mud f\nlimb trees very easily. . The wallsare made oftwigs. The r es tobuildtheir houses.Beavers ar rom the stream to maketheir dams st\noofis made e clever builders. They know exactly ay together. They use their tails to\nwhat they need to buildtheir beave pat downthe mud.Beavers put a snu\nr dams. They use mud fromthe stream g room atthe top oftheir dams for\nto maketheir dams stay together. T their babies. They store their food\nhey use their tails to pat downthe underwater. Beavers eatthe bark fro\nmud.Beavers put a snug room atthe m the treesthat they cutdown!1. W\ntop oftheir dams fortheir babies. hat isthe main ideaof thefirst pa\nThey store their food underwater. Be ragraph?.2. What isthe main ideao\navers eatthe bark fromthe f thesecond paragraph?\nA mouse is a small mammal that lives ofbranches and other treeparts. T\non land. It is a very good climber hey also use their strong jawsto cu\nand it can run very fast. Penguins t trees. They bring them to theirho\nare birds that live on Antarctica. T use. They alsouse their sharp teeth\nhey have a thick coat to keep them w to chew up thetree parts. They use\narm. Rabbits are small animals that their flat tails to swim to thetop\nlive in the ground. They oftheir house. Then they use their\nteeth andjawsto chew up thetree\n\nImproving language models by retrieving from trillions of tokens\nTable 7jSample - Hamlet, Act 1, Scene 1.The R/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]sample has correct syntax but is hallucinated,\nand ends with repetition of one character (FRANCISCO Approach me not). TheR/e.sc/t.sc/r.sc/o.sc[O/n.sc]sample is the\ncorrect continuation of the original text, and is robust to formatting differences between our prompt and the\nretrieved data.\nPrompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]Prompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/n.sc]»𝑁1𝑢\x94𝐹1𝑢¼colored by LCP with𝐶𝑢¸1 »𝑁2𝑢\x94𝐹2𝑢¼colored by LCP with𝐶𝑢¸1\ncolored by LCP withR/e.sc/t.sc¹𝐶𝑢\x001º\nLCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5\nACT I SCENE I. Elsinore. A pla ACT I SCENE I. Elsinore. A pla ACTI SCENE I.Elsinore. A pla ><TEXT>ACTI SCENE I.Elsin\ntform before the castle. FRANC tform before the castle. FRANC tform beforethe castle. (FRAN ore. A platform beforethe cas\nISCO at his post. Enter to him ISCO at his post. Enter to him CISCO athis post. Enter to hi tle. FRANCISCO at his post. E\nBERNARDO BERNARDO Who’s there BERNARDO BERNARDO Who’s there m BERNARDO) BERNARDO Who’s the nter to him BERNARDO BERNARDO\n? FRANCISCO Nay, answer me: st ? FRANCISCO Nay, answer me: st re? FRANCISCO Nay, answer me: Who’s there? FRANCISCO Nay, an\nand, and unfold yourself. BERN and, and unfold yourself. BERN stand, and unfold yourself. BE swer me: stand, and unfold you\nARDO ARDO RNARDO Long livethe king! FRA rself. BERNARDO Long livethe\nNCISCO Bernardo? BERNARDO He. king! FRANCISCO Bernardo? BERN\nFRANCISCO You come most carefu ARDO He. FRANCISCO You come mo\nlly upon your hour. BERNARDO ’ st carefully upon your hour. B\nTis now struck twelve; get the ERNARDO ’Tis now struck twelve\ne to bed, Francisco. FRANCISCO ; get thee to bed, Francisco.\nFor this relief much thanks: FRANCISCO For this relief much\n’tis bitter cold, And I am sic thanks: ’tis bitter cold, And\nk at heart. BERNARDO Have you I am sick at heart.\nWho calls ? FRANCISCO I am th Long livethe king! FRANCISCO Long live the king! FRANCISCO live the king! FRANCISCO Bern\ne lord here; I, Francisco, tha Bernardo? BERNARDO He. FRANCI Bernardo? BERNARDO He. FRANCI ardo? BERNARDO He. FRANCISCO Y\nt am sick of grief. [ Aside. B SCO You come most carefully up SCO You come most carefully up ou come most carefully upon yo\nERNARDO The king ! FRANCISCO I on your hour. BERNARDO ’Tis no on your hour. BERNARDO ’Tis no ur hour. BERNARDO ’Tis now str\nam sick of that also. BERNARD w struck twelve; get thee to b w struck twelve; get thee to b uck twelve: get thee to bed, F\nO My lord ? FRANCISCO Do not a ed, Francisco. FRANCISCO For t ed, Francisco. FRANCISCO For t rancisco. FRANCISCO For this r\npproach me. BERNARDO his relief much thanks: ’tis b his relief much thanks: ’tis b elief much thanks: ’tis bitter\nitter cold, And I am sick at h itter cold, And I am sick at h cold, And I am sick at heart.\neart. B eart.</TEXT></DOC><DOC><DO BERNARDO Haveyou had quiet g\nCNO>romeo</DOCNO><TEXT>ACTI uard? FRANCISCO Not a mouse st\nPROLOGUE Two households, bo irring. BERNARDO Well, good ni\nth alike in dignity, In fair V ght. Ifyoudo meet Horatio and\nerona, where we lay our scene, Marcellus, The rivals2 ofmy\nFrom ancient grudge break to watch, bid them make haste. FR\nnew mutiny, ANCISCO I think I hear them.—\nStand, ho! who is there?EN\nFrancisco, I would speak with ERNARDO Haveyou had quiet gua had quiet guard? FRANCISCO No ARDO Haveyouhad quiet guard?\nyou. FRANCISCO Approach me not rd? FRANCISCO Not a mouse stir t a mouse stirring. BERNARDO W FRANCISCO Not a mouse stirrin\n, but speak. BERNARDO Your han ring. BERNARDO Well, good nigh ell, goodnight. If youdo mee g. BERNARDO Well, goodnight.\nd, your voice FRANCISCO I will t. Ifyou domeet Horatio and t Horatioand Marcellus, The r Ifyou do meet Horatioand Marc\nnot hear thee speak. BERNARDO Marcellus, The rivalsofmy wa ivals ofmy watch, bid them ma ellus, The rivals2 ofmy watch\nFrancisco, your hand, I entre tch, bid them make haste. FRAN ke haste. FRANCISCO I think I , bid them make haste. FRANCIS\nat thee. FRANCISCO Approach me CISCO I think I hear them. Sta hear them. Stand, ho! Who’s th CO I think I hear them.— Stand\nnot. BERNARDO Francisco FRANC nd, ho! who is there?Enter ere? (Enter HORATIO and MARCEL , ho! who is there? ENTER HORA\nLUS) HORATIO Friends to this g TIO AND MARCELLUS. HORATIO Fri\nround. MARCELLUS And liegemen ends to this ground. MARCELLUS\nto the Dane. FRANCISCO Give yo And liegemen to the Dane.3 FR\nu good night. MARCELLUS O, far ANCISCO Giveyougood night. M\newell, honest soldier: Who hat ARCELLUS O, farewell, honest s\nh relieved you? FRANCISCO Bern oldier: Who hath relieved you?\nardo hasmy place. Giveyou go FRANCISCO Bernardo hath my pl\nod night.(Exit ace. Give you good night\nISCO Approach me not. BERNARDO HORATIO and MARCELLUSHORATIO\nI have a letter FRANCISCO App Friends to this ground. MARCE\nroach me not. BERNARDO For the LLUS And liegemen to the Dane.\nking. FRANCISCO Approach me n FRANCISCO Give you good night\not. BERNARDO There’s no treaso . MARCELLUS O, farewell, hones\nn in’t. FRANCISCO Approach me t soldier: Who hath relieved y\nnot. BERNARDO I will ou? FRANCISCO Bernardo hath my\nplace. Give you good night.\n\nImproving language models by retrieving from trillions of tokens',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000, 0.4898, -0.0854],
# [ 0.4898, 1.0000, -0.0389],
# [-0.0854, -0.0389, 1.0000]])
sentence_0 and sentence_1| sentence_0 | sentence_1 | |
|---|---|---|
| type | string | string |
| details |
|
|
| sentence_0 | sentence_1 |
|---|---|
What methods do language models use to predict mutation effects on proteins? |
[ABSTRACT] |
How can I efficiently model long sequences in machine learning? |
[ABSTRACT] |
What methods exist for learning from interconnected datasets? |
[ABSTRACT] |
MultipleNegativesRankingLoss with these parameters:{
"scale": 20.0,
"similarity_fct": "cos_sim",
"gather_across_devices": false,
"directions": [
"query_to_doc"
],
"partition_mode": "joint",
"hardness_mode": null,
"hardness_strength": 0.0
}
per_device_train_batch_size: 32per_device_eval_batch_size: 32multi_dataset_batch_sampler: round_robinper_device_train_batch_size: 32num_train_epochs: 3max_steps: -1learning_rate: 5e-05lr_scheduler_type: linearlr_scheduler_kwargs: Nonewarmup_steps: 0optim: adamw_torch_fusedoptim_args: Noneweight_decay: 0.0adam_beta1: 0.9adam_beta2: 0.999adam_epsilon: 1e-08optim_target_modules: Nonegradient_accumulation_steps: 1average_tokens_across_devices: Truemax_grad_norm: 1label_smoothing_factor: 0.0bf16: Falsefp16: Falsebf16_full_eval: Falsefp16_full_eval: Falsetf32: Nonegradient_checkpointing: Falsegradient_checkpointing_kwargs: Nonetorch_compile: Falsetorch_compile_backend: Nonetorch_compile_mode: Noneuse_liger_kernel: Falseliger_kernel_config: Noneuse_cache: Falseneftune_noise_alpha: Nonetorch_empty_cache_steps: Noneauto_find_batch_size: Falselog_on_each_node: Truelogging_nan_inf_filter: Trueinclude_num_input_tokens_seen: nolog_level: passivelog_level_replica: warningdisable_tqdm: Falseproject: huggingfacetrackio_space_id: trackioeval_strategy: noper_device_eval_batch_size: 32prediction_loss_only: Trueeval_on_start: Falseeval_do_concat_batches: Trueeval_use_gather_object: Falseeval_accumulation_steps: Noneinclude_for_metrics: []batch_eval_metrics: Falsesave_only_model: Falsesave_on_each_node: Falseenable_jit_checkpoint: Falsepush_to_hub: Falsehub_private_repo: Nonehub_model_id: Nonehub_strategy: every_savehub_always_push: Falsehub_revision: Noneload_best_model_at_end: Falseignore_data_skip: Falserestore_callback_states_from_checkpoint: Falsefull_determinism: Falseseed: 42data_seed: Noneuse_cpu: Falseaccelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}parallelism_config: Nonedataloader_drop_last: Falsedataloader_num_workers: 0dataloader_pin_memory: Truedataloader_persistent_workers: Falsedataloader_prefetch_factor: Noneremove_unused_columns: Truelabel_names: Nonetrain_sampling_strategy: randomlength_column_name: lengthddp_find_unused_parameters: Noneddp_bucket_cap_mb: Noneddp_broadcast_buffers: Falseddp_backend: Noneddp_timeout: 1800fsdp: []fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}deepspeed: Nonedebug: []skip_memory_metrics: Truedo_predict: Falseresume_from_checkpoint: Nonewarmup_ratio: Nonelocal_rank: -1prompts: Nonebatch_sampler: batch_samplermulti_dataset_batch_sampler: round_robinrouter_mapping: {}learning_rate_mapping: {}@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{oord2019representationlearningcontrastivepredictive,
title={Representation Learning with Contrastive Predictive Coding},
author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
year={2019},
eprint={1807.03748},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1807.03748},
}
Base model
sentence-transformers/all-MiniLM-L6-v2