Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Paper • arXiv:1908.10084
How to use Pravallika6/cross-domain-embeddings with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Pravallika6/cross-domain-embeddings")
sentences = [
"Can someone explain how to model dependencies in relational information?",
"[Preamble]\nRodr´ ıguez et al.\nGeneralized Boozer coordinates: a natural coordinate system for\nquasisymmetry\nE. Rodr ´ ıguez,1, 2,a) W. Sengupta,1, 2 and A. Bhattacharjee 1, 2\n1)Department of Astrophysical Sciences, Princeton University, Princeton, NJ,\n\n2)Princeton Plasma Physics Laboratory, Princeton, NJ, 08540\n(Dated: 22 September 2021)\nWe prove the existence of a straight-field-line coordinate system we call generalized Boozer coordinates. This\ncoordinate system exists for magnetic fields with nested toroidal flux surfaces provided\n∮\ndl/B(j ·∇ψ) = 0,\nwhere symbols have their usual meaning, and the integral is taken along closed magnetic field lines. All\nquasisymmetric fields, regardless of their associated form of equilibria, must satisfy this condition. This\ncoordinate system presents itself as a convenient form in which to describe general quasisymmetric configurations and their properties. Insight can be gained analytically into the difference between strong and weak\nforms of quasisymmetry, as well as axisymmetry, and the interaction of quasisymmetry with different forms\nof equilibria.\nI. INTRODUCTION:\nQuasisymmetric stellarators1–3 (sometimes referred to\nas helically-symmetric4,5) constitute an attractive choice\nfor magnetic confinement fusion. Theoretically, such\ndesigns exhibit transport properties analogous to those\nof axisymmetric devices 6 while possessing larger threedimensional freedom. The latter allows avoiding some\nof the inherent limitations of tokamaks. The quasisymmetric stellarator achieves this by possessing a hidden\nsymmetry, namely, the magnitude of the magnetic field\n|B|is symmetric, while the full vector B is not.\nThe concept of quasisymmetry (QS) is elegant, but\nit appears to have a significant theoretical limitation. It\nwas soon realized7 that constructing a configuration with\nexact QS is not possible. 
Although there is no definitive\nproof that shows this impossibility, work on near-axis\nexpansions7,8 supports this point of view. The governing\nsystem of equations is overdetermined: there are more\nconstraints than degrees of freedom. This limitation does\nnot, however, prevent designs that exhibit behavior close\nto QS in a volume. 9,10\nRecently, studies that explore the concept of QS more\ndeeply have appeared.3,11 The main idea has been to separate the concept of QS from that of macroscopic force\nbalance. In these studies, QS is defined as a property of\nthe configuration that confers the single-particle dynamics an approximately conserved quantity without making any statement about the form of macroscopic equilibrium. This perspective differs significantly from the\nstandard approach. Prior to [3] and [12] (and with very\nfew exceptions5,13), the concept of QS was framed in the\ncontext of magnetohydrostatic (MHS) equilibria satisfying j ×B = ∇p, where j = ∇×B is the current density\nand p is the scalar pressure. As MHS is not intrinsic to\na)Author to whom correspondence should be addressed: eduardor@princeton.edu\nQS, it is important to define QS without reference to a\nparticular form of equilibrium. Separating QS from equilibrium allows us to investigate more deeply its meaning\nand limitations.\nAbandoning the convenient form of MHS equilibrium,\nalthough conceptually appropriate, comes at a cost. The\nmagnetic field no longer needs to satisfy the condition j ·∇p = 0 , implicitly assumed in most of the\nliterature1,4,6. Hence, one cannot guarantee the existence\nof Boozer coordinates 2 as presently understood, even if\nmagnetic flux surfaces exist. Boozer coordinates are particularly convenient for studying QS, as it presents the\nsymmetry in an explicit, simple form. 
This leads to a\nsearch for an analogous, convenient, but more general\ncoordinate system for quasisymmetric configurations.\nIn this paper, we construct such a coordinate system,\nwhich we call generalized Boozer coordinates (GBC).\nThis coordinate system was used to formulate the nearaxis expansion in [14], in which a QS equilibrium with\nanisotropic pressure was shown to avoid the conventional\nproblem of overdetermination. The present paper is organized as follows. First, we introduce, develop and discuss\nthis coordinate system systematically and rigorously. We\nstart by presenting a constructive proof for GBC and the\nclass of fields for which it exists. We then present the set\nof equations describing a quasisymmetric magnetic field\nin this coordinate system. This gives us the opportunity\nto gain an alternative 3,12 perspective on the distinction\nbetween the so-called weak and strong forms of quasisymmetry, as well as a comparison to axisymmetry. We close\nby summarizing the equations that link the equilibrium\nand the quasisymmetric field and concluding remarks.\nII. GENERALIZED BOOZER COORDINATES\nA. Explicitly symmetric formulation of quasisymmetry\nLet us start this section by introducing the notion of\nQS. We consider QS from the recent general perspective\narXiv:2106.08266v2 [physics.plasm-ph] 21 Sep 2021\nRodr´ ıguez et al. 2\nof single particle motion. 3,11 QS (and in particular weak\nQS) is the property of the fields that confers the motion\nof guiding-centre single particles with an approximately\nconserved quantity.3 For the dynamics of particles to exhibit this conservation it is necessary for the magnetic\nfield to have nested flux surfaces B ·∇ψ= 0 and satisfy,\n∇ψ×∇B·∇(B ·∇B) = 0. (1)\nThis form of quasisymmetry is most commonly referred\nto as the triple vector product formulation. Although\nthis is not the form that comes directly from the singleparticle analysis (that form is the one used in Sec. 
IIIA),\nit is a succinct way to impose QS on a magnetic field.\nGiven that this generalized concept of QS has a single\nparticle origin, no notion of macroscopic equilibrium is\ninvolved. Of course, from a practical point of view, any\nsteady field of interest will be in some form of force balance. With the definition of QS in (1), different forms of\nequilibria may be investigated to understand how they\ninteract with QS. One may generally refer to the macroscopic forces by F, and we do not attempt to assess their\norigin in this paper. Instead, we focus on the requirements at a fluid level. A complete view of the problem would require an investigation of the kinetic basis\nof the forces to link the fluid forces to microscopic behavior. This is particularly important as microscopic\nforces could break the symmetry (and with it, the QSrelated conserved momenta). An important example is\nthe electrostatic potential, whose symmetry is needed to\nthe appropriate order and imposes constraints even on\nthe forces that arise from the plasma, such as centrifugal\nforce or anisotropic pressure. With this in mind, we shall\nfocus on the macroscopic aspects.\nLooking at the statement of QS in the form presented\nin Eq. (1), the existence of a symmetry in the problem\nis not apparent. In the usual context of j ×B = ∇p,\nthis absence of obvious symmetry can be amended by\nusing Boozer coordinates. In these coordinates, a field in\nmagnetostatic equilibrium with well-defined flux surfaces\nis quasisymmetric (aside from quasipoloidal symmetry)\niff,\nB = B(ψ,θ −˜αφ), (2)\nwhere {ψ,θ,φ }are Boozer coordinates, ψthe flux surface\nlabel, θ and φ poloidal and toroidal angles respectively,\nand ˜α = N/M|M,N ∈Z. In order to employ Boozer\ncoordinates, the j·∇ψ= 0 property of MHS equilibria is\ncentral. 
Boozer coordinates have the standard straight\nfield-line coordinate Jacobian\nJ= Bφ + ιBθ\nB2 , (3)\nwhere the covariant Bθ and Bφ are flux functions,\nB = Bθ(ψ)∇θ+ Bφ(ψ)∇φ+ Bψ∇ψ.\nBoozer coordinates are widely used in stellarator theory\nand applications. The coordinates are a natural set that\nsimplify many of the governing equations, including QS.\nIn particular, construction of solutions by near-axis expansion of three-dimensional fields is most convenient in\nthese coordinates7,8,14–16.\nIn the context of our general quasisymmetric B, it is\nnot necessary to assume that j ·∇ψ = 0. However, we\ndemonstrate that any quasisymmetric field must satisfy∮\n(dl/B)(j ·∇ψ) = 0 (see Appendix A). The question\nthat naturally arises is, can we construct an appropriate straight field line coordinate system that explicitly\nexpresses the symmetry in QS in the Boozer fashion?\nThe answer is yes. To see this, we write Eq. (1) in\nstraight-field line coordinates {ψ,θ,φ } using the contravariant representation B = ∇ψ×(∇θ−ι∇φ),\n(∇ψ×∇θ∂θB+ ∇ψ×∇φ∂φB)·\n·∇(J−1(∂φB+ ι∂θB)) = 0,\nwhere J is the Jacobian associated to the coordinate\nsystem chosen. Now assume we can construct a straightfield line coordinate system with a Jacobian of the form\nJ= J(ψ,B), then the Jacobian factor may be dropped\nfrom the equation above to obtain,\n(∇ψ×∇θ∂θB+ ∇ψ×∇φ∂φB) ·∇(∂φB+ ι∂θB) = 0\ni.e.\n[\n∂φ −\n(∂φB\n∂θB\n)\n∂θ\n]\n(∂φ + ι∂θ)B = 0\nassuming that ∂θB ̸= 0. (This makes quasi-poloidally\nsymmetric solutions a special case.) From near-axis expansion we know that these solutions have a very restricted QS region3.\nCommuting operators, we obtain\n(∂φ + ι∂θ)\n(∂φB\n∂θB\n)\n∂θB = (b ·∇)\n(∂φB\n∂θB\n)\n∂θB = 0\nwhich implies that B = B(ψ,θ −˜αφ) or B = B(ψ,φ),\nwhere ˜α= −∂φB/∂θB is a flux function. 
The additional\nrequirement that ˜α is rational to avoid B = B(ψ) at\nnon-degenerate surfaces requires ˜α to be constant.3\nIn summary, if we are able to construct a straight field\nline coordinate system that has a Jacobian that can be\nwritten as a function of ψ and B only, then a field is QS\nin the weak sense if and only if B depends on a single\nlinear combination of those coordinate angles. Note that\nunder this assumption, the reverse direction of the proof\nis straightforward.\nB. Constructing generalized Boozer coordinates\nWe define generalized Boozer coordinates(GBC) as a\nset of straight field line coordinates whose Jacobian can\nbe written in the form J= Bα(ψ)/B2, where Bα is some\nflux function without requiring j ·∇ψ to vanish identically. This choice is more general than what is permitted\nby Boozer coordinates, which separately requires Bθ and\nRodr´ ıguez et al. 3\nBφ to be flux functions. We shall now constructively explore the conditions under which such a coordinate system exists.\nLet us start with a given magnetic field, assuming that\nit lies on well-defined flux surfaces. It then follows 2,17,18\nthat there exists some initial straight field line coordinate\nsystem {ψ,θ,φ }, so that B = ∇ψ×∇θ+ι∇φ×∇ψ. The\ndefinition of the angular straight field line coordinates is\narbitrary up to a transformation of the form\nθ= θ′+ ιω,\nφ= φ′+ ω.\nThe function ω = ω(ψ,θ,φ ) is a well behaved periodic\nfunction that defines a family of transformations 18. The\nperiodicity of ω preserves the toroidal and poloidal character of the two angular coordinates.\nStarting with some given straight-field line coordinate\nsystem, we want to understand how to construct a set\nwith a Jacobian of the GBC form. Thus, we need to\nknow how to transform the coordinate Jacobian induced\nby ω. This reads\nJ′−1 = ∇ψ×∇(θ−ιω) ·∇(φ−ω).\nHere {ψ,θ′,φ′}represents the newly defined straight field\nline coordinates whose associated Jacobian is J′. 
This\nequation may be recast into the form of a magnetic differential equation,\nB ·∇ω= 1\nJ − 1\nJ′. (4)\nNow, let us require19\nJ′= Bα(ψ)\nB2 . (5)\nIn order for the magnetic differential equation Eq. (4)\nto have a single-valued solution for the transformation\nfunction ω, Eq. (4) must satisfy Newcomb’s criterion 20.\nAccording to Newcomb, for a magnetic differential equation B ·∇f = s to have a single-valued solution for f,\nthe source term smust satisfy the line-integral condition\nalong a field line,\n∮\nsdl\nB = 0. (6)\nFor our problem s= 1/J− 1/J′. Start with,\n∮ 1\nJ\ndl\nB =\n∮\nB ·∇φdl\nB =\n∮\ndφ= 2πn.\nHere we have used the definition of the Jacobian in terms\nof the magnetic fieldB·∇φ= ∇ψ×∇θ·∇φ= 1/J, where\nφ and θ are, respectively, toroidal and poloidal angular\ncoordinates that increase by 2π in going around the long\nor short toroidal circular paths. We also considered the\nrotational transform ι = n/m, where n,m ∈Z and are\ncoprime.\nFIG. 1. Ribbon surface. Ribbon defined by two closely\nlying rational magnetic field lines labelled by C1 and C2.\nNow, consider Newcomb’s condition on the last term\non the right-hand-side of Eq. (4), and define the closed\nline integral Iso that,\n∮ 1\nJ′\ndl\nB = I\nBα(ψ). (7)\nFor Newcomb’s condition on Eq. (4) to hold, the following\nmust be true:\nBα(ψ) = I\n2πn. (8)\nWith this choice\n∮\nB ·∇ωdl\nB = 0. (9)\nThen there must exist a single-valued solution ωthat enacts the coordinate transformation into the set associated\nwith the Jacobian (5).\nFor Eq. (8) to hold, it is necessary for Ito be a flux\nfunction. This condition may be written in the form,\nI(ψ) =\n∮\nB ·dr, (10)\nalong a magnetic field line. Focus on the case of rational\nfield lines, for which the condition is most stringent. To\nproceed further, consider a non-self-intersecting ribbon\nover a flux surface bounded by two adjacent magnetic\nfield lines (see Fig. 1). Denote the line integrals along\neach of these lines by C1 and C2. 
We may then write,\nusing Stoke’s theorem,\n∮\nC1\nB ·dr −\n∮\nC2\nB ·dr =\n∫\nrib\nj ·dS, (11)\nwhere the surface element is taken to be perpendicular\nto the flux surface. The integral over the surface may be\nwritten as,18\n∫\nrib\nj ·dS =\n∫ α0+δα\nα0\ndα\n∮ dl\nBj ·∇ψ.\nHere α labels the field lines on the surface. Now, if Iis\nto be truly a flux function, then following (11) the last\nRodr´ ıguez et al. 4\nsurface integral must vanish. And it must do so for all\nfield line labels α0 and adjacent field lines δα. This gives\nthe necessary and sufficient condition,\n∮ dl\nBj ·∇ψ= 0. (12)\nThis subclass of magnetic fields grant the required form\nof I, and thus the single-valued solution to Eq. (4). This\nsubclass includes the stronger constraint j ·∇ψ= 0 (for\nwhich the coordinates reduce to Boozer coordinates), or\nmore to our concern, the QS scenario (see Appendix A).\nNote that magnetic fields that possess nested surfaces\nhave the property ⟨j ·∇ψ⟩= 0, following an application\nof ∇·(B ×∇ψ). However, the Newcomb condition (12)\nis more stringent.20\nRestricting ourselves to this subclass of fields, we can\nalways choose Bα as in (8). This choice might appear\nartificial, especially given the presence of the discrete n.\nHowever, we may relate this to the surface average over\nirrational flux surfaces. To see this, we write,\nBα = 1\n2πn\n∮\nB2 dl\nB = 1\n4π2n\n∫ 2π\n\ndα\n∮\nB2 dl\nB \nn turns\n=\n= 1\n4π2\n∫ 2π\n\ndα\n∮\nB2 dl\nB \n1 turn\n= V′\n4π2 ⟨B2⟩, (13)\nwhere V′ =\n∫2π\n0 dα\n∫\ndl/B = V′⟨1⟩is the usual volume\nψ derivative. Bα is, therefore, a well behaved quantity\nfor both rational and irrational surfaces.\nAs it stands, with an appropriate choice of Bα, and\nrestricting ourselves to the subclass of fields satisfying\nEq. (12), one may perform a coordinate transformation\nthat provides a Jacobian of the desired form (5). It remains to be shown that this new coordinate system is\nwell-behaved. 
By this we mean that the Jacobian does\nnot vanish nor diverge. To see this, it is most useful to\nrewrite J′in terms of (13),\nJ′= Bα\nB2 = V′\n4π2\n⟨B2⟩\nB2 . (14)\nIt is clear that J′has a definite sign, as given by the sign\nof V′. Thus, the Jacobian will never vanish nor diverge,\ngiven that B2 >0.\nC. Magnetic field in generalized Boozer coordinates\nWe have shown that under the assumption that the\nmagnetic field is quasisymmetric, we have a straightfield-line coordinate system, GBC, with a Jacobian J =\nBα(ψ)/B2. In this coordinate system and following\nSec. II A, a quasisymmetric field is one whose magnitude\ncan be expressed in GBC as B = B(ψ,θ −˜αφ). This\nform is analogous to the Boozer formulation of QS but\nis only subordinated to the configuration being QS and\nnot j ·∇ψ= 0.\nBefore proceeding to analyze some of the properties of\nQS and other governing equations in GBC, we explicitly\nwrite the covariant and contravariant forms for B. The\ncovariant form is\nB = Bθ∇θ+ (Bα −ιBθ)∇φ+ Bψ∇ψ. (15)\nThe usual covariant function Bφ has been deprecated for\nBα. This is the flux function that appears in the Jacobian\n(5). The simplicity of (15) shows why it was convenient\nto choose the Jacobian of GBC to have the particular\nform in (5). To obtain (15), it is sufficient to take its\nscalar product with the contravariant representation,\nB = ∇ψ×∇θ+ ι∇φ×∇ψ, (16)\nand capitalize on the definition of J. Compared to\nBoozer coordinates, the covariant function Bθ in GBC is\nnot necessarily a flux function. Thus, GBC is an extension of Boozer coordinates. When j ·∇ψ = 0, however,\nand as previously pointed, GBCs reduce to Boozer coordinates. So far, the forms in (15) and (16) have only\nrequired of the existence of GBC, and not of QS per se\n–other than to guarantee (12). To enforce the latter, one\nneeds to specify |B|as a symmetric function. 14,16,21\nIII. 
DESCRIBING WEAK QUASISYMMETRY IN GBC\nHaving developed GBC, let us see what this coordinate\nsystem can teach us about weak QS. We first write down\nthe complete set of equations that describe a weakly quasisymmetric magnetic field. The first relevant equation\nequates the covariant (15) and contravariant (16) forms\nof the magnetic field. The equation reads,\nBθ∇θ+ (Bα −ιBθ)∇φ+ Bψ∇ψ=\n= ∇ψ×∇θ+ ι∇φ×∇ψ. (17)\nTo specify that we are using GBC, and to introduce the\nquasisymmetric condition, we require\n∇ψ×∇θ·∇φ= Bα(ψ)\nB2(ψ,θ −˜αφ). (18)\nThis set of equations (17) and (18) is entirely selfcontained. It describes a general magnetic field (a vector field satisfying ∇· B = 0 = B ·∇ψ, without a particular form of equilibrium) that is quasisymmetric —\nno more, no less. Such equations have been recently\nused in near-axis expansions with anisotropic pressure\nequilibria.14,16 Equations (17) and (18), referred to as\nthe co(ntra)variant and Jacobian equations respectively,\nwere there expanded systematically. A more thorough\nand complete exploration of the expansion of the magnetic equations and its implications will be presented in\na separate publication.\nRodr´ ıguez et al. 5\nEquations (17) and (18) apply beyond near-axis expansions. Alternatively, one could study the behavior\nof this system in a perturbative sense but around a surface rather than the axis. 21 This could shed some light\nto standard optimization approaches to QS 9,10.\nBeyond the set of equations (17) and (18), the formulation of weak QS in terms of GBC can be used to compare\nweak QS to other (quasi)symmetric forms. This comparison to strong and axisymmetry will help frame the notion\nof weak QS in the larger space of configurations.\nA. Comparison to strong quasisymmetry\nStrong QS is a more resrictive form of QS compared to\nits weak form3,12. 
Weak QS is the necessary and sufficient\ncondition for the leading guiding centre dynamics to have\nan approximately conserved momentum to leading gyrocentre order.3 This condition can be written as in (1),\nbut is most naturally given as,\nu ·∇B = 0, (19)\nB ×u = ∇Φ(ψ), (20)\n∇·u = 0. (21)\nThe vector fieldu is defined by these equations and points\nin the direction of symmetry. In strong QS the conserved momentum for the particle dynamics is exact for\nthe first-order guiding centre Lagrangian.3,5,12 In the notation that explicitly introduces the symmetry vector u,\nstrong QS is equivalent to weak QS – i.e., Eqs. (19)-(21)–\nplus the constraint\nj ×u + ∇C = 0, (22)\nwhere C = B·u. Note that only the b component of this\nadditional constraint is contained in the weak formulation of the problem. We now explore the significance of\n(22) in the context of GBC.\nTo begin, we need to construct u. From Eqs. (19)(21),3\nu = ¯ι∇ψ×∇χ\nB ·∇χ , (23)\nwhere χ= θ−˜αφ, and it is convenient to use χas part of\nthe coordinate triplet {ψ,χ,φ }. We have made the choice\nΦ′= ¯ι so that u ·∇φ = 1, for simplicity. Scaling of the\nflux label for u leaves the weak QS conditions unchanged\n(see Appendix B for further discussion on the gauge and\nthe particular choice). This form of u is equivalent to\nEqs. (19)-(21) only if we enforce B = B(ψ,χ) and the\ncoorrdinate system B ·∇χ= ¯ιJ−1 with the Jacobian in\n(5). The parameter ¯ι= ι−˜α has been defined.\nFrom (15) and (23), we find,\nC = Bα −¯ιBθ. (24)\nThis relation (24), together with (23) and the definition\nof C can be used to write the gauge-independent form,\nB ·∇ψ×∇B = Bα −¯ιBθ\n¯ι B ·∇B. (25)\nThe magnetohydrostatic form of this equation has been\nused previously18,22.\nThe contravariant form of u together with (16) give\nsimple expressions for directional derivatives in GBC.\nThe differential operators can be simply written as partial derivatives,\nB ·∇ = J−1 (¯ι∂χ + ∂φ) , u ·∇ = ∂φ. 
(26)\nSince the Jacobian is quasisymmetric from (5), the two\noperators B·∇and u·∇commute with each other11. This\ncommutation property is made manifest in the GBC.\nThe symmetry field u can also be written in the following covariant form:\nu = uψ∇ψ+ uχ∇χ+ uφ∇φ. (27)\nTaking the scalar product with u and B we obtain\nuφ = u2, ¯ιuχ + uφ = C\nJ−1 . (28)\nTo complete the family of vectors required for the\nstrong quasisymmetric condition (22), we need a closed\nform for j in GBC. From the curl of the covariant form\nof B in Eq. (15), we obtain\nj = ∇Bψ×∇ψ+ ∇Bθ×∇χ+ ∇(Bα−¯ιBθ) ×∇φ. (29)\nUsing (29) and (24), we can show that\nj×u+∇C = (u·∇Bψ)∇ψ+ (u·∇Bθ) (∇χ−¯ι∇φ) .\n(30)\nThe simplicity of (30) is due to the choice of GBC.\nRecall that strong QS requires the expression j ×u +\n∇C to be identically zero. This means that all the terms\non the right side of (30) need to vanish. That is to say, the\ncovariant functions (Bψ,Bθ) are required to be quasisymmetric. If Bθ is quasisymmetric, then C is automatically\nso from (24). In an explicit coordinate representation,\nusing (26), we may write Bθ(ψ,χ) and Bψ(ψ,χ).\nThus, the GBC representation provides an elegant way\nto formulate strong QS, which can now be understood as\nweak quasisymmetry plus the conditions that Bψ and Bθ\nare QS. In other words, not only is B QS but so are Bθ\nand Bψ.\na. Implications for near-axis expansion . We refer\nto [14] and [16] for a detailed treatment using near-axis\nexpansions of the weakly quasisymmetric problem. The\nprocedure is based on expanding all governing equations\ndescribing the weak quasisymmetric field in some form of\nequilibrium in powers of the distance from the magnetic\naxis. To do so efficiently, both the field and equations\n–including Eqs. (17) and (18)– are expressed in GBC. It\nwas shown there that when equilibria with anisotropic\npressure are considered, the common overdetermination\nproblem7,8 that limits the expansion is overcome. 
The\nnumber of governing equations becomes the same as that\nof functions to be solved.\nRodr´ ıguez et al. 6\nExtending the expansion to strong QS is straightforward. From the discussion above, the only difference\nis that the covariant functions Bθ and Bψ are both\nquasisymmetric rather than general functions of space.\nIn practical terms, this simply leads to more restricted\nTaylor-Fourier expansions of those functions; the coefficients that were functions of φ become constants. This\nrestriction in freedom once again leads us back to the\nGarren-Boozer overdetermination problem. In fact, it\ndoes so the same way as it did in the case of MHS equilibrium. The restriction on the covariant functions imposes\nvery severe constraints on the allowed geometry. The\nonly way to escape this impasse is to assume axisymmetry (φ independent). Once again, consistent with what\nwe observed in [14], the asymmetry of the covariant representation of B appears to be vital to the construction\nof QS solutions.\nB. Comparison to axisymmetry\nWe have seen that the strong formulation of QS is more\nconstraining than its weak form. We would also like to\ncompare QS to the limiting case of axisymmetry. We\nshall think of this case as a symmetry generated by rotation in space: the system is invariant under rotations\nabout an axis. In Euclidean space, a rotation is an isometry, and it is generated by a vector field known as a\nKilling vector. Using the notion of a Killing vector, we\nwant to explore how ‘far’ the weak concept of QS is from\nthis ‘true’ symmetry.\nA measure of the departure of a symmetry generator from a Killing vector is the so-called deformation\nmetric.12 Taking u to represent the symmetry vector for\nQS, the idea is to see how far it is from being a Killing\nvector. A vector field v is Killing if and only if the deformation tensor Lvg= 0. Here Lu denotes the Lie derivative along u and g is the Euclidean metric. In 3D, this\nmay be written as,\nLug= ∇u + (∇u)T. 
(31)\nEvaluating this tensor for a quasisymmetric configuration\nshould then provide information regarding the closeness\nto an isometry. It is convenient12 to evaluate this rank-2\ntensor in a basis defined by {B,u,∇ψ}, a triplet that we\nshall take to be independent. Then,\n[\n∇u + (∇u)T]\n·B =j ×u + ∇C, (32)\n[\n∇u + (∇u)T]\n·u =∇u2 −u ×∇× u, (33)\n[\n∇u + (∇u)T]\n·∇ψ=∇ψ×∇× u + 2∇ψ·∇u, (34)\nwhere the weak quasisymmetric properties have been\nused where necessary. Equation (33) is what Burby et al.\ncall w.11 We shall explore this vector w in more detail\nafter obtaining an explicit form for Lug. Using (32)-(34),\nand projecting once again onto the non-orthogonal basis\ntriplet, we obtain\nLug=\n\n\n0 u ·∇C ∇ψ·(j ×u + ∇C)\n... u ·w w ·∇ψ\n... ... −u ·∇|∇ψ|2\n\n. (35)\nThe matrix is symmetric by construction. The content\nof its elements can be made clearer using GBC explicitly.\nThe top row, corresponding to (32), has already been\ndealt with, as it is precisely the piece corresponding to\nstrong QS. We made the observation that for this condition to be satisfied, (30) required u·∇Bθ = 0 = u·∇Bψ.\nFor the other components, the vector field w = ωu ×\nu + ∇u2, ω u = ∇×u is key. Using the covariant form\nof u we obtain the curl of the vector u in the form\nωu = ∇uψ ×∇ψ+ ∇uχ ×∇χ+ ∇uφ ×∇φ. (36)\nTaking the cross product with u, using the orthogonality\nof u with ∇ψ and ∇χ, and (24) and (28) we get\nw = (u ·∇uψ)∇ψ+ (u ·∇u2)\n(\n∇φ−1\n¯ι∇χ\n)\n+\n−J(u ·∇Bθ)∇χ, (37)\nwhich implies that\nB ·w = −¯ιu ·∇Bθ, (38)\nu ·w =u ·∇u2. (39)\nMost importantly, a vanishing w implies that the covariant components of the symmetry vector as well as Bθ are\nquasisymmetric.\nTo complete the simplification of the metric tensor, we\ninvoke B2 = ( C2 + |∇ψ|2)/u2, which follows from the\ndefinition of u. This means,\nu ·∇|∇ψ|2 = B2u ·∇u2 −u ·∇C2. (40)\nWith this coordinate representation, the dependence of\nthe various metric pieces is made explicit. 
We may then\nschematically present the dependence of Lug as follows,\nLug∼\n\n\n0 ∂φBθ ∂φBθ, ∂φBψ\n... ∂ φu2 ∂φu2, ∂φuψ\n∂φBθ\n... ... ∂φBθ, ∂φu2\n\n. (41)\nThe boxed expressions are meant to indicate that the\ncorresponding tensor component vanishes if the expressions there do. If the tensor (41) is to vanish, then the\nsymmetry vector would correspond to a rotation. This is\nnot surprising if one looks at what it means for the components in (41) to vanish. Axisymmetry is reached when\nthe covariant components of the magnetic field and the\nsymmetry vector are themselves symmetric. The latter is\nintimately related to the geometry, as we may see when\nwriting u ∝∂φx|χ,ψ.\nRodr´ ıguez et al. 7\nFrom (41), it follows that in some sense, weak QS is\nfar from being an isometry. This is so because only one\nof the components of the tensor exactly vanishes. The\nφ dependence of the functions Bψ, Bθ, u2, and uψ takes\nthe configuration away from axisymmetry. These apparent four degrees of freedom (especially those involving\nu) may not be independent and involve highly non-linear\ncombinations—they should ultimately be related through\nthe quasisymmetric magnetic equations. Anyhow, the\nfield-line dependence is key in distinguishing the weakly\nquasisymmetric form from, say, an axisymmetric tokamak.\nTo make a comparative measurement of the departure\nfrom axisymmetry, consider now the case of strong QS.\nIn this case, following (22), the first whole row (and thus\nalso column) of (41) drop. The remaining dependence\nalso simplifies, and the system is precluded from being\nan isometry, a priori, through the φ dependence of u2\nand uψ only, which is consistent with the work by Burby\net al.12\nImposing additional properties to the field may also\naffect the form of the deformation tensor. An example\nwould be a particular form of force balance. We now\nexplore how the magnetics and equilibria are linked.\nIV. 
QUASISYMMETRY AND EQUILIBRIA\nLet us consider the force balance part of the problem.\nGenerally, a magnetic equilibrium with some arbitrary\nforce F reads,\nj ×B = F. (42)\nAs we argued in Sec. II, we are concerned in this work\nwith a general fluid force F. Its connection to the microphysics is not considered. Let us express the left-hand\nside of (42) in GBC. Using the contravariant form for\nthe current (29) together with (24), we obtain\nj ×B =\n[\nB ·∇Bψ −J−1 (B′\nα −¯ι′Bθ)\n]\n∇ψ+\n+ (B ·∇Bθ) (∇χ−¯ι∇φ) , (43)\nwhich is an explicit coordinate representation of j ×B.\nThe form of (43) mirrors the form of (30). In this case,\nthe magnetic differential operators substitute the directional derivatives along u. We note that (43) does not\nhave any component along B, as can be checked by taking the dot product with B.\nThe form of (43) puts constraints on the allowable\nforms for F. As already noted, F ·B = 0 must hold\ntrue. Otherwise the system would be imbalanced, as the\nmagnetic field is unable to exerts forces alongB. Because\nof this reduction in the dimensionality of F and in view\nof (43), it is convenient to write\nF = Fψ∇ψ+ Fα(∇χ−¯ι∇φ) . (44)\nAn alternative form would be to use the contravariant\nform of B (16) in (42) to get\nF = (J ·∇χ−¯ιJ ·∇φ) ∇ψ−(J ·∇ψ) (∇χ−¯ι∇φ) .\n(45)\nSubstituting (43) and (44) into (42) we get two magnetic\ndifferential equations\nB ·∇Bψ = Fψ + J−1 (B′\nα −¯ι′Bθ) , (46a)\nB ·∇Bθ = Fα = −J ·∇ψ. (46b)\nTherefore, the generalized force-balance condition is\nequivalent to two magnetic differential equations (MDEs)\nand B ·F = 0. If solutions to these equations can be\nfound together with the magnetic equations (17) and\n(18), we will have obtained a quasisymmetric configuration in equilibrium.\nLet us describe in more detail the implications of these\nequations. First look at the simpler (46b). This equation\nhas two pieces to it. 
First, and regardless of the assumed\nform for Fα, it follows from weak QS that B ·∇Bθ =\n−j ·∇ψ (see Appendix A as well). This imposes the\ncondition (II B) on the field. Secondly, the component\nFα of the forcing F directly sets the off-surface current.\nThis means that,20\n∮ dℓ\nBFα = 0. (47)\nWe may not choose Fα arbitrarily. It must satisfy (47) if\nthe force is to be consistent with QS –condition like (12).\nThen the magnetic differential equation can be satisfied,\nand one can directly relate Bθ and Fα up to a flux function. In Fourier representation, it is clear that the φcontent of Bθ will be non-zero only if that of Fα is (and vice\nversa). In the light of (41), choosing ∂φ(j·∇ψ) = 0 brings\nthe quasisymmetric configuration closer to an isometry.\nThis freedom in the form of Bθ does not exist in the\nstrong formulation of QS, for which j·∇ψis independent\nof the field line label.\nA similar analysis is suitable for (46a). The appropriate Newcomb condition in this case is,\n∮ dℓ\nB\n[\nFψ + J−1 (B′\nα −¯ι′Bθ)\n]\n= 0. (48)\nThis condition may be understood as an averaged radial\nequilibrium equation. A similar solvability condition can\nbe found, for the special case of MHS, in [4], where the\nnotion of a QS field is presented as one that satisfies the\nNewcomb conditions. Given Eq. (48), Eq. (46a) relates\nBψ, Bθ (or Fα) and Fψ. Once again, we see the close\nrelationship between the forcing, the magnetic covariant\nrepresentation, and the deviation from axisymmetry. Aφ\ndependence on Bψ will bring a finite deviation of u from\nbeing a Killing vector. However, it will also force Fα\nand Fψ to have a φ dependence which may require very\nparticular shaping of the forces. This observation is consistent with the Constantin-Drivas-Ginsberg theorem 23,\nRodr´ ıguez et al. 8\nin which the forcing is seen to be intimately related to the\ndeviation from an isometry. 
Here the asymmetric geometry, quasisymmetry, and the forcing are all intimately\nconnected.\nWhen the magnetic differential equations imposing\nforce balance are brought together with the magnetic\nequations from the previous section, it is not obvious\nhow the system of equations is to be interpreted: what\nis to be taken as an input and what should be solved for.\nJust as an analogy, in the Grad-Shafranov equation, it is\nclear that p and F are inputs, and ψ is the output. In\nthe present problem, we have the construction of GBC\nin addition to the various magnetic covariant forms and\nthe components of F. Motivated by the treatment in [3]\n(which deals with a special case of the above), we propose as a possibility for a well-posed problem (46b) to be\nsolved for Bθ given Fα, while Fψ is the output of (46a),\nwith the function Bψ specified from the magnetic equations. It is not a trivial matter to determine a convenient\nway in which to formulate the problem. A more elaborate discussion on this procedure and its implications on\nconstructing solutions is left to future work.\nA. Ideal MHD: j ×B = ∇p(ψ)\nLet us now revisit the limit of ideal MHD without\nflows, j ×B = ∇p. More general forms will be discussed\nelsewhere, together with a more systematic treatment of\nthe quasisymmetric system of equations.\nIn ideal MHD, from j ·∇p(ψ) = 0 it follows that\nB ·∇Bθ = 0. (49)\nThus taking Fα = 0 forces Bθ into a flux function. Of\ncourse, this also means that u·∇Bθ = 0. Furthermore, as\nFψ = p′, in static ideal MHD, (43) leads to the magnetic\ndifferential equation for Bψ,\nB ·∇Bψ = p′(ψ) + J−1 (B′\nα −¯ι′Bθ) (50)\nSince Bψ must be a single-valued function the fluxsurface average of (50) gives\np′(ψ) + ⟨B2⟩\nBα\n(B′\nα −¯ι′Bθ) = 0. (51)\nIf we choose the forms ofB, p, and ι, this pins the form of\nBα down. Now, looking back to (50), every term on the\nright-hand side is quasisymmetric. 
Therefore, Bψ must\nalso be quasisymmetric if it is to satisfy the force-balance\nequation. Note that we have already recognized this constraining requirement on the form of Bψ as the origin\nof the Garren-Boozer overdetermination problem14. The\nNewcomb condition on this equation can be recognized\nas the condition to avoid Pfirsch-Schl¨ uter current singularities.\nThe simplifications due to ideal static MHD leads to\nthe vanishing of (30). Therefore, in this limit weak QS\nis identical to strong QS.\nOne can further show using (29), (15), (23) and (50),\nj = −1\n¯ι∂ψ(Bα −¯ιBθ)B −1\n¯ιp′(ψ)u, (52)\nwhere the gauge choice for u has been made Φ′= ¯ι. The\nexpression for j ought to be u-gauge independent, as it is\na physical quantity. The B piece as written is gauge independent, but the u term is not. The ¯ιfactor in the latter\nis to be interpreted as Φ ′. This equation had been obtained previously 11,12 using coordinate free, differential\nforms (see Appendix C). Two special cases of ideal MHD,\na) vacuum (j = 0) and b) force-free ( j = λ(ψ,α)B), are\nworth pointing to for their importance in plasma physics.\nFor both these cases p′(ψ) = 0. From (52) we see that for\nthe magnetic field to be curl-free (vacuum) and quasisymmetric C′(ψ) = 0; i.e., C(ψ) must be a constant. For quasisymmetric force-free fields, we must have λ= −C′(ψ).\nNote that in strong QS these conclusions follow directly\nfrom the equation j ×u+∇C = 0 with j = 0 and j = λB.\nV. CONCLUSIONS\nIn this paper, we have presented, defined, and discussed a straight field line coordinate system which is\nnatural for the representation of general-equilibria quasisymmetric magnetic fields: generalized Boozer coordinates. We proved the existence of the said coordinate system for the subset of fields for which\n∮\nj·∇ψdl/B = 0, to\nwhich quasisymmetric fields belong. 
These coordinates\nreduce to Boozer coordinates when j ·∇ψ = 0.\nThe explicit form of the symmetry in this coordinate\nrepresentation enables a simple formulation of the quasisymmetric problem. We explicitly construct the governing equations setting clearly the foundation for future investigations, including expansion 14,16 and global\napproaches. Exploiting GBC, we explicitly show the essential differences between the weak and strong formulations of QS and between quasisymmetry and axisymmetry. Weak QS generally lies far from axisymmetry, for\nwhich many functions describing the field and symmetry\nneed to be symmetric.\nWe also included a set of simple magnetic differential\nequations that fully describe equilibrium with an arbitrary macroscopic force to complete the treatment. The\nproperty of QS, together with the force-balance structure, imposes requirements on the forcing terms in the\nform of Newcomb conditions. In addition, the equations\nestablish clear connections between QS, forcing, and departures from axisymmetry.\nACKNOWLEDGEMENTS\nWe are grateful to J. Burby, N. Duigan, J. Meiss,\nand D. Ginsburg for stimulating discussions. This research is supported by grants from the Simons Foundation/SFARI (560651, AB) and DoE Contract No DE-AC02-09CH11466.\nDATA AVAILABILITY\nData sharing is not applicable to this article as no new\ndata were created or analyzed in this study.\nAppendix A: Off-surface current and QS\nIn this appendix we directly show how the triple vector\nformulation of QS in Eq. (1) imposes a constraint on the\noff-normal component of the current at every magnetic\nsurface.\nLet us start by noting that B ·∇B = f(ψ,B) is a\nconsequence of (1). In that sense, we can reshape the\ntriple vector condition in the convenient form,\n∇ψ×∇B·∇(B²/(B ·∇B)) = 0. (A1)\n(For now we shall not worry about B·∇B = 0.) We may\nexpress this as a divergence as ∇·(∇ψ×∇B) = 0. 
Separating the argument of the divergence into a component\nparallel to the magnetic field and perpendicular to it, we\nmay rewrite (A1),\n∇·\n[∇ψ×∇B·B\nB ·∇B B + ∇ψ×B\n]\n= 0. (A2)\nIt is customary to define C = ∇ψ×∇B·B/(B ·∇B).\nThis can therefore be rewritten, using ∇·B = 0,\nB ·∇C = j ·∇ψ. (A3)\nIt is then clear that the Newcomb condition 20,\n∮\nj ·∇ψdl\nB = 0, (A4)\nwith the line integral taken along closed magnetic field\nlines. For irrational field lines, this amounts to ⟨j·∇ψ⟩=\n0, which is true of all magnetic fields with flux surfaces\nB ·∇ψ = 0. The condition (A4) is a more constraining\ncondition than the flux surface average one 20.\nThe division byB·∇Bseems to be ill-defined at the extrema of the magnetic field along the field lines. However,\nthe above holds, first, arbitrarily close to these extrema,\nand thus would expect to hold by continuity. The fact\nthat the condition holds can be seen by looking at the\nfundamental formulation of QS. 3 As for the behavior of\nC, this is a physical quantity related to the width of banana orbits of bouncing particles. If deeply trapped and\nbarely trapped particles are to be kept, then physically\nC must be finite at the extrema. Once we express everything in terms of GBC, the apparent divergence of C\ndisappears as both the numerator and denominator are\nproportional to ∂χB = 0 when B ·∇B = 0.\nAppendix B: Gauge freedom in weak u\nAs noted, the definition of u by the weak quasisymmetry equations (19)-(21) is invariant to the relabelling of\nΦ. A rescaling of the flux surface label together with u\nleaves the equations invariant, and should therefore have\nno physical implication in the description of quasisymmetry. Keeping the general gauge label Φ( ψ) of (20),\nEq. (23) reads,\nu = ∇Φ ×∇χ\nB ·∇χ . (B1)\nThen it folllows,\nC = Φ′\n¯ι (Bα −¯ιBθ). (B2)\nand\nu ·∇ = Φ′\n¯ι ∂φ. (B3)\nIn Sec. III A we then went on to evaluate how the symmetry vector u as defined by weak QS compares to strong\nQS. 
To do so, we needed to evaluate (30), the third of\nthe strong QS conditions. The first thing to note is\nthat this equation is not gauge-invariant. This is, of\ncourse, an added complication to the problem. In a sense,\nthis gauge-symmetry breaking leaves no unique form for\nj ×u + ∇C, but rather a whole family.\nThis can be seen explicitly if instead of restricting ourselves to the special case Φ ′ = ¯ι, we keep Φ general.\nThen, (30) can be written in the form,\nj ×u + ∇C =\n[\nu ·∇Bψ −CΦ′\n¯ι ∂ψ\n( ¯ι\nΦ′\n)]\n∇ψ+\n+ (u ·∇Bθ) (∇χ−¯ι∇φ) . (B4)\nSo the question may be reformulated: what are the conditions for this expression to vanish? As before one piece\nis u ·∇Bθ = 0. Now, the other component has the form\nof a differential equation along the direction of symmetry.\nWe may write it explicitly as,\n∂φBψ −(Bα −¯ιBθ)∂ψln(¯ι/Φ′) = 0. (B5)\nFor the equation to be consistent it must then hold that\neither ¯ι= aΦ′where ais a constant, or\n∫\n(Bα−¯ιBθ)dφ=\n0. The latter is a condition on the physically relevant\npart of the problem. Thus, it is minimal to choose Φ′= ¯ι\nto satisfy this condition. Then we are simply left with\nu ·∇Bψ = 0, which is the choice made in the main text.\nAppendix C: Coordinate independent form of\nquasisymmetric current\nIn this appendix we present a brief derivation of (52)\nwithout using coordinates. Start from,\nB ×u = ∇Φ (C1)\nRodr´ ıguez et al. 10\nwith Φ some single-valued function. With magnetic flux\nsurfaces labelled by Φ, to which B as well as the vector\nfield u are tangent. From this condition, one may also\nwrite the symmetry vector field in the following form,\nu = 1\n|B|2 (CB −B ×∇Φ). (C2)\nHere C = B ·u is some arbitrary function of space, as\nthe constraint equation only describes the perpendicular\npart of u.\nBecause of magnetostatic equilibrium, we hav\nj ×B = ∇p (C3)\nwhere j = ∇×B is the plasma current density. Because\nB ·∇p = 0, it follows that p = p(ψ), again assuming\nthat B is irrational. 
By continuity, we argue that this\nmust be true for the rational cases as well (provided the\nrotational transform is not constant ι′̸= 0).\nBecause the plasma pressure is also a flux function, the\ncurrent densityj also lives on magnetic flux surfaces. Furthermore, requiring ∇p to be nowhere vanishing (except\nthe axis), the magnetic field and the currents are guaranteed to be non-collinear. Therefore, at every point on\na flux surface, we may define a tangent plane spanned by\nj and B.\nBecause u also exists in this subspace, we can express\nit in the chosen basis as,\nu = aj + ˜βB\nTaking the cross product of this form of u with B, and\nmaking use of (C1) and (C3), it is clear that a= a(ψ) =\n−Φ′/p′, where the prime denotes a derivative with respect to the flux ψ.\nLet us now apply (21), namely ∇·u = 0, on our form\nof u,\n∇·u = B ·∇˜β = 0 ∴ ˜β = ˜β(ψ)\nso that,\nj = −p′(ψ)\nΦ′ u + β(ψ)B. (C4)\nUsing the form (C2), we can then rewrite the expression\nfor j in the form,\nj =\n(\nβ(ψ) −p′(ψ)C\nB2Φ′\n)\nB + p′(ψ)\nB2 B ×∇ψ (C5)\nThe gauge-independent combination C/Φ′= (B ·∇ψ×\n∇B)/(B ·∇B) here is the same as appears in (25), and\nit is a flux function. There we found a closed form for\nthis combination in terms of the covariant representation\nof B in GBC. Equation (C5), together with ∇· B = 0\nand B ·∇ψ = 0 constitute the whole set of governing\nequations for the QS, MHS field. 
The problem in this\ncase can be formulated by providing β, C/Φ′ and p as\ninput flux functions and solving for B.\nAssuming that we have strong QS, β = −C′/Φ′, which yields an expression equivalent to that obtained in (52).\nThe assumption of strong QS was, however, not necessary when working with the coordinate representation.\nThe adoption of some form of covariant representation\nof B seems to be necessary to obtain such a relation.\nTo that end, GBC is convenient, though not necessary.\nIn fact, [11] employs the B♭ form (the generalisation of\nthe covariant representation of the magnetic field), without specifying coordinates, to prove (52). Note, however,\nthat the difference in the form of β is merely a matter of\nhow the flux function is defined. The problem has β or\nΦ′ as free flux function inputs.\n1J. Nührenberg and R. Zille, Physics Letters A 129, 113 (1988).\n2A. H. Boozer, The Physics of Fluids 24, 1999 (1981).\n3E. Rodríguez, P. Helander, and A. Bhattacharjee, Physics of\nPlasmas 27, 062501 (2020).\n4M. Tessarotto, J. L. Johnson, and L. J. Zheng, Physics of Plasmas 2, 4499 (1995).\n5M. Tessarotto, J. L. Johnson, R. B. White, and L. Zheng, Physics\nof Plasmas 3, 2653 (1996).\n6A. H. Boozer, Physics of Fluids 26, 496 (1983).\n7D. A. Garren and A. H. Boozer, Physics of Fluids B: Plasma\nPhysics 3, 2822 (1991).\n8M. Landreman and W. Sengupta, Journal of Plasma Physics 84,\n905840616 (2018).\n9A. Bader, M. Drevlak, D. T. Anderson, B. J. Faber, C. C. Hegna,\nK. M. Likin, J. C. Schmitt, and J. N. Talmadge, Journal of\nPlasma Physics 85, 905850508 (2019).\n10S. Henneberg, M. Drevlak, C. Nührenberg, C. Beidler,\nY. Turkin, J. Loizu, and P. Helander, Nuclear Fusion 59, 026014\n(2019).\n11J. W. Burby, N. Kallinikos, and R. S. MacKay, Journal of Physics\nA: Mathematical and Theoretical 54, 125202 (2021).\n12J. W. Burby, N. Kallinikos, and R. S. MacKay, Journal of Mathematical Physics 61, 093503 (2020).\n13J. W. Burby and H. 
Qin, Physics of Plasmas 20, 012511 (2013).\n14E. Rodríguez and A. Bhattacharjee, Physics of Plasmas 28,\n012508 (2021).\n15D. A. Garren and A. H. Boozer, Physics of Fluids B: Plasma\nPhysics 3, 2805 (1991).\n16E. Rodríguez and A. Bhattacharjee, Physics of Plasmas 28,\n012509 (2021).\n17M. D. Kruskal and R. M. Kulsrud, The Physics of Fluids 1, 265\n(1958).\n18P. Helander, Reports on Progress in Physics 77, 087001 (2014).\n19A more general form J ′ = J ′(ψ,B) could have been demanded.\nHowever, one may show in that case that the system may always\nbe cast into the form here employed.\n20W. A. Newcomb, The Physics of Fluids 2, 362 (1959).\n21W. Sengupta, E. J. Paul, H. Weitzner, and A. Bhattacharjee,\nJournal of Plasma Physics 87, 905870205 (2021).\n22E. J. Paul, T. Antonsen, M. Landreman, and W. A. Cooper,\nJournal of Plasma Physics 86, 905860103 (2020).\n23P. Constantin, T. D. Drivas, and D. Ginsberg, Journal of Plasma\nPhysics 87, 905870111 (2021).",
"[ABSTRACT]\nReal-world time-series datasets are often multivariate with complex dynamics. To capture\nthis complexity, high capacity architectures like recurrent- or attention-based sequential\ndeep learning models have become popular. However, recent work demonstrates that simple\nunivariate linear models can outperform such deep learning models on several commonly\nused academic benchmarks. Extending them, in this paper, we investigate the capabilities of\nlinear models for time-series forecasting and present Time-Series Mixer (TSMixer), a novel\narchitecture designed by stacking multi-layer perceptrons (MLPs). TSMixer is based on\nmixingoperationsalongboththetimeandfeaturedimensionstoextractinformationefficiently.\nOn popular academic benchmarks, the simple-to-implement TSMixer is comparable to\nspecialized state-of-the-art models that leverage the inductive biases of specific benchmarks.\nOn the challenging and large scale M5 benchmark, a real-world retail dataset, TSMixer\ndemonstrates superior performance compared to the state-of-the-art alternatives. Our results\nunderline the importance of efficiently utilizing cross-variate and auxiliary information for\nimproving the performance of time series forecasting. We present various analyses to shed\nlight into the capabilities of TSMixer. The design paradigms utilized in TSMixer are expected\nto open new horizons for deep learning-based time series forecasting. The implementation\nis available at:https://github.com/google-research/google-research/tree/master/\ntsmixer.\n\n[1 Introduction]\nTime series forecasting is a prevalent problem in numerous real-world use cases, such as for forecasting of\ndemand of products (Böse et al., 2017; Courty & Li, 1999), pandemic spread (Zhang & Nawata, 2018), and\ninflation rates (Capistrán et al., 2010). 
The forecastability of time series data often originates from three\nmajor aspects:\n• Persistent temporal patterns: encompassing trends and seasonal patterns, e.g., long-term inflation,\nday-of-week effects;\n• Cross-variate information: correlations between different variables, e.g., an increase in blood pressure\nassociated with a rise in body weight;\n• Auxiliary features: comprising static features and future information, e.g., product categories and\npromotional events.\n\narXiv:2303.06053v5 [cs.LG] 11 Sep 2023\nPublished in Transactions on Machine Learning Research (09/2023)\nFigure 1: TSMixer for multivariate time series forecasting. The columns of the inputs represent different\nfeatures/variates and the rows are time steps. The fully-connected operations are row-wise. TSMixer contains\ninterleaving time-mixing and feature-mixing MLPs to aggregate information. The number of mixer layers is\ndenoted as N. The time-mixing MLPs are shared across all features and the feature-mixing MLPs are shared\nacross all of the time steps. The design allows TSMixer to automatically adapt the use of both temporal and\ncross-variate information with a limited number of parameters for superior generalization. The extension with\nauxiliary information is also explored in this paper.\n\nTraditional models, such as ARIMA (Box et al., 1970), are designed for univariate time series, where only\ntemporal information is available. Therefore, they face limitations when dealing with challenging real-world\ndata, which often contains complex cross-variate information and auxiliary features. 
In contrast, numerous\ndeep learning models, particularly Transformer-based models, have been proposed due to their capacity to\ncapture both complex temporal patterns and cross-variate dependencies (Gamboa, 2017; Li et al., 2019; Wen\net al., 2017; Zhou et al., 2021; Wu et al., 2021; Lim & Zohren, 2021; Liu et al., 2022a; Zhou et al., 2022b; Liu\net al., 2022b; Zhou et al., 2022a) .\nThe natural intuition is that multivariate models, such as those based on Transformer architectures, should be\nmore effective than univariate models due to their ability to leverage cross-variate information. However, Zeng\net al. (2023) revealed that this is not always the case – Transformer-based models can indeed be significantly\nworse than simple univariate temporal linear models on many commonly used forecasting benchmarks. The\nmultivariate models seem to suffer from overfitting especially when the target time series is not correlated\nwith other covariates. This surprising finding has raised two essential questions:\n1. Does cross-variate information truly provide a benefit for time series forecasting?\n2. When cross-variate information is not beneficial, can multivariate models still perform as well as\nunivariate models?\nTo address these questions, we begin by analyzing the effectiveness of temporal linear models. Our findings\nindicate that their time-step-dependent characteristics render temporal linear models great candidates for\nlearning temporal patterns under common assumptions. Consequently, we gradually increase the capacity of\nlinear models by\n1. stacking temporal linear models with non-linearities (TMix-Only),\n2. introducing cross-variate feed-forward layers (TSMixer).\nThe resulting TSMixer alternatively applies MLPs across time and feature dimensions, conceptually corresponding totime-mixing and feature-mixing operations, efficiently capturing both temporal patterns and\ncross-variate information, as illustrated in Fig. 1. 
The residual designs ensure that TSMixer retains the\ncapacity of temporal linear models while still being able to exploit cross-variate information.\nWe evaluate TSMixer on commonly used long-term forecasting datasets (Wu et al., 2021) where univariate\nmodels have outperformed multivariate models. Our ablation study demonstrates the effectiveness of stacking\ntemporal linear models and validates that cross-variate information is less beneficial on these popular datasets,\nexplaining the superior performance of univariate models. Even so, TSMixer is on par with state-of-the-art\nunivariate models and significantly outperforms other multivariate models.\nTo demonstrate the benefit of multivariate models, we further evaluate TSMixer on the challenging M5\nbenchmark, a large-scale retail dataset used in the M-competition (Makridakis et al., 2022). M5 contains crucial\ncross-variate interactions such as sell prices (Makridakis et al., 2022). The results show that cross-variate\ninformation indeed brings significant improvement, and TSMixer can effectively leverage this information.\nFurthermore, we propose a principled design to extend TSMixer to handle auxiliary information such as static\nfeatures and future time-varying features. It aligns the different types of features into the same shape and then\napplies mixer layers on the concatenated features to leverage the interactions between them. In this more\npractical and challenging setting, TSMixer outperforms models that are popular in industrial applications,\nincluding DeepAR (Salinas et al. 2020, Amazon SageMaker) and TFT (Lim et al. 
2021, Google Cloud Vertex),\ndemonstrating its strong potential for real world impact.\nWe summarize our contributions as below:\n• We analyze the effectiveness of state-of-the-art linear models and indicate that their time-stepdependent characteristics make them great candidates for learning temporal patterns under common\nassumptions.\n\nPublished in Transactions on Machine Learning Research (09/2023)\nTable 1: Recent works in time series forecasting. Category I is univariate time series forecasting; Category II\nis multivariate time series forecasting, and Category III is time series forecasting with auxiliary information.\nIn this work, we propose TSMixer for Category II. We also extend TSMixer to leverage auxiliary information\nincluding static and future time-varying features for Category III.\nCategory\nExtrapolating Consideration of Consideration of\nModelstemporal patternscross-variate informationauxiliary features\n(i.e. multivariateness)\nI ✔\nARIMA (Box et al., 1970)\nN-BEATS (Oreshkin et al., 2020)\nLTSF-Linear (Zeng et al., 2023)\nPatchTST (Nie et al., 2023)\nII ✔ ✔\nInformer (Zhou et al., 2021)\nAutoformer (Wu et al., 2021)\nPyraformer (Liu et al., 2022a)\nFEDformer (Zhou et al., 2022b)\nNS-Transformer (Liu et al., 2022b)\nFiLM (Zhou et al., 2022a)\nTSMixer(this work)\nIII ✔ ✔ ✔\nMQRNN (Wen et al., 2017)\nDSSM (Rangapuram et al., 2018)\nDeepAR (Salinas et al., 2020)\nTFT (Lim et al., 2021)\nTSMixer-Ext(this work)\n• We propose TSMixer, an innovative architecture which retains the capacity of linear models to\ncapture temporal patterns while still being able to exploit cross-variate information.\n• We point out the potential risk of evaluating multivariate models on common long-term forecasting\nbenchmarks.\n• Our empirical studies demonstrate that TSMixer is the first multivariate model which is on par with\nunivariate models on common benchmarks and achieves state-of-the-art on a large-scale industrial\napplication where cross-variate information is 
crucial.\n\n[4 TSMixer Architecture]\nExpanding upon our finding that linear models can serve as strong candidates for capturing time dependencies,\nwe initially propose a natural enhancement by stacking linear models with non-linearities to form multi-layer\nperceptrons (MLPs). Common deep learning techniques, such as normalization and residual connections, are\napplied to facilitate efficient learning. However, this architecture does not take cross-variate information into\naccount.\nTo better leverage cross-variate information, we propose the application of MLPs in the time-domain and\nthe feature-domain in an alternating manner. The time-domain MLPs are shared across all of the features,\nwhile the feature-domain MLPs are shared across all of the time steps. This resulting model is akin to the\nMLP-Mixer architecture from computer vision (Tolstikhin et al., 2021), with time-domain and feature-domain\noperations representing time-mixing and feature-mixing operations, respectively. Consequently, we name our\nproposed architecture Time-Series Mixer (TSMixer).\nFigure 2: Illustrations of time-step-dependent and data-dependent models within a single forecasting time\nstep. Time-step-dependent: x_{t+1} = ∑_{i=1}^{t} w_i x_i. Data-dependent: x_{t+1} = ∑_{i=1}^{t} f_i(x) x_i.\nFigure 3: The architecture of TMix-Only. It is similar to TSMixer but only applies time-mixing.\nThe interleaving design between these two operations efficiently utilizes both temporal dependencies and\ncross-variate information while limiting computational complexity and model size. It allows TSMixer to use a\nlong lookback window (see Sec. 3), while maintaining the parameter growth in only O(L + C) instead of\nO(LC) if fully-connected MLPs were used. 
To better understand the utility of cross-variate information and\nfeature-mixing, we also consider a simplified variant of TSMixer that only employs time-mixing, referred to\nas TMix-Only, which consists of a residual MLP shared across each variate, as illustrated in Fig. 3. We also\npresent the extension of TSMixer to scenarios where auxiliary information about the time series is available.\n\n[6 Conclusions]\nWe propose TSMixer, a novel architecture for time series forecasting that is designed using MLPs instead of\ncommonly used RNNs and attention mechanisms to obtain superior generalization with a simple architecture.\nOur results on a wide range of real-world time series forecasting tasks demonstrate that TSMixer is highly\neffective in both long-term forecasting benchmarks for multivariate time-series, and real-world large-scale\nretail demand forecasting tasks. Notably, TSMixer is the only multivariate model that is able to achieve\nsimilar performance to univariate models in long term time series forecasting benchmarks. The TSMixer\narchitecture has significant potential for further improvement and we believe it will be useful in a wide\nrange of time series forecasting tasks. Potential future work includes further exploring the\ninterpretability of TSMixer, as well as its scalability to even larger datasets. We hope this work will pave the\nway for more innovative architectures for time series forecasting.",
"[ABSTRACT]\nLow-dimensional embeddings of nodes in large graphs have proved extremely\nuseful in a variety of prediction tasks, from content recommendation to identifying\nprotein functions. However, most existing approaches require that all nodes in the\ngraph are present during training of the embeddings; these previous approaches are\ninherently transductive and do not naturally generalize to unseen nodes. Here we\npresent GraphSAGE, a general inductive framework that leverages node feature\ninformation (e.g., text attributes) to efficiently generate node embeddings for\npreviously unseen data. Instead of training individual embeddings for each node,\nwe learn a function that generates embeddings by sampling and aggregating features\nfrom a node’s local neighborhood. Our algorithm outperforms strong baselines\non three inductive node-classification benchmarks: we classify the category of\nunseen nodes in evolving information graphs based on citation and Reddit post\ndata, and we show that our algorithm generalizes to completely unseen graphs\nusing a multi-graph dataset of protein-protein interactions.\n\n[1 Introduction]\nLow-dimensional vector embeddings of nodes in large graphs 1 have proved extremely useful as\nfeature inputs for a wide variety of prediction and graph analysis tasks [5, 11, 28, 35, 36]. The basic\nidea behind node embedding approaches is to use dimensionality reduction techniques to distill the\nhigh-dimensional information about a node’s graph neighborhood into a dense vector embedding.\nThese node embeddings can then be fed to downstream machine learning systems and aid in tasks\nsuch as node classification, clustering, and link prediction [11, 28, 35].\nHowever, previous works have focused on embedding nodes from a single fixed graph, and many\nreal-world applications require embeddings to be quickly generated for unseen nodes, or entirely new\n(sub)graphs. 
This inductive capability is essential for high-throughput, production machine learning\nsystems, which operate on evolving graphs and constantly encounter unseen nodes (e.g., posts on\nReddit, users and videos on Youtube). An inductive approach to generating node embeddings also\nfacilitates generalization across graphs with the same form of features: for example, one could train\nan embedding generator on protein-protein interaction graphs derived from a model organism, and\nthen easily produce node embeddings for data collected on new organisms using the trained model.\nThe inductive node embedding problem is especially difficult, compared to the transductive setting,\nbecause generalizing to unseen nodes requires “aligning” newly observed subgraphs to the node\nembeddings that the algorithm has already optimized on. An inductive framework must learn to\n∗The two first authors made equal contributions.\n1While it is common to refer to these data structures as social or biological networks, we use the term graph\nto avoid ambiguity with neural network terminology.\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\narXiv:1706.02216v4 [cs.SI] 10 Sep 2018\nFigure 1: Visual illustration of the GraphSAGE sample and aggregate approach.\nrecognize structural properties of a node’s neighborhood that reveal both the node’s local role in the\ngraph, as well as its global position.\nMost existing approaches to generating node embeddings are inherently transductive. The majority\nof these approaches directly optimize the embeddings for each node using matrix-factorization-based\nobjectives, and do not naturally generalize to unseen data, since they make predictions on nodes in a\nsingle, fixed graph [5, 11, 23, 28, 35, 36, 37, 39]. 
These approaches can be modified to operate in an\ninductive setting (e.g., [28]), but these modifications tend to be computationally expensive, requiring\nadditional rounds of gradient descent before new predictions can be made. There are also recent\napproaches to learning over graph structures using convolution operators that offer promise as an\nembedding methodology [17]. So far, graph convolutional networks (GCNs) have only been applied\nin the transductive setting with fixed graphs [17, 18]. In this work we both extend GCNs to the task\nof inductive unsupervised learning and propose a framework that generalizes the GCN approach to\nuse trainable aggregation functions (beyond simple convolutions).\nPresent work. We propose a general framework, called GraphSAGE (SAmple and aggreGatE), for\ninductive node embedding. Unlike embedding approaches that are based on matrix factorization,\nwe leverage node features (e.g., text attributes, node profile information, node degrees) in order to\nlearn an embedding function that generalizes to unseen nodes. By incorporating node features in the\nlearning algorithm, we simultaneously learn the topological structure of each node’s neighborhood\nas well as the distribution of node features in the neighborhood. While we focus on feature-rich\ngraphs (e.g., citation data with text attributes, biological data with functional/molecular markers), our\napproach can also make use of structural features that are present in all graphs (e.g., node degrees).\nThus, our algorithm can also be applied to graphs without node features.\nInstead of training a distinct embedding vector for each node, we train a set of aggregator functions\nthat learn to aggregate feature information from a node’s local neighborhood (Figure 1). Each\naggregator function aggregates information from a different number of hops, or search depth, away\nfrom a given node. 
At test, or inference time, we use our trained system to generate embeddings for\nentirely unseen nodes by applying the learned aggregation functions. Following previous work on\ngenerating node embeddings, we design an unsupervised loss function that allows GraphSAGE to be\ntrained without task-specific supervision. We also show that GraphSAGE can be trained in a fully\nsupervised manner.\nWe evaluate our algorithm on three node-classification benchmarks, which test GraphSAGE’s ability\nto generate useful embeddings on unseen data. We use two evolving document graphs based on\ncitation data and Reddit post data (predicting paper and post categories, respectively), and a multigraph generalization experiment based on a dataset of protein-protein interactions (predicting protein\nfunctions). Using these benchmarks, we show that our approach is able to effectively generate\nrepresentations for unseen nodes and outperform relevant baselines by a significant margin: across\ndomains, our supervised approach improves classification F1-scores by an average of 51% compared\nto using node features alone and GraphSAGE consistently outperforms a strong, transductive baseline\n[28], despite this baseline taking ∼100×longer to run on unseen nodes. We also show that the new\naggregator architectures we propose provide significant gains (7.4% on average) compared to an\naggregator inspired by graph convolutional networks [17]. Lastly, we probe the expressive capability\nof our approach and show, through theoretical analysis, that GraphSAGE is capable of learning\nstructural information about a node’s role in a graph, despite the fact that it is inherently based on\nfeatures (Section 5).\n\n[3.3 Aggregator Architectures]\nUnlike machine learning over N-D lattices (e.g., sentences, images, or 3-D volumes), a node’s\nneighbors have no natural ordering; thus, the aggregator functions in Algorithm 1 must operate over\nan unordered set of vectors. 
Ideally, an aggregator function would be symmetric (i.e., invariant to\npermutations of its inputs) while still being trainable and maintaining high representational capacity.\nThe symmetry property of the aggregation function ensures that our neural network model can\nbe trained and applied to arbitrarily ordered node neighborhood feature sets. We examined three\ncandidate aggregator functions:\nMean aggregator. Our first candidate aggregator function is the mean operator, where we simply\ntake the elementwise mean of the vectors in {hk−1\nu ,∀u ∈N(v)}. The mean aggregator is nearly\nequivalent to the convolutional propagation rule used in the transductive GCN framework [17]. In\nparticular, we can derive an inductive variant of the GCN approach by replacing lines 4 and 5 in\nAlgorithm 1 with the following:4\nhk\nv ←σ(W ·MEAN ({hk−1\nv }∪{hk−1\nu ,∀u∈N(v)}). (2)\nWe call this modified mean-based aggregatorconvolutional since it is a rough, linear approximation of\na localized spectral convolution [17]. An important distinction between this convolutional aggregator\nand our other proposed aggregators is that it does not perform the concatenation operation in line\n5 of Algorithm 1—i.e., the convolutional aggregator does concatenate the node’s previous layer\nrepresentation hk−1\nv with the aggregated neighborhood vector hk\nN(v). This concatenation can be\nviewed as a simple form of a “skip connection” [13] between the different “search depths”, or “layers”\nof the GraphSAGE algorithm, and it leads to significant gains in performance (Section 4).\nLSTM aggregator. We also examined a more complex aggregator based on an LSTM architecture\n[14]. Compared to the mean aggregator, LSTMs have the advantage of larger expressive capability.\nHowever, it is important to note that LSTMs are not inherently symmetric (i.e., they are not permutation invariant), since they process their inputs in a sequential manner. 
We adapt LSTMs to operate on\nan unordered set by simply applying the LSTMs to a random permutation of the node’s neighbors.\n3Exploring non-uniform samplers is an important direction for future work.\n4Note that this differs from Kipf et al’s exact equation by a minor normalization constant [17].\n\nPooling aggregator. The final aggregator we examine is both symmetric and trainable. In this\npooling approach, each neighbor’s vector is independently fed through a fully-connected neural\nnetwork; following this transformation, an elementwise max-pooling operation is applied to aggregate\ninformation across the neighbor set:\nAGGREGATE pool\nk = max({σ\n(\nWpoolhk\nui + b\n)\n,∀ui ∈N(v)}), (3)\nwhere max denotes the element-wise max operator and σ is a nonlinear activation function. In\nprinciple, the function applied before the max pooling can be an arbitrarily deep multi-layer perceptron, but we focus on simple single-layer architectures in this work. This approach is inspired by\nrecent advancements in applying neural network architectures to learn over general point sets [29].\nIntuitively, the multi-layer perceptron can be thought of as a set of functions that compute features for\neach of the node representations in the neighbor set. By applying the max-pooling operator to each of\nthe computed features, the model effectively captures different aspects of the neighborhood set. Note\nalso that, in principle, any symmetric vector function could be used in place of the max operator\n(e.g., an element-wise mean). 
We found no significant difference between max- and mean-pooling in\ndevelopments test and thus focused on max-pooling for the rest of our experiments.\n\n[6 Conclusion]\nWe introduced a novel approach that allows embeddings to be efficiently generated for unseen nodes.\nGraphSAGE consistently outperforms state-of-the-art baselines, effectively trades off performance\nand runtime by sampling node neighborhoods, and our theoretical analysis provides insight into\nhow our approach can learn about local graph structures. A number of extensions and potential\nimprovements are possible, such as extending GraphSAGE to incorporate directed or multi-modal\ngraphs. A particularly interesting direction for future work is exploring non-uniform neighborhood\nsampling functions, and perhaps even learning these functions as part of the GraphSAGE optimization.\nAcknowledgments\nThe authors thank Austin Benson, Aditya Grover, Bryan He, Dan Jurafsky, Alex Ratner, Marinka\nZitnik, and Daniel Selsam for their helpful discussions and comments on early drafts. The authors\nwould also like to thank Ben Johnson for his many useful questions and comments on our code and\nNikhil Mehta and Yuhui Ding for catching some minor errors in a previous version of the appendix.\nThis research has been supported in part by NSF IIS-1149837, DARPA SIMPLEX, Stanford Data\nScience Initiative, Huawei, and Chan Zuckerberg Biohub. WLH was also supported by the SAP\nStanford Graduate Fellowship and an NSERC PGS-D grant. The views and conclusions expressed\nin this material are those of the authors and should not be interpreted as necessarily representing\nthe official policies or endorsements, either expressed or implied, of the above funding agencies,\ncorporations, or the U.S. and Canadian governments."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]
This is a sentence-transformers model finetuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
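The module list above suggests a three-stage pipeline: token embeddings from the Transformer, mean pooling over tokens (pooling_mode_mean_tokens is True), and L2 normalization. The following is a minimal pure-Python sketch of the last two stages, not the library's actual implementation. The helper names (mean_pool, l2_normalize, cosine_similarity_matrix) and the 3-dimensional toy vectors are assumptions made for illustration; the real model produces 384-dimensional vectors.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    count = max(count, 1)  # guard against an all-padding input
    return [t / count for t in total]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length, as a Normalize() stage would."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def cosine_similarity_matrix(embeddings):
    """For unit-length vectors, the dot product equals cosine similarity."""
    return [[sum(a * b for a, b in zip(u, v)) for v in embeddings]
            for u in embeddings]


# Hypothetical toy inputs: two "sentences" with 3-dimensional token vectors.
tokens_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [9.0, 9.0, 9.0]]
mask_a = [1, 1, 0]  # the last token is padding and is ignored
tokens_b = [[0.0, 0.0, 2.0], [0.0, 0.0, 4.0]]
mask_b = [1, 1]

emb_a = l2_normalize(mean_pool(tokens_a, mask_a))
emb_b = l2_normalize(mean_pool(tokens_b, mask_b))
sims = cosine_similarity_matrix([emb_a, emb_b])
print(len(sims), len(sims[0]))  # square matrix, one row per sentence
```

Because the final stage normalizes every embedding to unit length, a plain dot product between two embeddings already gives their cosine similarity, which is why the similarity matrix in the usage snippet is square with one row and one column per input sentence.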
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Pravallika6/cross-domain-embeddings")
# Run inference
sentences = [
'Are there any frameworks that adapt to different types of image segmentation tasks?',
'[ABSTRACT]\nWe propose a novel attention gate (AG) model for medical imaging that automatically learns to focus on target structures of varying shapes and sizes. Models\ntrained with AGs implicitly learn to suppress irrelevant regions in an input image\nwhile highlighting salient features useful for a specific task. This enables us to\neliminate the necessity of using explicit external tissue/organ localisation modules\nof cascaded convolutional neural networks (CNNs). AGs can be easily integrated\ninto standard CNN architectures such as the U-Net model with minimal computational overhead while increasing the model sensitivity and prediction accuracy.\nThe proposed Attention U-Net architecture is evaluated on two large CT abdominal\ndatasets for multi-class image segmentation. Experimental results show that AGs\nconsistently improve the prediction performance of U-Net across different datasets\nand training sizes while preserving computational efficiency. The source code for\nthe proposed architecture is publicly available.\n\n[1 Introduction]\nAutomated medical image segmentation has been extensively studied in the image analysis community\ndue to the fact that manual, dense labelling of large amounts of medical images is a tedious and\nerror-prone task. Accurate and reliable solutions are desired to increase clinical work flow efficiency\nand support decision making through fast and automatic extraction of quantitative measurements.\nWith the advent of convolutional neural networks (CNNs), near-radiologist level performance can\nbe achieved in automated medical image analysis tasks including cardiac MR segmentation [3] and\ncancerous lung nodule detection [17]. High representation power, fast inference, and filter sharing\nproperties have made CNNs the de facto standard for image segmentation. Fully convolutional\nnetworks (FCNs) [ 18] and the U-Net [ 24] are two commonly used architectures. 
Despite their\ngood representational power, these architectures rely on multi-stage cascaded CNNs when the target\norgans show large inter-patient variation in terms of shape and size. Cascaded frameworks extract a\nregion of interest (ROI) and make dense predictions on that particular ROI. The application areas\ninclude cardiac MRI [ 14], cardiac CT [ 23], abdominal CT [ 26, 27] segmentation, and lung CT\nnodule detection [17]. However, this approach leads to excessive and redundant use of computational\nresources and model parameters; for instance, similar low-level features are repeatedly extracted by\nall models within the cascade. To address this general problem, we propose a simple and yet effective\nsolution, namely attention gates(AGs). CNN models with AGs can be trained from scratch in a\nstandard way similar to the training of a FCN model, and AGs automatically learn to focus on target\n1st Conference on Medical Imaging with Deep Learning (MIDL 2018), Amsterdam, The Netherlands.\narXiv:1804.03999v3 [cs.CV] 20 May 2018\nstructures without additional supervision. At test time, these gates generate soft region proposals\nimplicitly on-the-fly and highlight salient features useful for a specific task. Moreover, they do not\nintroduce significant computational overhead and do not require a large number of model parameters\nas in the case of multi-model frameworks. In return, the proposed AGs improve model sensitivity and\naccuracy for dense label predictions by suppressing feature activations in irrelevant regions. In this\nway, the necessity of using an external organ localisation model can be eliminated while maintaining\nthe high prediction accuracy. Similar attention mechanisms have been proposed for natural image\nclassification [11] and captioning [1] to perform adaptive feature pooling, where model predictions\nare conditioned only on a subset of selected image regions. 
In this paper, we generalise this design\nand propose image-grid based gating that allows attention coefficients to be specific to local regions.\nMoreover, our approach can be used for attention-based dense predictions.\nWe demonstrate the implementation of AG in a standard U-Net architecture (Attention U-Net) and\napply it to medical images. We choose the challenging CT pancreas segmentation problem to provide\nexperimental evidence for our proposed contributions. This problem constitutes a difficult task due to\nlow tissue contrast and large variability in organ shape and size. We evaluate our implementation on\ntwo commonly used benchmarks: TCIA Pancreas CT-82 [25] and multi-class abdominal CT-150.\nThe results show that AGs consistenly improve prediction accuracy across different datasets and\ntraining sizes while achieving state-of-the-art performance without requiring multiple CNN models.\n\n[2 Methodology]\nFully Convolutional Network (FCN):Convolutional neural networks (CNNs) outperform traditional approaches in medical image analysis on public benchmark datasets [14, 17] while being an\norder of magnitude faster than, e.g., graph-cut and multi-atlas segmentation techniques [34]. This\nis mainly attributed to the fact that (I) domain specific image features are learnt using stochastic\ngradient descent (SGD) optimisation, (II) learnt kernels are shared across all pixels, and (III) image\nconvolution operations exploit the structural information in medical images well. In particular, fully\nconvolutional networks (FCN) [ 18] such as U-Net [ 24], DeepMedic [ 13] and holistically nested\nnetworks [16, 35] have been shown to achieve robust and accurate performance in various tasks\nincluding cardiac MR [3], brain tumours [12] and abdominal CT [26, 27] image segmentation tasks.\nConvolutional layers progressively extract higher dimensional image representations ( xl) by processing local information layer by layer. 
Eventually, this separates pixels in a high dimensional\nspace according to their semantics. Through this sequential process, model predictions are conditioned on information collected from a large receptive field. Hence, feature-map xl is obtained\nat the output of layer lby sequentially applying a linear transformation followed by a non-linear\nactivation function. It is often chosen as rectified linear unit: σ1 ( xl\ni,c) = max(0 ,xl\ni,c) where iand\nc denote spatial and channel dimensions respectively. Feature activations can be formulated as:\nxl\nc = σ1\n(∑\nc′∈Fl\nxl−1\nc′ ∗kc′,c\n)\nwhere ∗denotes the convolution operation, and the spatial subscript\n(i) is omitted in the formulation for notational clarity. The function f( xl; Φ l) = x(l+1) applied in\nconvolution layer lis characterised by trainable kernel parameters Φ l. The parameters are learnt by\n\nAttention GateLorem ipsum dolor sit amet,consectetur adipisicing elit, seddo eiusmod tempor incididunt utlabore et dolore magna aliqua.\n x x x \n x x x \nReLU \nx \n \nSigmoid Resampler\nx x \nFigure 2: Schematic of the proposed additive attention gate (AG). Input features ( xl) are scaled\nwith attention coefficients (α) computed in AG. Spatial regions are selected by analysing both the\nactivations and contextual information provided by the gating signal (g) which is collected from a\ncoarser scale. Grid resampling of attention coefficients is done using trilinear interpolation.\nminimising a training objective, e.g. cross-entropy loss, using stochastic gradient descent (SGD).\nIn this paper, we build our attention model on top of a standard U-Net architecture. U-Nets are\ncommonly used for image segmentation tasks because of their good performance and efficient use\nof GPU memory. The latter advantage is mainly linked to extraction of image features at multiple\nimage scales. Coarse feature-maps capture contextual information and highlight the category and\nlocation of foreground objects. 
Feature-maps extracted at multiple scales are later merged through\nskip connections to combine coarse- and fine-level dense predictions as shown in Figure 1.\nAttention Gates for Image Analysis:To capture a sufficiently large receptive field and thus, semantic contextual information, the feature-map grid is gradually downsampled in standard CNN\narchitectures. In this way, features on the coarse spatial grid level model location and relationship\nbetween tissues at global scale. However, it remains difficult to reduce false-positive predictions for\nsmall objects that show large shape variability. In order to improve the accuracy, current segmentation\nframeworks [14, 26, 27] rely on additional preceding object localisation models to simplify the task\ninto separate localisation and subsequent segmentation steps. Here, we demonstrate that the same\nobjective can be achieved by integrating attention gates (AGs) in a standard CNN model. This does\nnot require the training of multiple models and a large number of extra model parameters. In contrast\nto the localisation model in multi-stage CNNs, AGs progressively suppress feature responses in\nirrelevant background regions without the requirement to crop a ROI between networks.\nAttention coefficients, αi ∈[0,1], identify salient image regions and prune feature responses to\npreserve only the activations relevant to the specific task as shown in Figure 3a. The output of AGs is\nthe element-wise multiplication of input feature-maps and attention coefficients: ˆxl\ni,c = xl\ni,c ·αl\ni. In\na default setting, a single scalar attention value is computed for each pixel vector xl\ni ∈RFl where\nFl corresponds to the number of feature-maps in layer l. In case of multiple semantic classes, we\npropose to learn multi-dimensional attention coefficients. This is inspired by [ 29], where multidimensional attention coefficients are used to learn sentence embeddings. 
Thus, each AG learns to\nfocus on a subset of target structures. As shown in Figure 2, a gating vector gi ∈RFg is used for\neach pixel ito determine focus regions. The gating vector contains contextual information to prune\nlower-level feature responses as suggested in [32], which uses AGs for natural image classification.\nWe use additive attention [2] to obtain the gating coefficient. Although this is computationally more\nexpensive, it has experimentally shown to achieve higher accuracy than multiplicative attention [19].\nAdditive attention is formulated as follows:\nql\natt = ψT (\nσ1 ( WT\nx xl\ni + WT\ng gi + bg)\n)\n+ bψ (1)\nαl\ni = σ2( ql\natt(xl\ni, gi; Θatt) ), (2)\nwhere σ2(xi,c) = 1\n1+exp(−xi,c) correspond to sigmoid activation function. AG is characterised\nby a set of parameters Θatt containing: linear transformations Wx ∈RFl×Fint, Wg ∈RFg×Fint,\nψ ∈RFint×1 and bias terms bψ ∈R , bg ∈RFint. The linear transformations are computed using\nchannel-wise 1x1x1 convolutions for the input tensors. In other contexts [33], this is referred to as\nvector concatenation-based attention, where the concatenated features xl and gare linearly mapped\nto a RFint dimensional intermediate space. In image captioning [1] and classification [11] tasks, the\n\nFigure 3(a): From left to right (a-e, f-j): Axial and sagittal views of a\n3D abdominal CT scan, attention coefficients, feature activations of\na skip connection before and after gating. Similarly, (k-n) visualise\nthe gating on a coarse scale skip connection. The filtered feature\nactivations (d-e, i-j) are collected from multiple AGs, where a subset\nof organs is selected by each gate. Activations shown in (d-e, i-j)\nconsistently correspond to specific structures across different scans.\nFigure 3(b): The ground-truth pancreas\nsegmentation (a) is highlighted in blue\n(b). Similarly, U-Net model prediction\n(c) and the predictions obtained with Attention U-Net (d) are shown. 
The missed\ndense predictions by U-Net are highlighted with red arrows.\nsoftmax activation function is used to normalise the attention coefficients (σ2); however, sequential\nuse of softmax yields sparser activations at the output. For this reason, we choose a sigmoid activation\nfunction. This results experimentally in better training convergence for the AG parameters. In contrast\nto [11] we propose a grid-attention technique. In this case, gating signal is not a global single vector\nfor all image pixels but a grid signal conditioned to image spatial information. More importantly, the\ngating signal for each skip connection aggregates information from multiple imaging scales, as shown\nin Figure 1, which increases the grid-resolution of the query signal and achieve better performance.\nLastly, we would like to note that AG parameters can be trained with the standard back-propagation\nupdates without a need for sampling based update methods used in hard-attention [21].\nAttention Gates in U-Net Model:The proposed AGs are incorporated into the standard U-Net\narchitecture to highlight salient features that are passed through the skip connections, see Figure\n1. Information extracted from coarse scale is used in gating to disambiguate irrelevant and noisy\nresponses in skip connections. This is performed right before the concatenation operation to merge\nonly relevant activations. Additionally, AGs filter the neuron activations during the forward pass as\nwell as during the backward pass. Gradients originating from background regions are down weighted\nduring the backward pass. This allows model parameters in shallower layers to be updated mostly\nbased on spatial regions that are relevant to a given task. 
The update rule for convolution parameters\nin layer l−1 can be formulated as follows:\n∂(ˆxl\ni)\n∂(Φl−1) = ∂\n(\nαl\nif(xl−1\ni ; Φl−1)\n)\n∂(Φl−1) = αl\ni\n∂(f(xl−1\ni ; Φl−1))\n∂(Φl−1) + ∂(αl\ni)\n∂(Φl−1) xl\ni (3)\nThe first gradient term on the right-hand side is scaled with αl\ni. In case of multi-dimensional AGs, αl\ni\ncorresponds to a vector at each grid scale. In each sub-AG, complementary information is extracted\nand fused to define the output of skip connection. To reduce the number of trainable parameters\nand computational complexity of AGs, the linear transformations are performed without any spatial\nsupport (1x1x1 convolutions) and input feature-maps are downsampled to the resolution of gating\nsignal, similar to non-local blocks [ 33]. The corresponding linear transformations decouple the\nfeature-maps and map them to lower dimensional space for the gating operation. As suggested in\n[11], low-level feature-maps, i.e. the first skip connections, are not used in the gating function since\nthey do not represent the input data in a high dimensional space. We use deep-supervision [16] to\nforce the intermediate feature-maps to be semantically discriminative at each image scale. This helps\nto ensure that attention units, at different scales, have an ability to influence the responses to a large\nrange of image foreground content. We therefore prevent dense predictions from being reconstructed\nfrom small subsets of skip connections.\n\n[4 Discussion And Conclusion]\nIn this paper, we presented a novel attention gate model applied to medical image segmentation. Our\napproach eliminates the necessity of applying an external object localisation model. The proposed\napproach is generic and modular as such it can be easily applied to image classification and regression\nproblems as in the examples of natural image analysis and machine translation. 
Experimental\nresults demonstrate that the proposed AGs are highly beneficial for tissue/organ identification and\nlocalisation. This is particularly true for variable small size organs such as the pancreas, and similar\nbehaviour is expected for global classification tasks.\nTraining behaviour of the AGs can benefit from transfer learning and multi-stage training schemes.\nFor instance, pre-trained U-Net weights can be used to initialise the attention network, and gates can\nbe trained accordingly in the fine-tuning stage. Similarly, there is a vast body of literature in machine\nlearning exploring different gating architectures. For example, highway networks [7] make use of\nresidual connections around the gate block to allow better gradient backpropagation and slightly\nsofter attention mechanisms. Although our experiments with residual connections have not provided\nany significant performance improvement, future research will focus on this aspect to obtain a better\ntraining behaviour. Lastly, we note that with the advent of improved GPU computation power and\nmemory, larger capacity 3D models can be trained with larger batch sizes without the need for image\ndownsampling. In this way, we would not need to utilise ad-hoc post-processing techniques to further\nimprove the state-of-the-art results. Similarly, the performance of Attention U-Net can be further\nenhanced by utilising fine resolution input batches without additional heuristics. Lastly, we would\nlike to thank to Salim Arslan and Dan Busbridge for their helpful comments on this work.',
'[Preamble]\nImproving language models by retrieving\nfrom trillions of tokens\nSebastian Borgeaudy, Arthur Menschy, Jordan Hoffmanny, Trevor Cai, Eliza Rutherford, Katie Millican,\nGeorge van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas,\nAurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones,\nAlbin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero,\nKaren Simonyan, Jack W. Raez, Erich Elsenzand Laurent Sifrey,z\nAll authors from DeepMind,yEqual contributions,zEqual senior authorship\nWe enhance auto-regressive language models by conditioning on document chunks retrieved from a\nlarge corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our\nRetrieval-Enhanced Transformer (R/e.sc/t.sc/r.sc/o.sc) obtains comparable performance to GPT-3 and Jurassic-1\non the Pile, despite using 25\x02fewer parameters. After fine-tuning,R/e.sc/t.sc/r.sc/o.scperformance translates to\ndownstream knowledge-intensive tasks such as question answering.R/e.sc/t.sc/r.sc/o.sccombines a frozenB/e.sc/r.sc/t.sc\nretriever, adifferentiableencoderandachunkedcross-attentionmechanismtopredicttokensbasedon\nan order of magnitude more data than what is typically consumed during training. We typically train\nR/e.sc/t.sc/r.sc/o.scfrom scratch, yet can also rapidlyR/e.sc/t.sc/r.sc/o.scfit pre-trained transformers with retrieval and still\nachieve good performance. Our work opens up new avenues for improving language models through\nexplicit memory at unprecedented scale.\n\n[1. Introduction]\nLanguage modelling (LM) is an unsupervised task that consists of modelling the probability of text,\nusually by factorising it into conditional next-token predictions𝑝¹𝑥1\x94\x93\x93\x93\x94𝑥 𝑛º= Î\n𝑖 𝑝¹𝑥𝑖j𝑥\x9d𝑖º. 
Neural\nnetworks have proven to be powerful language models, first in the form of recurrent architectures\n(Graves, 2013; Jozefowicz et al., 2016; Mikolov et al., 2010) and more recently in the form of\nTransformers (Vaswani et al., 2017), that use attention to contextualise the past. Large performance\nimprovementshavecomefromincreasingtheamountofdata, trainingcompute, ormodelparameters.\nTransformers have been scaled from100 million parameter models in seminal work to over hundred\nbillion parameters (Brown et al., 2020; Radford et al., 2019) in the last two years which has led to\nmodels that do very well on a wide array of tasks in a zero or few-shot formulation. Increasing model\nsize predictably improves performance on a wide range of downstream tasks (Kaplan et al., 2020).\nThe benefits of increasing the number of parameters come from two factors: additional computations\nat training and inference time, and increased memorization of the training data.\nIn this work, we endeavor to decouple these, by exploring efficient means of augmenting language\nmodels with a massive-scale memory without significantly increasing computations. Specifically, we\nsuggest retrieval from a large text database as a complementary path to scaling language models.\nInstead of increasing the size of the model and training on more data, we equip models with the\nability to directly access a large database to perform predictions—a semi-parametric approach. At\na high level, our Retrieval Transformer (R/e.sc/t.sc/r.sc/o.sc) model splits the input sequence into chunks and\nretrieves text similar to the previous chunk to improve the predictions in the current chunk. Existing\nretrieval for language modelling work only considers small transformers (100 millions parameters)\nand databases of limited size (up to billions of tokens) (Guu et al., 2020; Khandelwal et al., 2020;\nLewisetal.,2020;Yogatamaetal.,2021). 
Toourknowledge, ourworkisthefirsttoshowthebenefits\nof scaling the retrieval database to trillions of tokens for large parametric language models. Our main\nCorresponding authors: {sborgeaud|amensch|jordanhoffmann|sifre}@deepmind.com\narXiv:2112.04426v3 [cs.CL] 7 Feb 2022\nImproving language models by retrieving from trillions of tokens\n200 400 800 1600 7500\nNumber of Non-Embedding Params (M)\n0.7\n0.8\n0.9\n1.0C4 Eval bits-per-byte\n172M 425M 1.5B 7.5B Baseline RETRO [OFF] RETRO [ON]\n0 1 10 100 1000 10000\nRetrieval dataset (B Tokens)\n0.7\n0.8\n0.9\n1.0\n0 1 3 5 10 30 50 100\nNumber of neighbors\n0.7\n0.8\n0.9\n1.0\nFigure 1jScaling ofR/e.sc/t.sc/r.sc/o.sc. The performance gain of our retrieval models remains constant with\nmodel scale (left), and is comparable to multiplying the parameteric model size by\x1810\x02. The gain\nincreases with the size of the retrieval database (middle) and the number of retrieved neighbours\n(right) on the C4 validation set, when using up to 40 neighbours. Past this, performance begins to\ndegrade, perhaps due to the reduced quality. At evaluationR/e.sc/t.sc/r.sc/o.sccan be used without retrieval\ndata (R/e.sc/t.sc/r.sc/o.sc[OFF]), bringing limited performance degradation compared to baseline transformers.\ncontributions are the following.\n• We introduceR/e.sc/t.sc/r.sc/o.sc, a retrieval-enhanced autoregressive language model (§2.2). We use a\nchunked cross-attention module to incorporate the retrieved text (§2.4), with time complexity\nlinear in the amount of retrieved data. We show that retrieving based on a pre-trained frozen\nB/e.sc/r.sc/t.scmodel (§2.3) works at scale, removing the need for training and updating a retriever\nnetwork.\n• We show that our method scales well with model size and database size (Fig. 
1):R/e.sc/t.sc/r.sc/o.sc\nprovides a constant gain for models ranging from 150M to 7B parameters, andR/e.sc/t.sc/r.sc/o.sccan be\nimproved at evaluation time by increasing the database size and the number of retrieved neighbours. Our largest model obtains state-of-the-art results on a range of downstream evaluation\ndatasets including Wikitext103 (Merity et al., 2017) and the Pile (Gao et al., 2020) (§4). We\nshow thatR/e.sc/t.sc/r.sc/o.sccan be fine-tuned to achieve competitive performance on downstream tasks\nsuch as question answering (§4.3).\n• We propose an evaluation aware of proximity of test documents with the training set (§2.6),\naddressing the problem of test set leakage (Lee et al., 2021). This is relevant for all language\nmodels,andespeciallyforretrieval-enhancedmodelssincetheyhavedirectaccesstothetraining\ndataset during evaluation. Using this methodology, we show that the performance ofR/e.sc/t.sc/r.sc/o.sc\ncomes from both explicit neighbour copying and general knowledge extraction (§4.4).\n\n[2. Method]\nWedesignourretrieval-enhancedarchitecturetobecapableofretrievingfromadatabasewithtrillions\nof tokens. For this purpose, we retrieve at the level of contiguous tokenchunks instead of individual\ntokens which reduces storage and computation requirements by a large linear factor. Our method first\nconstructs a key-value database, where values store raw chunks of text tokens and keys are frozen\nB/e.sc/r.sc/t.scembedddings (Devlin et al., 2019). We use a frozen model to avoid having to periodically\nre-compute embeddings over the entire database during training. Each training sequence is then split\ninto chunks, which are augmented with their𝑘-nearest neighbour retrieved from the database. An\nencoder-decoder architecture integrates retrieval chunks into the model’s predictions. We summarize\nthe R/e.sc/t.sc/r.sc/o.scarchitecture in Fig. 2, and detail it in this section. 
We end the section by introducing\n\nImproving language models by retrieving from trillions of tokens\nCCA FFW\nTransformer \nEncoderRetrieval\ndataset\nFrozen kNN Retriever\nK V\nRETRO block (x L) \nNeighbours\nInput \ntokens\nChunked cross-attention (CCA)\nBERT\nBERT\nCondition\nAttending chunks\nEncoded neighbours\nCA\nCA\nATTN QEMB READ\nAttend\nEncoded neighbours\nC1\nC2\nC3\nH1\nH2\nH3\nH\nH1\n+\nH2\n+\nE1\n E2\nE1\nE2\nCA(H1\n+, E1)\nCA(H2\n+, E2)\nCCA(H, E)\nX\nFigure 2jR/e.sc/t.sc/r.sc/o.scarchitecture. Left: simplified version where a sequence of length𝑛= 12 is split\ninto𝑙 = 3 chunksofsize 𝑚 = 4. Foreachchunk, weretrieve 𝑘 = 2 neighboursof 𝑟 = 5 tokenseach. The\nretrieval pathway is shown on top.Right: Details of the interactions in theC/c.sc/a.scoperator. Causality is\nmaintained as neighbours of the first chunk only affect the last token of the first chunk and tokens\nfrom the second chunk.\na new methodology to evaluate language models when an evaluation set is partially present in the\ntraining set.\n\n[2. The Model Receives The Corresponding]\nvalues R/e.sc/t.sc¹𝐶º, ¹»𝑁1\x94𝐹1¼\x94\x93\x93\x93\x94 »𝑁𝑘\x94𝐹𝑘¼º. Both neighbour chunks and their continuations provide\nmeaningful improvements, as illustrated in our ablation study (Appendix D). We use a length64 for\nboth 𝑁𝑗 and 𝐹𝑗, thusR/e.sc/t.sc¹𝐶ºhas a shape of𝑘\x02𝑟 with 𝑟 = 128. To avoid retrieving the chunk𝐶𝑢¸1\nin the retrieval setR/e.sc/t.sc¹𝐶𝑢º, which would break causality during training, we filter out neighbours\noriginating from the same document as the training sequence𝑋.\nFor a database of𝑇 elements, we can query the approximate nearest neighbours inO¹log𝑇ºtime.\nWe use the SCaNN library (Guo et al., 2020) to achieve this. This means that we can query our\n2 trillion token database in10 ms whilst evaluating or sampling from the model; this expense is\namortized over a chunk length. 
Performing retrieval on-the-fly is too slow to keep up with the training\ncalculations—we leverage the frozen aspect of the embedding operatorB/e.sc/r.sc/t.scto precompute all\napproximate nearest neighbours and save the results as part of the data. In Fig. 9 in the Appendix, we\nshow results where we only retrieve neighbours within Wikipedia. We find that neighbours tend to\ncome from 2-3 links away from a given article whereas random articles are more than 5 links apart.\nTable1 jMassiveText. Thelastcolumnindicatesthesamplingweightduringtraining. Themultilingual\nsubsets include documents in 10 languages. The full breakdown is given in §A.1.\nSource Token count (M) Documents (M) Multilingual Sampling frequency\nWeb 977,563 1,208 Yes 55%\nBooks 3,423,740 20 No 25%\nNews 236,918 398 No 10%\nWikipedia 13,288 23 Yes 5%\nGitHub 374,952 143 No 5%\n\nImproving language models by retrieving from trillions of tokens\n2.4. R/e.sc/t.sc/r.sc/o.scmodel architecture\nOur model relies on an encoder-decoder transformer architecture, integrating the retrieved data\nthrough a cross-attention mechanism as introduced in Vaswani et al. (2017). First, the retrieved\ntokens R/e.sc/t.sc¹𝐶ºare fed into an encoder Transformer, which computes the encoded neighbours set𝐸.\nDenoting the intermediate activations by𝐻, our transformer decoder then interleavesR/e.sc/t.sc/r.sc/o.sc-blocks\nR/e.sc/t.sc/r.sc/o.sc¹𝐻\x94𝐸 ºand standard Transformer blocksLM ¹𝐻º(the hyperparameter𝑃 \x12»1\x94𝐿¼determines at\nwhich layers we use aR/e.sc/t.sc/r.sc/o.sc-block). 
These blocks are built from three different residual operators\nwith signatureℝ𝑛\x02𝑑 !ℝ𝑛\x02𝑑: a fully-connected layerF/f.sc/w.sc, the standard sequence-level self-attention\nlayer A/t.sc/t.sc/n.sc, and a chunked cross-attention layerC/c.sc/a.sc¹\x01\x94𝐸ºthat incorporates information from the\nretrieval encoder:\nR/e.sc/t.sc/r.sc/o.sc¹𝐻\x94𝐸 º, F/f.sc/w.sc¹C/c.sc/a.sc¹A/t.sc/t.sc/n.sc¹𝐻º\x94𝐸ºº\x94 and L/m.sc¹𝐻º, F/f.sc/w.sc¹A/t.sc/t.sc/n.sc¹𝐻ºº (2)\nSince F/f.sc/w.sc, A/t.sc/t.sc/n.scand C/c.sc/a.scare all autoregressive operators whose output at position𝑖 only\ndepends on ¹ℎ𝑗º𝑗6𝑖, any succession ofR/e.sc/t.sc/r.sc/o.scand /l.sc/m.sclayers, followed by a token classification\nhead defines an autoregressive log-likelihood(1). An overview of the model architecture is given in\nAlgorithm 1 and in Fig. 2. We next describe the retrieval encoder and the chunked cross-attention\nlayer in more detail, and explain how to sample fromR/e.sc/t.sc/r.sc/o.sc.\nEncodingretrievalneighbours. Foreachchunk 𝐶𝑢,the 𝑘retrievalneighbours R/e.sc/t.sc¹𝐶𝑢ºarefedinto\na bi-directional transformerE/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc, yielding the outputs𝐸𝑗\n𝑢 , E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º𝑗\x94𝐻𝑢º2 ℝ𝑟\x02𝑑0\n,\nwhere 𝑗 2 »1\x94𝑘¼indexes each neighbour. The retrieval encoder is a non-causal transformer. It\nis conditioned on𝐻𝑢, the activations of chunk𝐶𝑢, through cross-attention layers; this allows the\nrepresentations of the retrieval encoder to be modulated by the retrieving chunk in a differentiable\nway. More precisely, the encoding of the𝑗th neighbour of the𝑢th chunk, R/e.sc/t.sc¹𝐶𝑢º𝑗, depends on the\nattended activation 𝐻𝑢 , ¹ℎ¹𝑢\x001º𝑚¸𝑖º𝑖2»1\x94𝑚¼ 2ℝ𝑚\x02𝑑 of chunk𝐶𝑢 at layermin¹𝑃º. All neighbours for\nall chunks are encoded in parallel, yielding a full encoded set𝐸 , ¹𝐸𝑗\n𝑢º𝑢2»1\x94𝑙¼\x94𝑗2»1\x94𝑘¼ 2ℝ𝑙\x02𝑘\x02𝑟\x02𝑑0\n. We\ndenote 𝐸𝑢 2ℝ𝑘\x02𝑟\x02𝑑0\nas the encoded neighbours for chunk𝑢 2»1\x94𝑙¼.\nChunked cross-attention. 
To perform theC/c.sc/a.scoperation, we first split a given intermediate activation 𝐻 2ℝ𝑛\x02𝑑 into 𝑙\x001 attending chunks\n\x10\n𝐻¸\n𝑢 , ¹ℎ𝑢𝑚¸𝑖\x001º𝑖2»1\x94𝑚¼ 2ℝ𝑚\x02𝑑\n\x11\n𝑢2»1\x94𝑙\x001¼\n, as depicted on the\nright of Fig. 2.𝐻¸\n𝑢 holds the intermediary embeddings of the last token in chunk𝐶𝑢 and of the first\n𝑚\x001 tokens in𝐶𝑢¸1 2. We compute the cross-attention between𝐻¸\n𝑢 and 𝐸𝑢—the encoded retrieval\nset obtained from chunk𝐶𝑢. Attention is computed across time and across neighbours simultaneously,\nas we merge the neighbour and time dimensions of𝐸𝑢 before applying cross-attention. Since there\nis a notion of alignment between data chunks and retrieval neighbours, we use relative positional\nencodings as described in §B.1.2.\nWe concatenate the𝑙\x001 outputs of the per-chunk cross-attentions (each of shape𝑚\x02𝑑) across\ntime, and properly pad the result; we thus form the output activationC/c.sc/a.sc¹𝐻\x94𝐸 º2 ℝ𝑛\x02𝑑. Formally,\nfor each chunk𝐶𝑢 and for each token𝑖 2»1\x94𝑚¼we set\nC/c.sc/a.sc¹𝐻\x94𝐸 º𝑢𝑚¸𝑖\x001 , C/a.sc¹ℎ𝑢𝑚¸𝑖\x001\x94𝐸𝑢º\x94 (3)\n2The last token of chunk𝐶𝑢 is the first to be able to access the retrieved content𝐸𝑢 while maintaining autoregressivity\nin (1). 
Hence, there is a one token overlap between chunk𝐶𝑢 =\n\x10\n𝑥¹𝑢\x001º𝑚¸𝑖\n\x11\n𝑖2»1\x94𝑚¼\nand the corresponding attending chunk\n𝐶¸\n𝑢 , ¹𝑥𝑢𝑚¸𝑖\x001º𝑖2»1\x94𝑚¼.\n\nImproving language models by retrieving from trillions of tokens\nAlgorithm 1: Overview ofR/e.sc/t.sc/r.sc/o.scmodel architecture.\nHyperparam: 𝑃 and 𝑃enc, indices of layers with cross-attention in the decoder and encoder\nrespectively\nHyperparam: 𝐿and 𝐿enc, number of decoder layers and number of encoder layers.\nInput: 𝑋 2𝕍𝑛: sequence of tokens.¹R/e.sc/t.sc¹𝐶𝑢ºº16𝑢6𝑙: the retrieved neighbours\nOutput: 𝑂 2ℝ𝑛\x02j𝕍j: the output logits\ndef E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º16𝑢6𝑙\x94𝐻 º:\n¹𝐻𝑢º𝑢2»1\x94𝑙¼ S/p.sc/l.sc/i.sc/t.sc¹𝐻º\nfor 𝑗 2»1\x94𝑘¼\x94𝑢 2»1\x94𝑙¼do // Encoder shared across neighbours and chunks\n𝐸𝑗\n𝑢 = E/m.sc/b.scenc¹R/e.sc/t.sc¹𝐶𝑢º𝑗º // May be shared with the decoder E M B\nfor 𝑝02»1\x94𝐿enc¼do\n𝐸𝑗\n𝑢 A/t.sc/t.sc/n.scenc¹𝐸𝑗\n𝑢º // Bi-directional attention\nif 𝑝02𝑃enc then\n𝐸𝑗\n𝑢 C/a.scenc¹𝐸𝑗\n𝑢\x94𝐻𝑢º\n𝐸𝑗\n𝑢 F/f.sc/w.scenc¹𝐸𝑗\n𝑢º\nreturn 𝐸\n𝐻 E/m.sc/b.sc¹𝑋º\nfor 𝑝 2»1\x94𝐿¼do\n𝐻 A/t.sc/t.sc/n.sc¹𝐻º // Causal attention\nif 𝑝= min¹𝑃ºthen\n// The neighbour E N C O D E Ris conditioned with the decoder activations of\nthe last layer before the first cross-attention\n𝐸 = E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º16𝑢6𝑙\x94𝐻 º\nif 𝑝 2𝑃 then\n𝐻 C/c.sc/a.sc¹𝐻\x94𝐸 º\n𝐻 F/f.sc/w.sc¹𝐻º\n𝑂 R/e.sc/a.sc/d.sc¹𝐻º\nwhere C/a.scis the cross-attention residual operator over time-concatenated encoded neighbours. We\nrecall that this operator is defined in its simplest version by three parameter matrices𝐾 2ℝ𝑑\x02𝑐\x94𝑄 2\nℝ𝑑\x02𝑐 and𝑉 2ℝ𝑑\x02𝑑. For allℎ 2ℝ𝑑 and𝑌 2ℝ𝑇\x02𝑑, we define\nC/a.sc¹ℎ\x94𝑌º, softmax¹𝑌𝐾𝑄𝑇ℎº𝑌𝑉\x94 (4)\nwhere the softmax is performed on the second dimension and all products are matrix products. 
We\nuse multi-head cross-attention, and add positional encodings to the softmax(see §B.1.2).\nThe first𝑚\x001 tokens cannot attend to any neighbour of a previous chunk; at these positions, we\ndefine C/c.sc/a.scas the identity, settingC/c.sc/a.sc¹𝐻\x94𝐸 º𝑗 , ℎ𝑗 for all tokens𝑗 2»1\x94𝑚 \x001¼. Finally, the last token\nℎ𝑙𝑚 attends to the last retrieval set𝐸𝑙 and we setℎ𝑙𝑚 , C/a.sc¹ℎ𝑙𝑚\x94𝐸𝑙º(not shown in Fig. 2). Listing 1\ncontains a simplified implementation ofC/c.sc/a.sc. Note that chunked cross-attention is autoregressive:\nthe output ofC/c.sc/a.scat position𝑖depends on the sequence from tokens from0 to 𝑖that is input toC/c.sc/a.sc.\nWith R/e.sc/t.sc/r.sc/o.scmodels, even though eachC/c.sc/a.sccross-attention attends only to the neighbours of\nthe preceding chunkR/e.sc/t.sc¹𝐶𝑢\x001º, the dependencies over previous neighbours are propagated via the\nself-attentionoperations. Theactivationsofthe 𝑖th tokeninthe 𝑢th chunkthereforepotentiallydepend\nupon the set ofallprevious neighboursR/e.sc/t.sc¹𝐶𝑢0º𝑢0\x9d𝑢, without incurring the quadratic cost of cross\nattending to that set.\n\nImproving language models by retrieving from trillions of tokens\nSampling. Whensampling,attheendofachunk 𝐶𝑢,weuseSCaNNtoretrieveneighbours R/e.sc/t.sc¹𝐶𝑢º,\nbased on the embeddingB/e.sc/r.sc/t.sc¹𝐶𝑢º. The encoded neighbours𝐸𝑢 = E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢ººare then\nused to condition the generation of the next chunk𝐶𝑢¸1, which we do incrementally: overall the\ncost of sampling is thus quadratic in the size of the sampled sequence, as when sampling from\nregular Transformers; the added cost of retrieval is linear in the number of chunks𝑙, and is negligible\ncompared to the token sampling cost in practice.\n\n[2.5. 
Baseline Transformer Architecture]\nWe use a transformer (Vaswani et al., 2017) similar to the one described in (Radford et al., 2019),\nwith some minimal changes: we replace LayerNorm with RMSNorm (Zhang and Sennrich, 2019) and\nuse relative position encodings (Dai et al., 2019). As baselines, we train retrieval-free transformers\nwith 132M, 368M, 1.3B and 7.0B parameters (embedding matrices are excluded from parameter\ncounts). The hyperparameters we used are detailed in Table 2. All retrieval models use the same\nsize encoder for the retrieval data, with𝑑0= 896 and 2 layers, which roughly adds19𝑀 parameters.\nThe encoder uses relative positional encodings. The retrieval models contain oneR/e.sc/t.sc/r.sc/o.sc-block every\n3 blocks, starting from layer 6. For our smallest model,C/c.sc/a.scis applied in layers 6, 9 and 12 of the\nmain pathway and also once for query conditioning in the encoder, which adds an additional12𝑀\nparameters. The relative number of extra parameters reduces as we increase the baseline model size.\nAll models are implemented using JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020).\n\n[5. Conclusion]\nWe present Retrieval-Enhanced Transformers (R/e.sc/t.sc/r.sc/o.sc), a method for modelling arbitrary text sequences whilst retrieving from databases with trillions of tokens—scaling the data available to models\nby an order of magnitude compared to what is typically consumed during training.R/e.sc/t.sc/r.sc/o.scmodels\n\nImproving language models by retrieving from trillions of tokens\ngains do not diminish for models with up to at least 7B parameters, and correspond to non-retrieval\nmodels with 10\x02more parameters on certain datasets. On Wikitext103 and the Pile,R/e.sc/t.sc/r.sc/o.scoutperforms previous models trained on large scale datasets. 
We also show thatR/e.sc/t.sc/r.sc/o.scis competitive on\nretrieval-intensive downstream tasks such as question answering.\nR/e.sc/t.sc/r.sc/o.scmodels are flexible and can be used without retrieval at evaluation and still achieve\ncomparable performance to baseline models. Conversely, baseline models can be rapidly fine-tuned\ninto R/e.sc/t.sc/r.sc/o.scmodelstoobtainnearlythesameperformanceasiftrainedfromscratch. Carefulanalysis\nshows that only a modest fraction of the gains obtained byR/e.sc/t.sc/r.sc/o.scare due to test set leakage. In\ngeneral, we caution for such leakage in large-scale language datasets and suggest further work in\nbetter understanding the role of test set leakage in the performance of large-scale language models.\nOverall, our work demonstrates at an unprecedented scale that semi-parametric approaches can\nprovide an orthogonal, more efficient approach than raw parameter scaling as we seek to build more\npowerful language models.\nAcknowledgements\nWe would like to thank Nikolai Grigorev, Marc’aurelio Ranzato, Cyprien de Masson d’Autume, Po-Sen\nHuang, JohannesWelbl, LisaAnneHendricks, EthanPerez, JeffStanway, EricNoland, GregoryWayne,\nJohn Jumper, Julian Schrittwieser, Lorrayne Bennett, Devang Agrawal, Dani Yogatama, Susannah\nYoung, Nando de Freitas, Demis Hassabis, and Koray Kavukcuoglu for their help, advice and reviews.\nAdditionally, we would like to thank Zonglin Li, David Simcha, and the ScaNN developers for their\nhelp.\n\nImproving language models by retrieving from trillions of tokens\nTable 6jSample - Beavers are interesting animals. 
TheR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]sample quickly diverges to other\nanimalswhilethe R/e.sc/t.sc/r.sc/o.sc[O/n.sc]sampletendstostayfocusedonthebeavertopicduetoneighbourconditioning.\nPrompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]Prompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/n.sc]»𝑁1𝑢\x94𝐹1𝑢¼colored by LCP with𝐶𝑢¸1 »𝑁2𝑢\x94𝐹2𝑢¼colored by LCP with𝐶𝑢¸1\ncolored by LCP withR/e.sc/t.sc¹𝐶𝑢\x001º\nLCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5\nBeavers are interesting animals that Beavers are interesting animals that .Beaversbuildtheir lodges in pon naw them into smaller sectionsand d\nlive near rivers. They build live near rivers. They build ds they have createdin wooded areas rag them intothe water.Engineers\n.Like many thingsin nature, there ofthe PondBeaversare interesting\nis a connection between creaturesi animals because they changethe hab\nn the wild.Beaverponds causetree itat in which theylive.Beaversdo\ns todrown, butthe dead trees attra this by blockingup streams to creat\nct the great blue heron, which often e ponds. Then theybuildtheir homes\nreturn year after year. Over time, , calledlodges, inthese ponds. Bea\na beaverpond can attract more than vers’ bodies make them well-suited f\n50 nestsin a colony, calleda rooke or underwater building Special muscl\nry.An example of this canbe found es close offtheir noses, ears, and\nin the large pond off Bradford Road throats to keep the water out. Beave\nat Carter Fields nearthe Boxford l rs’ broadtails act like rudders for\nine.Chris Leahy, an expert withth steering. Their two very large, ora\ne Massachusetts Audubon Society who nge frontteeth are usedto gnaw dow\nwrote n trees. They begin buildingtheir d\nam\ndams to create ponds. Frogs are am their houses called beaverdams in , then they mean that youare veryb ar-like tail, andtwo protrudingtee\nphibians, so they can live in both l the riverbeds. They alsoliveon lan usy. 
Beaversswim easily in streams, th that are strongenoughto gnaw do\nand and water. They have great camou d.Beaversuse their strong teeth an picking uprocks and sticks to buil wn trees. Thebeaveruses trees, bra\nflage to hide from predators. The G d strong jawsto cut downtrees and d their dams. They gnaw at trees wit nches, andmudto builddams across\nolden Retriever, or Golden, is a ver branches to buildtheir homes. They h their big frontteeth to cut them riversand streams. These dams creat\ny big dog. They are very strong, and also usetheir teeth andjawsto che down. Then they useparts of the tre e a deep poolofwater in whichthe\nhave a thick coat to help them live w up food. Beaversuse their big, fl es tobuildtheir houses.Beavers ar beaverbuilds its home. A beaverhom\nin at tails toswim. They use e clever builders. They know exactly e is calleda lodge. A babybeavero\nwhat they need to buildtheir beave r “kit” remains inthe family lodge\nr dams. They use mud fromthe stream until the age oftwo.Beaverfur, kn\nto maketheir dams stay together. T own as pelt, was once highly popular\nhey use their tails to pat downthe asa trim for hatsand coats. How m\nmud.Beaversput a snug roomat the ight the popularity of beaverfur co\ntop of theirdams fortheir babies. ntributedto thecolonizationofNew\nThey store their food underwater. Be York?wwwWw\naverseat the bark fromthe\ncold areas. A kangaroo is an anima their sharp teeth to gnaw atrocks , thenthey mean that you are very b w attrees with their big frontteet\nl that is very large. It can jump ve and cut trees. Theyuse their flat t usy. Beaversswim easily in streams, h to cutthem down. Then they use pa\nry high. They can run very fast. A ails to steer whenthey areswimming picking up rocks and sticks to buil rts of the trees tobuildtheir hous\nmeerkat is a small animal that lives .In the winter, beavershaveto bui d their dams. They gnaw attrees wit es.Beavers are clever builders. The\nin Africa. 
They live in colonies of ld a house calleda beaverlodge. Th h their big frontteeth to cutthem y know exactly whatthey need to bui\nup to a hundred of them. They can c ey build ahouse that is very strong down. Then they use parts of the tre ld their beaver dams. They use mud f\nlimb trees very easily. . The wallsare made oftwigs. The r es tobuildtheir houses.Beavers ar rom the stream to maketheir dams st\noofis made e clever builders. They know exactly ay together. They use their tails to\nwhat they need to buildtheir beave pat downthe mud.Beavers put a snu\nr dams. They use mud fromthe stream g room atthe top oftheir dams for\nto maketheir dams stay together. T their babies. They store their food\nhey use their tails to pat downthe underwater. Beavers eatthe bark fro\nmud.Beavers put a snug room atthe m the treesthat they cutdown!1. W\ntop oftheir dams fortheir babies. hat isthe main ideaof thefirst pa\nThey store their food underwater. Be ragraph?.2. What isthe main ideao\navers eatthe bark fromthe f thesecond paragraph?\nA mouse is a small mammal that lives ofbranches and other treeparts. T\non land. It is a very good climber hey also use their strong jawsto cu\nand it can run very fast. Penguins t trees. They bring them to theirho\nare birds that live on Antarctica. T use. They alsouse their sharp teeth\nhey have a thick coat to keep them w to chew up thetree parts. They use\narm. Rabbits are small animals that their flat tails to swim to thetop\nlive in the ground. They oftheir house. Then they use their\nteeth andjawsto chew up thetree\n\nImproving language models by retrieving from trillions of tokens\nTable 7jSample - Hamlet, Act 1, Scene 1.The R/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]sample has correct syntax but is hallucinated,\nand ends with repetition of one character (FRANCISCO Approach me not). 
TheR/e.sc/t.sc/r.sc/o.sc[O/n.sc]sample is the\ncorrect continuation of the original text, and is robust to formatting differences between our prompt and the\nretrieved data.\nPrompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]Prompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/n.sc]»𝑁1𝑢\x94𝐹1𝑢¼colored by LCP with𝐶𝑢¸1 »𝑁2𝑢\x94𝐹2𝑢¼colored by LCP with𝐶𝑢¸1\ncolored by LCP withR/e.sc/t.sc¹𝐶𝑢\x001º\nLCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5\nACT I SCENE I. Elsinore. A pla ACT I SCENE I. Elsinore. A pla ACTI SCENE I.Elsinore. A pla ><TEXT>ACTI SCENE I.Elsin\ntform before the castle. FRANC tform before the castle. FRANC tform beforethe castle. (FRAN ore. A platform beforethe cas\nISCO at his post. Enter to him ISCO at his post. Enter to him CISCO athis post. Enter to hi tle. FRANCISCO at his post. E\nBERNARDO BERNARDO Who’s there BERNARDO BERNARDO Who’s there m BERNARDO) BERNARDO Who’s the nter to him BERNARDO BERNARDO\n? FRANCISCO Nay, answer me: st ? FRANCISCO Nay, answer me: st re? FRANCISCO Nay, answer me: Who’s there? FRANCISCO Nay, an\nand, and unfold yourself. BERN and, and unfold yourself. BERN stand, and unfold yourself. BE swer me: stand, and unfold you\nARDO ARDO RNARDO Long livethe king! FRA rself. BERNARDO Long livethe\nNCISCO Bernardo? BERNARDO He. king! FRANCISCO Bernardo? BERN\nFRANCISCO You come most carefu ARDO He. FRANCISCO You come mo\nlly upon your hour. BERNARDO ’ st carefully upon your hour. B\nTis now struck twelve; get the ERNARDO ’Tis now struck twelve\ne to bed, Francisco. FRANCISCO ; get thee to bed, Francisco.\nFor this relief much thanks: FRANCISCO For this relief much\n’tis bitter cold, And I am sic thanks: ’tis bitter cold, And\nk at heart. BERNARDO Have you I am sick at heart.\nWho calls ? FRANCISCO I am th Long livethe king! FRANCISCO Long live the king! FRANCISCO live the king! FRANCISCO Bern\ne lord here; I, Francisco, tha Bernardo? BERNARDO He. FRANCI Bernardo? BERNARDO He. FRANCI ardo? BERNARDO He. 
FRANCISCO Y\nt am sick of grief. [ Aside. B SCO You come most carefully up SCO You come most carefully up ou come most carefully upon yo\nERNARDO The king ! FRANCISCO I on your hour. BERNARDO ’Tis no on your hour. BERNARDO ’Tis no ur hour. BERNARDO ’Tis now str\nam sick of that also. BERNARD w struck twelve; get thee to b w struck twelve; get thee to b uck twelve: get thee to bed, F\nO My lord ? FRANCISCO Do not a ed, Francisco. FRANCISCO For t ed, Francisco. FRANCISCO For t rancisco. FRANCISCO For this r\npproach me. BERNARDO his relief much thanks: ’tis b his relief much thanks: ’tis b elief much thanks: ’tis bitter\nitter cold, And I am sick at h itter cold, And I am sick at h cold, And I am sick at heart.\neart. B eart.</TEXT></DOC><DOC><DO BERNARDO Haveyou had quiet g\nCNO>romeo</DOCNO><TEXT>ACTI uard? FRANCISCO Not a mouse st\nPROLOGUE Two households, bo irring. BERNARDO Well, good ni\nth alike in dignity, In fair V ght. Ifyoudo meet Horatio and\nerona, where we lay our scene, Marcellus, The rivals2 ofmy\nFrom ancient grudge break to watch, bid them make haste. FR\nnew mutiny, ANCISCO I think I hear them.—\nStand, ho! who is there?EN\nFrancisco, I would speak with ERNARDO Haveyou had quiet gua had quiet guard? FRANCISCO No ARDO Haveyouhad quiet guard?\nyou. FRANCISCO Approach me not rd? FRANCISCO Not a mouse stir t a mouse stirring. BERNARDO W FRANCISCO Not a mouse stirrin\n, but speak. BERNARDO Your han ring. BERNARDO Well, good nigh ell, goodnight. If youdo mee g. BERNARDO Well, goodnight.\nd, your voice FRANCISCO I will t. Ifyou domeet Horatio and t Horatioand Marcellus, The r Ifyou do meet Horatioand Marc\nnot hear thee speak. BERNARDO Marcellus, The rivalsofmy wa ivals ofmy watch, bid them ma ellus, The rivals2 ofmy watch\nFrancisco, your hand, I entre tch, bid them make haste. FRAN ke haste. FRANCISCO I think I , bid them make haste. FRANCIS\nat thee. FRANCISCO Approach me CISCO I think I hear them. Sta hear them. Stand, ho! 
Who’s th CO I think I hear them.— Stand\nnot. BERNARDO Francisco FRANC nd, ho! who is there?Enter ere? (Enter HORATIO and MARCEL , ho! who is there? ENTER HORA\nLUS) HORATIO Friends to this g TIO AND MARCELLUS. HORATIO Fri\nround. MARCELLUS And liegemen ends to this ground. MARCELLUS\nto the Dane. FRANCISCO Give yo And liegemen to the Dane.3 FR\nu good night. MARCELLUS O, far ANCISCO Giveyougood night. M\newell, honest soldier: Who hat ARCELLUS O, farewell, honest s\nh relieved you? FRANCISCO Bern oldier: Who hath relieved you?\nardo hasmy place. Giveyou go FRANCISCO Bernardo hath my pl\nod night.(Exit ace. Give you good night\nISCO Approach me not. BERNARDO HORATIO and MARCELLUSHORATIO\nI have a letter FRANCISCO App Friends to this ground. MARCE\nroach me not. BERNARDO For the LLUS And liegemen to the Dane.\nking. FRANCISCO Approach me n FRANCISCO Give you good night\not. BERNARDO There’s no treaso . MARCELLUS O, farewell, hones\nn in’t. FRANCISCO Approach me t soldier: Who hath relieved y\nnot. BERNARDO I will ou? FRANCISCO Bernardo hath my\nplace. Give you good night.\n\nImproving language models by retrieving from trillions of tokens',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000, 0.4898, -0.0854],
# [ 0.4898, 1.0000, -0.0389],
# [-0.0854, -0.0389, 1.0000]])
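The matrix printed above is symmetric with ones on the diagonal, which is what a pairwise cosine-similarity matrix looks like. The following is a minimal NumPy sketch of that computation. It assumes the model's configured similarity function is cosine similarity (the loss configuration in this card uses `cos_sim`). The helper name `cosine_similarity_matrix` and the toy vectors are illustrative and are not part of the sentence-transformers API.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Normalise each row to unit length so a dot product equals the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy 2-dimensional "embeddings" (hypothetical values, not model output).
toy = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, -2.0]])
sims = cosine_similarity_matrix(toy, toy)
print(np.round(sims, 4))
```

Each row is compared against every other row, so passing the same array twice yields a square matrix whose diagonal holds the self-similarities.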
Columns: sentence_0 and sentence_1

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | | |
| sentence_0 | sentence_1 |
|---|---|
| What methods do language models use to predict mutation effects on proteins? | [ABSTRACT] |
| How can I efficiently model long sequences in machine learning? | [ABSTRACT] |
| What methods exist for learning from interconnected datasets? | [ABSTRACT] |
MultipleNegativesRankingLoss with these parameters:
{
"scale": 20.0,
"similarity_fct": "cos_sim",
"gather_across_devices": false,
"directions": [
"query_to_doc"
],
"partition_mode": "joint",
"hardness_mode": null,
"hardness_strength": 0.0
}
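To make the configuration above concrete, here is a minimal NumPy sketch of an in-batch-negatives ranking loss with `scale = 20.0`, cosine similarity, and only the query-to-document direction. It is an illustration under those assumptions, not the library's implementation: the function name `mnr_loss` and the random toy embeddings are hypothetical, and options such as `partition_mode`, `hardness_mode`, and `gather_across_devices` are not modelled.

```python
import numpy as np

def mnr_loss(query_emb, doc_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities, query -> document only."""
    # L2-normalise so the dot product equals cosine similarity ("cos_sim").
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = scale * (q @ d.T)  # shape (batch, batch)
    # For query i the positive is document i; every other column in the
    # row acts as an in-batch negative.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))  # toy document embeddings
# Queries that nearly coincide with their own documents give a low loss.
aligned = mnr_loss(docs + 0.01 * rng.normal(size=docs.shape), docs)
# Reversing the order pairs every query with the wrong document.
shuffled = mnr_loss(docs[::-1], docs)
print(aligned < shuffled)
```

The scale acts as an inverse temperature: cosine similarities lie in [-1, 1], so multiplying by 20 sharpens the softmax enough for the positive pair to dominate when the embeddings are well aligned.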
Non-default hyperparameters:
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- multi_dataset_batch_sampler: round_robin

All hyperparameters:
- per_device_train_batch_size: 32
- num_train_epochs: 3
- max_steps: -1
- learning_rate: 5e-05
- lr_scheduler_type: linear
- lr_scheduler_kwargs: None
- warmup_steps: 0
- optim: adamw_torch_fused
- optim_args: None
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- optim_target_modules: None
- gradient_accumulation_steps: 1
- average_tokens_across_devices: True
- max_grad_norm: 1
- label_smoothing_factor: 0.0
- bf16: False
- fp16: False
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- use_liger_kernel: False
- liger_kernel_config: None
- use_cache: False
- neftune_noise_alpha: None
- torch_empty_cache_steps: None
- auto_find_batch_size: False
- log_on_each_node: True
- logging_nan_inf_filter: True
- include_num_input_tokens_seen: no
- log_level: passive
- log_level_replica: warning
- disable_tqdm: False
- project: huggingface
- trackio_space_id: trackio
- eval_strategy: no
- per_device_eval_batch_size: 32
- prediction_loss_only: True
- eval_on_start: False
- eval_do_concat_batches: True
- eval_use_gather_object: False
- eval_accumulation_steps: None
- include_for_metrics: []
- batch_eval_metrics: False
- save_only_model: False
- save_on_each_node: False
- enable_jit_checkpoint: False
- push_to_hub: False
- hub_private_repo: None
- hub_model_id: None
- hub_strategy: every_save
- hub_always_push: False
- hub_revision: None
- load_best_model_at_end: False
- ignore_data_skip: False
- restore_callback_states_from_checkpoint: False
- full_determinism: False
- seed: 42
- data_seed: None
- use_cpu: False
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- dataloader_prefetch_factor: None
- remove_unused_columns: True
- label_names: None
- train_sampling_strategy: random
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- ddp_backend: None
- ddp_timeout: 1800
- fsdp: []
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- deepspeed: None
- debug: []
- skip_memory_metrics: True
- do_predict: False
- resume_from_checkpoint: None
- warmup_ratio: None
- local_rank: -1
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
- router_mapping: {}
- learning_rate_mapping: {}

@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{oord2019representationlearningcontrastivepredictive,
title={Representation Learning with Contrastive Predictive Coding},
author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
year={2019},
eprint={1807.03748},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1807.03748},
}
Base model
nreimers/MiniLM-L6-H384-uncased