metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:2970
  - loss:MultipleNegativesRankingLoss
base_model: sentence-transformers/all-MiniLM-L6-v2
widget:
  - source_sentence: Can someone explain how to model dependencies in relational information?
    sentences:
      - >-
        [Preamble]

        Rodríguez et al.

        Generalized Boozer coordinates: a natural coordinate system for

        quasisymmetry

        E. Rodríguez,1,2,a) W. Sengupta,1,2 and A. Bhattacharjee1,2

        1)Department of Astrophysical Sciences, Princeton University, Princeton,
        NJ,


        2)Princeton Plasma Physics Laboratory, Princeton, NJ, 08540

        (Dated: 22 September 2021)

        We prove the existence of a straight-field-line coordinate system we call
        generalized Boozer coordinates. This

        coordinate system exists for magnetic fields with nested toroidal flux
        surfaces provided

        

        ∮ (dl/B)(j·∇ψ) = 0,

        where symbols have their usual meaning, and the integral is taken along
        closed magnetic field lines. All

        quasisymmetric fields, regardless of their associated form of equilibria,
        must satisfy this condition. This

        coordinate system presents itself as a convenient form in which to
        describe general quasisymmetric configurations and their properties.
        Insight can be gained analytically into the difference between strong and
        weak

        forms of quasisymmetry, as well as axisymmetry, and the interaction of
        quasisymmetry with different forms

        of equilibria.

        I. INTRODUCTION:

        Quasisymmetric stellarators1–3 (sometimes referred to

        as helically-symmetric4,5) constitute an attractive choice

        for magnetic confinement fusion. Theoretically, such

        designs exhibit transport properties analogous to those

        of axisymmetric devices6 while possessing larger three-dimensional
        freedom. The latter allows avoiding some

        of the inherent limitations of tokamaks. The quasisymmetric stellarator
        achieves this by possessing a hidden

        symmetry, namely, the magnitude of the magnetic field

        |B| is symmetric, while the full vector B is not.

        The concept of quasisymmetry (QS) is elegant, but

        it appears to have a significant theoretical limitation. It

        was soon realized7 that constructing a configuration with

        exact QS is not possible. Although there is no definitive

        proof that shows this impossibility, work on near-axis

        expansions7,8 supports this point of view. The governing

        system of equations is overdetermined: there are more

        constraints than degrees of freedom. This limitation does

        not, however, prevent designs that exhibit behavior close

        to QS in a volume. 9,10

        Recently, studies that explore the concept of QS more

        deeply have appeared.3,11 The main idea has been to separate the concept
        of QS from that of macroscopic force

        balance. In these studies, QS is defined as a property of

        the configuration that confers the single-particle dynamics an
        approximately conserved quantity without making any statement about the
        form of macroscopic equilibrium. This perspective differs significantly
        from the

        standard approach. Prior to [3] and [12] (and with very

        few exceptions5,13), the concept of QS was framed in the

        context of magnetohydrostatic (MHS) equilibria satisfying j ×B = ∇p,
        where j = ∇×B is the current density

        and p is the scalar pressure. As MHS is not intrinsic to

        a)Author to whom correspondence should be addressed:
        eduardor@princeton.edu

        QS, it is important to define QS without reference to a

        particular form of equilibrium. Separating QS from equilibrium allows us
        to investigate more deeply its meaning

        and limitations.

        Abandoning the convenient form of MHS equilibrium,

        although conceptually appropriate, comes at a cost. The

        magnetic field no longer needs to satisfy the condition j ·∇p = 0 ,
        implicitly assumed in most of the

        literature1,4,6. Hence, one cannot guarantee the existence

        of Boozer coordinates 2 as presently understood, even if

        magnetic flux surfaces exist. Boozer coordinates are particularly
        convenient for studying QS, as they present the

        symmetry in an explicit, simple form. This leads to a

        search for an analogous, convenient, but more general

        coordinate system for quasisymmetric configurations.

        In this paper, we construct such a coordinate system,

        which we call generalized Boozer coordinates (GBC).

        This coordinate system was used to formulate the near-axis expansion in
        [14], in which a QS equilibrium with

        anisotropic pressure was shown to avoid the conventional

        problem of overdetermination. The present paper is organized as follows.
        First, we introduce, develop and discuss

        this coordinate system systematically and rigorously. We

        start by presenting a constructive proof for GBC and the

        class of fields for which it exists. We then present the set

        of equations describing a quasisymmetric magnetic field

        in this coordinate system. This gives us the opportunity

        to gain an alternative 3,12 perspective on the distinction

        between the so-called weak and strong forms of quasisymmetry, as well as
        a comparison to axisymmetry. We close

        by summarizing the equations that link the equilibrium

        and the quasisymmetric field and concluding remarks.

        II. GENERALIZED BOOZER COORDINATES

        A. Explicitly symmetric formulation of quasisymmetry

        Let us start this section by introducing the notion of

        QS. We consider QS from the recent general perspective

        of single particle motion. 3,11 QS (and in particular weak

        QS) is the property of the fields that confers the motion

        of guiding-centre single particles with an approximately

        conserved quantity.3 For the dynamics of particles to exhibit this
        conservation it is necessary for the magnetic

        field to have nested flux surfaces B ·∇ψ= 0 and satisfy,

        ∇ψ×∇B·∇(B ·∇B) = 0. (1)

        This form of quasisymmetry is most commonly referred

        to as the triple vector product formulation. Although

        this is not the form that comes directly from the single-particle analysis (that form is the one used in Sec. III A),

        it is a succinct way to impose QS on a magnetic field.

        Given that this generalized concept of QS has a single

        particle origin, no notion of macroscopic equilibrium is

        involved. Of course, from a practical point of view, any

        steady field of interest will be in some form of force balance. With the
        definition of QS in (1), different forms of

        equilibria may be investigated to understand how they

        interact with QS. One may generally refer to the macroscopic forces by
        F, and we do not attempt to assess their

        origin in this paper. Instead, we focus on the requirements at a fluid
        level. A complete view of the problem would require an investigation of
        the kinetic basis

        of the forces to link the fluid forces to microscopic behavior. This is
        particularly important as microscopic

        forces could break the symmetry (and with it, the QS-related conserved momenta). An important example is

        the electrostatic potential, whose symmetry is needed to

        the appropriate order and imposes constraints even on

        the forces that arise from the plasma, such as centrifugal

        force or anisotropic pressure. With this in mind, we shall

        focus on the macroscopic aspects.

        Looking at the statement of QS in the form presented

        in Eq. (1), the existence of a symmetry in the problem

        is not apparent. In the usual context of j ×B = ∇p,

        this absence of obvious symmetry can be amended by

        using Boozer coordinates. In these coordinates, a field in

        magnetostatic equilibrium with well-defined flux surfaces

        is quasisymmetric (aside from quasipoloidal symmetry)

        iff,

        B = B(ψ, θ − α̃φ), (2)

        where {ψ, θ, φ} are Boozer coordinates, ψ the flux surface label, θ and φ the poloidal and toroidal angles respectively, and α̃ = N/M with M, N ∈ Z. In order to employ Boozer

        coordinates, the j·∇ψ= 0 property of MHS equilibria is

        central. Boozer coordinates have the standard straight

        field-line coordinate Jacobian

        J = (Bφ + ιBθ)/B², (3)

        where the covariant Bθ and Bφ are flux functions, B = Bθ(ψ)∇θ + Bφ(ψ)∇φ + Bψ∇ψ.

        Boozer coordinates are widely used in stellarator theory

        and applications. The coordinates are a natural set that

        simplify many of the governing equations, including QS.

        In particular, construction of solutions by near-axis expansion of
        three-dimensional fields is most convenient in

        these coordinates7,8,14–16.

        In the context of our general quasisymmetric B, it is

        not necessary to assume that j ·∇ψ = 0. However, we

        demonstrate that any quasisymmetric field must satisfy ∮ (dl/B)(j·∇ψ) = 0 (see Appendix A). The question

        that naturally arises is, can we construct an appropriate straight field
        line coordinate system that explicitly

        expresses the symmetry in QS in the Boozer fashion?

        The answer is yes. To see this, we write Eq. (1) in

        straight-field line coordinates {ψ,θ,φ } using the contravariant
        representation B = ∇ψ×(∇θ−ι∇φ),

        (∇ψ×∇θ ∂θB + ∇ψ×∇φ ∂φB) · ∇(J⁻¹(∂φB + ι∂θB)) = 0,

        where J is the Jacobian associated to the coordinate

        system chosen. Now assume we can construct a straightfield line
        coordinate system with a Jacobian of the form

        J= J(ψ,B), then the Jacobian factor may be dropped

        from the equation above to obtain,

        (∇ψ×∇θ ∂θB + ∇ψ×∇φ ∂φB) · ∇(∂φB + ι∂θB) = 0,

        i.e.,

        [∂φ − (∂φB/∂θB) ∂θ] (∂φ + ι∂θ)B = 0,

        assuming that ∂θB ̸= 0. (This makes quasi-poloidally

        symmetric solutions a special case.) From near-axis expansion we know
        that these solutions have a very restricted QS region3.

        Commuting operators, we obtain

        (∂φ + ι∂θ)(∂φB/∂θB) ∂θB = JB (b·∇)(∂φB/∂θB) ∂θB = 0,

        which implies that B = B(ψ, θ − α̃φ) or B = B(ψ, φ), where α̃ = −∂φB/∂θB is a flux function. The additional requirement that α̃ be rational, to avoid B = B(ψ) at non-degenerate surfaces, requires α̃ to be constant.3

        In summary, if we are able to construct a straight field

        line coordinate system that has a Jacobian that can be

        written as a function of ψ and B only, then a field is QS

        in the weak sense if and only if B depends on a single

        linear combination of those coordinate angles. Note that

        under this assumption, the reverse direction of the proof

        is straightforward.
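
        The chain of implications above can be restated compactly (an added summary in LaTeX notation; it only collects Eq. (1) and the steps just derived, under the same assumption ∂θB ≠ 0):

        \nabla\psi\times\nabla B\cdot\nabla(\mathbf{B}\cdot\nabla B)=0,\ \ \mathcal{J}=\mathcal{J}(\psi,B)\ \Rightarrow\ \left[\partial_\phi-\frac{\partial_\phi B}{\partial_\theta B}\,\partial_\theta\right](\partial_\phi+\iota\,\partial_\theta)B=0\ \Rightarrow\ \mathbf{b}\cdot\nabla\!\left(\frac{\partial_\phi B}{\partial_\theta B}\right)=0\ \Rightarrow\ B=B(\psi,\theta-\tilde{\alpha}\phi),\ \ \tilde{\alpha}=-\frac{\partial_\phi B}{\partial_\theta B}.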

        B. Constructing generalized Boozer coordinates

        We define generalized Boozer coordinates(GBC) as a

        set of straight field line coordinates whose Jacobian can

        be written in the form J = Bα(ψ)/B², where Bα is some flux function, without requiring j·∇ψ to vanish identically. This choice is more general than what is permitted by Boozer coordinates, which separately require Bθ and Bφ to be flux functions. We shall now constructively explore the conditions under which such a coordinate system exists.

        Let us start with a given magnetic field, assuming that

        it lies on well-defined flux surfaces. It then follows 2,17,18

        that there exists some initial straight field line coordinate

        system {ψ,θ,φ }, so that B = ∇ψ×∇θ+ι∇φ×∇ψ. The

        definition of the angular straight field line coordinates is

        arbitrary up to a transformation of the form

        θ= θ′+ ιω,

        φ= φ′+ ω.

        The function ω = ω(ψ,θ,φ ) is a well behaved periodic

        function that defines a family of transformations 18. The

        periodicity of ω preserves the toroidal and poloidal character of the
        two angular coordinates.

        Starting with some given straight-field line coordinate

        system, we want to understand how to construct a set

        with a Jacobian of the GBC form. Thus, we need to

        know how to transform the coordinate Jacobian induced

        by ω. This reads

        J′−1 = ∇ψ×∇(θ−ιω) ·∇(φ−ω).

        Here {ψ,θ′,φ′}represents the newly defined straight field

        line coordinates whose associated Jacobian is J′. This

        equation may be recast into the form of a magnetic differential
        equation,

        B·∇ω = 1/J − 1/J′. (4)

        Now, let us require19

        J′ = Bα(ψ)/B². (5)

        In order for the magnetic differential equation Eq. (4)

        to have a single-valued solution for the transformation

        function ω, Eq. (4) must satisfy Newcomb’s criterion 20.

        According to Newcomb, for a magnetic differential equation B ·∇f = s to
        have a single-valued solution for f,

        the source term s must satisfy the line-integral condition along a field line,

        ∮ s dl/B = 0. (6)

        For our problem s = 1/J − 1/J′. Start with

        ∮ (1/J) dl/B = ∮ B·∇φ dl/B = ∮ dφ = 2πn.

        Here we have used the definition of the Jacobian in terms of the magnetic field, B·∇φ = ∇ψ×∇θ·∇φ = 1/J, where

        φ and θ are, respectively, toroidal and poloidal angular

        coordinates that increase by 2π in going around the long

        or short toroidal circular paths. We also considered the

        rotational transform ι = n/m, where n,m ∈Z and are

        coprime.

        FIG. 1. Ribbon surface. Ribbon defined by two closely

        lying rational magnetic field lines labelled by C1 and C2.

        Now, consider Newcomb’s condition on the last term

        on the right-hand side of Eq. (4), and define the closed line integral I so that

        ∮ (1/J′) dl/B = I/Bα(ψ). (7)

        For Newcomb’s condition on Eq. (4) to hold, the following

        must be true:

        Bα(ψ) = I/2πn. (8)

        With this choice,

        ∮ B·∇ω dl/B = 0. (9)

        Then there must exist a single-valued solution ω that enacts the coordinate transformation into the set associated

        with the Jacobian (5).

        For Eq. (8) to hold, it is necessary for I to be a flux function. This condition may be written in the form

        I(ψ) = ∮ B·dr, (10)

        along a magnetic field line. Focus on the case of rational

        field lines, for which the condition is most stringent. To

        proceed further, consider a non-self-intersecting ribbon

        over a flux surface bounded by two adjacent magnetic

        field lines (see Fig. 1). Denote the line integrals along

        each of these lines by C1 and C2. We may then write, using Stokes' theorem,

        ∮_C1 B·dr − ∮_C2 B·dr = ∫_rib j·dS, (11)

        where the surface element is taken to be perpendicular

        to the flux surface. The integral over the surface may be

        written as,18

        ∫_rib j·dS = ∫_{α0}^{α0+δα} dα ∮ (dl/B) j·∇ψ.

        Here α labels the field lines on the surface. Now, if I is to be truly a flux function, then following (11) the last surface integral must vanish. And it must do so for all

        field line labels α0 and adjacent field lines δα. This gives

        the necessary and sufficient condition,

        ∮ (dl/B) j·∇ψ = 0. (12)

        This subclass of magnetic fields grants the required form

        of I, and thus the single-valued solution to Eq. (4). This

        subclass includes the stronger constraint j ·∇ψ= 0 (for

        which the coordinates reduce to Boozer coordinates), or

        more to our concern, the QS scenario (see Appendix A).

        Note that magnetic fields that possess nested surfaces

        have the property ⟨j ·∇ψ⟩= 0, following an application

        of ∇·(B ×∇ψ). However, the Newcomb condition (12)

        is more stringent.20

        Restricting ourselves to this subclass of fields, we can

        always choose Bα as in (8). This choice might appear

        artificial, especially given the presence of the discrete n.

        However, we may relate this to the surface average over

        irrational flux surfaces. To see this, we write,

        Bα = (1/2πn) ∮ B² dl/B = (1/4π²n) ∫₀^{2π} dα [∮ B² dl/B]_{n turns} = (1/4π²) ∫₀^{2π} dα [∮ B² dl/B]_{1 turn} = (V′/4π²) ⟨B²⟩, (13)

        where V′ = ∫₀^{2π} dα ∮ dl/B is the usual volume ψ derivative. Bα is, therefore, a well behaved quantity for both rational and irrational surfaces.

        As it stands, with an appropriate choice of Bα, and

        restricting ourselves to the subclass of fields satisfying

        Eq. (12), one may perform a coordinate transformation

        that provides a Jacobian of the desired form (5). It remains to be shown
        that this new coordinate system is

        well-behaved. By this we mean that the Jacobian does

        not vanish nor diverge. To see this, it is most useful to

        rewrite J′ in terms of (13),

        J′ = Bα/B² = (V′/4π²) ⟨B²⟩/B². (14)

        It is clear that J′ has a definite sign, as given by the sign

        of V′. Thus, the Jacobian will never vanish nor diverge,

        given that B2 >0.

        C. Magnetic field in generalized Boozer coordinates

        We have shown that under the assumption that the

        magnetic field is quasisymmetric, we have a straight-field-line coordinate system, GBC, with a Jacobian J =

        Bα(ψ)/B2. In this coordinate system and following

        Sec. II A, a quasisymmetric field is one whose magnitude

        can be expressed in GBC as B = B(ψ, θ − α̃φ). This

        form is analogous to the Boozer formulation of QS but

        is only subordinated to the configuration being QS and

        not j ·∇ψ= 0.

        Before proceeding to analyze some of the properties of

        QS and other governing equations in GBC, we explicitly

        write the covariant and contravariant forms for B. The

        covariant form is

        B = Bθ∇θ+ (Bα −ιBθ)∇φ+ Bψ∇ψ. (15)

        The usual covariant function Bφ has been replaced by Bα. This is the flux function that appears in the Jacobian

        (5). The simplicity of (15) shows why it was convenient

        to choose the Jacobian of GBC to have the particular

        form in (5). To obtain (15), it is sufficient to take its

        scalar product with the contravariant representation,

        B = ∇ψ×∇θ+ ι∇φ×∇ψ, (16)

        and capitalize on the definition of J. Compared to

        Boozer coordinates, the covariant function Bθ in GBC is not necessarily a flux function. Thus, GBC is an extension of Boozer coordinates. When j·∇ψ = 0, however,

        and as previously pointed out, GBC reduces to Boozer coordinates. So far, the forms in (15) and (16) have only required the existence of GBC, and not QS per se (other than to guarantee (12)). To enforce the latter, one needs to specify |B| as a symmetric function.14,16,21

        III. DESCRIBING WEAK QUASISYMMETRY IN GBC

        Having developed GBC, let us see what this coordinate

        system can teach us about weak QS. We first write down

        the complete set of equations that describe a weakly quasisymmetric
        magnetic field. The first relevant equation

        equates the covariant (15) and contravariant (16) forms

        of the magnetic field. The equation reads,

        Bθ∇θ + (Bα − ιBθ)∇φ + Bψ∇ψ = ∇ψ×∇θ + ι∇φ×∇ψ. (17)

        To specify that we are using GBC, and to introduce the

        quasisymmetric condition, we require

        ∇ψ×∇θ·∇φ = Bα(ψ)/B²(ψ, θ − α̃φ). (18)

        This set of equations (17) and (18) is entirely self-contained. It describes a general magnetic field (a vector field satisfying ∇·B = 0 = B·∇ψ, without a particular form of equilibrium) that is quasisymmetric, no more, no less. Such equations have been recently

        used in near-axis expansions with anisotropic pressure

        equilibria.14,16 Equations (17) and (18), referred to as

        the co(ntra)variant and Jacobian equations respectively,

        were there expanded systematically. A more thorough

        and complete exploration of the expansion of the magnetic equations and
        its implications will be presented in

        a separate publication.

        Equations (17) and (18) apply beyond near-axis expansions.
        Alternatively, one could study the behavior

        of this system in a perturbative sense but around a surface rather than the axis.21 This could shed some light on standard optimization approaches to QS.9,10

        Beyond the set of equations (17) and (18), the formulation of weak QS in
        terms of GBC can be used to compare

        weak QS to other (quasi)symmetric forms. This comparison to strong and
        axisymmetry will help frame the notion

        of weak QS in the larger space of configurations.

        A. Comparison to strong quasisymmetry

        Strong QS is a more restrictive form of QS compared to

        its weak form3,12. Weak QS is the necessary and sufficient

        condition for the leading guiding centre dynamics to have

        an approximately conserved momentum to leading gyrocentre order.3 This
        condition can be written as in (1),

        but is most naturally given as,

        u ·∇B = 0, (19)

        B ×u = ∇Φ(ψ), (20)

        ∇·u = 0. (21)

        The vector field u is defined by these equations and points

        in the direction of symmetry. In strong QS the conserved momentum for
        the particle dynamics is exact for

        the first-order guiding centre Lagrangian.3,5,12 In the notation that
        explicitly introduces the symmetry vector u,

        strong QS is equivalent to weak QS, i.e., Eqs. (19)-(21),

        plus the constraint

        j ×u + ∇C = 0, (22)

        where C = B·u. Note that only the b component of this

        additional constraint is contained in the weak formulation of the
        problem. We now explore the significance of

        (22) in the context of GBC.

        To begin, we need to construct u. From Eqs. (19)-(21),3

        u = ῑ ∇ψ×∇χ/(B·∇χ), (23)

        where χ = θ − α̃φ, and it is convenient to use χ as part of

        the coordinate triplet {ψ,χ,φ }. We have made the choice

        Φ′= ¯ι so that u ·∇φ = 1, for simplicity. Scaling of the

        flux label for u leaves the weak QS conditions unchanged

        (see Appendix B for further discussion on the gauge and

        the particular choice). This form of u is equivalent to

        Eqs. (19)-(21) only if we enforce B = B(ψ,χ) and the

        coordinate system B·∇χ = ῑJ⁻¹ with the Jacobian in (5). The parameter ῑ = ι − α̃ has been defined.

        From (15) and (23), we find,

        C = Bα − ῑBθ. (24)
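
        As a brief check of (24) (an added worked step in LaTeX notation, not part of the original text): writing the covariant form (15) in the triplet {ψ, χ, φ} and dotting it with u from (23), for which u·∇ψ = u·∇χ = 0 and u·∇φ = 1,

        \mathbf{B}=B_\theta\nabla\chi+(B_\alpha-\bar\iota B_\theta)\nabla\phi+B_\psi\nabla\psi \ \Longrightarrow\ C=\mathbf{B}\cdot\mathbf{u}=B_\alpha-\bar\iota B_\theta.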

        This relation (24), together with (23) and the definition

        of C can be used to write the gauge-independent form,

        B·∇ψ×∇B = ((Bα − ῑBθ)/ῑ) B·∇B. (25)

        The magnetohydrostatic form of this equation has been

        used previously18,22.

        The contravariant form of u together with (16) give

        simple expressions for directional derivatives in GBC.

        The differential operators can be simply written as partial derivatives,

        B ·∇ = J−1 (¯ι∂χ + ∂φ) , u ·∇ = ∂φ. (26)

        Since the Jacobian is quasisymmetric from (5), the two

        operators B·∇and u·∇commute with each other11. This

        commutation property is made manifest in the GBC.

        The symmetry field u can also be written in the following covariant form:

        u = uψ∇ψ+ uχ∇χ+ uφ∇φ. (27)

        Taking the scalar product with u and B we obtain

        uφ = u², ῑuχ + uφ = C/J⁻¹. (28)

        To complete the family of vectors required for the

        strong quasisymmetric condition (22), we need a closed

        form for j in GBC. From the curl of the covariant form

        of B in Eq. (15), we obtain

        j = ∇Bψ×∇ψ+ ∇Bθ×∇χ+ ∇(Bα−¯ιBθ) ×∇φ. (29)

        Using (29) and (24), we can show that

        j×u+∇C = (u·∇Bψ)∇ψ+ (u·∇Bθ) (∇χ−¯ι∇φ) .

        (30)

        The simplicity of (30) is due to the choice of GBC.

        Recall that strong QS requires the expression j ×u +

        ∇C to be identically zero. This means that all the terms

        on the right side of (30) need to vanish. That is to say, the

        covariant functions (Bψ, Bθ) are required to be quasisymmetric. If Bθ is quasisymmetric, then C is automatically

        so from (24). In an explicit coordinate representation,

        using (26), we may write Bθ(ψ,χ) and Bψ(ψ,χ).

        Thus, the GBC representation provides an elegant way

        to formulate strong QS, which can now be understood as

        weak quasisymmetry plus the conditions that Bθ and Bψ are QS. In other words, not only is B QS but so are Bθ and Bψ.

        a. Implications for near-axis expansion. We refer

        to [14] and [16] for a detailed treatment using near-axis

        expansions of the weakly quasisymmetric problem. The

        procedure is based on expanding all governing equations

        describing the weak quasisymmetric field in some form of

        equilibrium in powers of the distance from the magnetic

        axis. To do so efficiently, both the field and equations

        –including Eqs. (17) and (18)– are expressed in GBC. It

        was shown there that when equilibria with anisotropic

        pressure are considered, the common overdetermination

        problem7,8 that limits the expansion is overcome. The

        number of governing equations becomes the same as that

        of functions to be solved.

        Extending the expansion to strong QS is straightforward. From the
        discussion above, the only difference

        is that the covariant functions Bθ and Bψ are both

        quasisymmetric rather than general functions of space.

        In practical terms, this simply leads to more restricted

        Taylor-Fourier expansions of those functions; the coefficients that were
        functions of φ become constants. This

        restriction in freedom once again leads us back to the

        Garren-Boozer overdetermination problem. In fact, it

        does so the same way as it did in the case of MHS equilibrium. The
        restriction on the covariant functions imposes

        very severe constraints on the allowed geometry. The

        only way to escape this impasse is to assume axisymmetry (i.e., φ independent). Once again, consistent with what

        we observed in [14], the asymmetry of the covariant representation of B
        appears to be vital to the construction

        of QS solutions.

        B. Comparison to axisymmetry

        We have seen that the strong formulation of QS is more

        constraining than its weak form. We would also like to

        compare QS to the limiting case of axisymmetry. We

        shall think of this case as a symmetry generated by rotation in space:
        the system is invariant under rotations

        about an axis. In Euclidean space, a rotation is an isometry, and it is
        generated by a vector field known as a

        Killing vector. Using the notion of a Killing vector, we

        want to explore how ‘far’ the weak concept of QS is from

        this ‘true’ symmetry.

        A measure of the departure of a symmetry generator from a Killing vector
        is the so-called deformation

        metric.12 Taking u to represent the symmetry vector for

        QS, the idea is to see how far it is from being a Killing

        vector. A vector field v is Killing if and only if the deformation tensor
        Lvg= 0. Here Lu denotes the Lie derivative along u and g is the
        Euclidean metric. In 3D, this

        may be written as,

        Lug= ∇u + (∇u)T. (31)

        Evaluating this tensor for a quasisymmetric configuration

        should then provide information regarding the closeness

        to an isometry. It is convenient12 to evaluate this rank-2

        tensor in a basis defined by {B,u,∇ψ}, a triplet that we

        shall take to be independent. Then,

        [∇u + (∇u)ᵀ]·B = j×u + ∇C, (32)

        [∇u + (∇u)ᵀ]·u = ∇u² − u×∇×u, (33)

        [∇u + (∇u)ᵀ]·∇ψ = ∇ψ×∇×u + 2∇ψ·∇u, (34)

        where the weak quasisymmetric properties have been

        used where necessary. Equation (33) is what Burby et al.

        call w.11 We shall explore this vector w in more detail

        after obtaining an explicit form for Lug. Using (32)-(34),

        and projecting once again onto the non-orthogonal basis

        triplet, we obtain

        Lug = [ 0, u·∇C, ∇ψ·(j×u + ∇C) ; ..., u·w, w·∇ψ ; ..., ..., −u·∇|∇ψ|² ]. (35)

        The matrix is symmetric by construction. The content

        of its elements can be made clearer using GBC explicitly.

        The top row, corresponding to (32), has already been

        dealt with, as it is precisely the piece corresponding to

        strong QS. We made the observation that for this condition to be
        satisfied, (30) required u·∇Bθ = 0 = u·∇Bψ.

        For the other components, the vector field w = ωu×u + ∇u², where ωu = ∇×u, is key. Using the covariant form

        of u we obtain the curl of the vector u in the form

        ωu = ∇uψ ×∇ψ+ ∇uχ ×∇χ+ ∇uφ ×∇φ. (36)

        Taking the cross product with u, using the orthogonality

        of u with ∇ψ and ∇χ, and (24) and (28) we get

        w = (u·∇uψ)∇ψ + (u·∇u²)(∇φ − (1/ῑ)∇χ) − J(u·∇Bθ)∇χ, (37)

        which implies that

        B·w = −ῑ u·∇Bθ, (38)

        u·w = u·∇u². (39)

        Most importantly, a vanishing w implies that the covariant components of the symmetry vector, as well as Bθ, are quasisymmetric.

        To complete the simplification of the metric tensor, we

        invoke B2 = ( C2 + |∇ψ|2)/u2, which follows from the

        definition of u. This means,

        u ·∇|∇ψ|2 = B2u ·∇u2 −u ·∇C2. (40)

        With this coordinate representation, the dependence of

        the various metric pieces is made explicit. We may then

        schematically present the dependence of Lug as follows,

        Lug ∼ [ 0, ∂φBθ, {∂φBθ, ∂φBψ} ; ..., ∂φu², {∂φu², ∂φuψ, ∂φBθ} ; ..., ..., {∂φBθ, ∂φu²} ]. (41)

        The expressions listed in each entry indicate that the corresponding tensor component vanishes if those expressions do. If
        the tensor (41) is to vanish, then the

        symmetry vector would correspond to a rotation. This is

        not surprising if one looks at what it means for the components in (41)
        to vanish. Axisymmetry is reached when

        the covariant components of the magnetic field and the

        symmetry vector are themselves symmetric. The latter is

        intimately related to the geometry, as we may see when

        writing u ∝∂φx|χ,ψ.

        From (41), it follows that in some sense, weak QS is

        far from being an isometry. This is so because only one

        of the components of the tensor exactly vanishes. The

        φ dependence of the functions Bψ, Bθ, u², and uψ takes

        the configuration away from axisymmetry. These apparent four degrees of
        freedom (especially those involving

        u) may not be independent and involve highly non-linear

        combinations—they should ultimately be related through

        the quasisymmetric magnetic equations. Anyhow, the

        field-line dependence is key in distinguishing the weakly

        quasisymmetric form from, say, an axisymmetric tokamak.

        To make a comparative measurement of the departure

        from axisymmetry, consider now the case of strong QS.

        In this case, following (22), the first whole row (and thus

        also column) of (41) drop. The remaining dependence

        also simplifies, and the system is precluded from being

        an isometry, a priori, only through the φ dependence of u² and uψ, which is consistent with the work by Burby

        et al.12

        Imposing additional properties to the field may also

        affect the form of the deformation tensor. An example

        would be a particular form of force balance. We now

        explore how the magnetics and equilibria are linked.

        IV. QUASISYMMETRY AND EQUILIBRIA

        Let us consider the force balance part of the problem.

        Generally, a magnetic equilibrium with some arbitrary

        force F reads,

        j ×B = F. (42)

        As we argued in Sec. II, we are concerned in this work

        with a general fluid force F. Its connection to the microphysics is not
        considered. Let us express the left-hand

        side of (42) in GBC. Using the contravariant form for

        the current (29) together with (24), we obtain

        j×B = [B·∇Bψ − J⁻¹(B′α − ῑ′Bθ)]∇ψ + (B·∇Bθ)(∇χ − ῑ∇φ), (43)

        which is an explicit coordinate representation of j ×B.

        The form of (43) mirrors the form of (30). In this case,

        the magnetic differential operators substitute the directional
        derivatives along u. We note that (43) does not

        have any component along B, as can be checked by taking the dot product
        with B.

        The form of (43) puts constraints on the allowable

        forms for F. As already noted, F ·B = 0 must hold

        true. Otherwise the system would be imbalanced, as the

        magnetic field is unable to exert forces along B. Because

        of this reduction in the dimensionality of F and in view

        of (43), it is convenient to write

        F = Fψ∇ψ+ Fα(∇χ−¯ι∇φ) . (44)

        An alternative form would be to use the contravariant

        form of B (16) in (42) to get

        F = (j·∇χ − ῑ j·∇φ)∇ψ − (j·∇ψ)(∇χ − ῑ∇φ). (45)

        Substituting (43) and (44) into (42) we get two magnetic

        differential equations

        B·∇Bψ = Fψ + J⁻¹(B′α − ῑ′Bθ), (46a)

        B·∇Bθ = Fα = −j·∇ψ. (46b)
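
        Equations (46) follow by matching components (an added intermediate step in LaTeX notation): substituting (43) for j×B and (44) for F into (42) and equating the coefficients of the two independent covectors ∇ψ and (∇χ − ῑ∇φ),

        \left[\mathbf{B}\cdot\nabla B_\psi-\mathcal{J}^{-1}\left(B_\alpha'-\bar\iota' B_\theta\right)\right]\nabla\psi+(\mathbf{B}\cdot\nabla B_\theta)\left(\nabla\chi-\bar\iota\nabla\phi\right)=F_\psi\nabla\psi+F_\alpha\left(\nabla\chi-\bar\iota\nabla\phi\right),

        while comparison with (45) identifies Fα = −j·∇ψ.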

        Therefore, the generalized force-balance condition is

        equivalent to two magnetic differential equations (MDEs)

        and B ·F = 0. If solutions to these equations can be

        found together with the magnetic equations (17) and

        (18), we will have obtained a quasisymmetric configuration in
        equilibrium.

        Let us describe in more detail the implications of these

        equations. First look at the simpler (46b). This equation

        has two pieces to it. First, and regardless of the assumed

        form for Fα, it follows from weak QS that B ·∇Bθ =

        −j ·∇ψ (see Appendix A as well). This imposes the

        condition (II B) on the field. Secondly, the component

        Fα of the forcing F directly sets the off-surface current.

        This means that,20

        ∮ (dℓ/B) Fα = 0. (47)

        We may not choose Fα arbitrarily. It must satisfy (47) if the force is to be consistent with QS, a condition analogous to (12).

        Then the magnetic differential equation can be satisfied,

        and one can directly relate Bθ and Fα up to a flux function. In Fourier representation, it is clear that the φ content of Bθ will be non-zero only if that of Fα is (and vice versa). In the light of (41), choosing ∂φ(j·∇ψ) = 0 brings

        the quasisymmetric configuration closer to an isometry.

        This freedom in the form of Fα does not exist in the

        strong formulation of QS, for which j·∇ψ is independent

        of the field line label.

        A similar analysis is suitable for (46a). The appropriate Newcomb
        condition in this case is,

        ∮ (dℓ/B) [Fψ + J⁻¹(B′α − ῑ′Bθ)] = 0. (48)

        This condition may be understood as an averaged radial

        equilibrium equation. A similar solvability condition can

        be found, for the special case of MHS, in [4], where the

        notion of a QS field is presented as one that satisfies the

        Newcomb conditions. Given Eq. (48), Eq. (46a) relates

        Bψ, Bθ (or Fα) and Fψ. Once again, we see the close

        relationship between the forcing, the magnetic covariant

        representation, and the deviation from axisymmetry. A φ dependence of Fα will bring a finite deviation of u from being a Killing vector. However, it will also force Bθ and Bψ to have a φ dependence, which may require very

        particular shaping of the forces. This observation is consistent with
        the Constantin-Drivas-Ginsberg theorem 23,

        in which the forcing is seen to be intimately related to the

        deviation from an isometry. Here the asymmetric geometry, quasisymmetry,
        and the forcing are all intimately

        connected.

        When the magnetic differential equations imposing

        force balance are brought together with the magnetic

        equations from the previous section, it is not obvious

        how the system of equations is to be interpreted: what

        is to be taken as an input and what should be solved for.

        Just as an analogy, in the Grad-Shafranov equation, it is

        clear that p and F are inputs, and ψ is the output. In

        the present problem, we have the construction of GBC

        in addition to the various magnetic covariant forms and

        the components of F. Motivated by the treatment in [3]

        (which deals with a special case of the above), we propose as a
        possibility for a well-posed problem (46b) to be

        solved for Bθ given Fα, while Fψ is the output of (46a), with the function Bψ specified from the magnetic equations. It is not a
        trivial matter to determine a convenient

        way in which to formulate the problem. A more elaborate discussion on
        this procedure and its implications on

        constructing solutions is left to future work.

        A. Ideal MHD: j ×B = ∇p(ψ)

        Let us now revisit the limit of ideal MHD without

        flows, j ×B = ∇p. More general forms will be discussed

        elsewhere, together with a more systematic treatment of

        the quasisymmetric system of equations.

        In ideal MHD, from j ·∇p(ψ) = 0 it follows that

        B ·∇Bθ = 0. (49)

        Thus taking Fα = 0 forces Bθ into a flux function. Of

        course, this also means that u·∇Bθ = 0. Furthermore, as

        Fψ = p′, in static ideal MHD, (43) leads to the magnetic

        differential equation for Bψ,

        B·∇Bψ = p′(ψ) + J⁻¹(B′α − ῑ′Bθ). (50)

        Since Bψ must be a single-valued function, the flux-surface average of (50) gives

        p′(ψ) + (⟨B²⟩/Bα)(B′α − ῑ′Bθ) = 0. (51)
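
        As an added worked step (LaTeX notation): the flux-surface average annihilates the magnetic differential operator acting on the single-valued Bψ, and from (5), ⟨J⁻¹⟩ = ⟨B²⟩/Bα, so

        0=\langle\mathbf{B}\cdot\nabla B_\psi\rangle=p'(\psi)+\langle\mathcal{J}^{-1}\rangle\left(B_\alpha'-\bar\iota' B_\theta\right)=p'(\psi)+\frac{\langle B^{2}\rangle}{B_\alpha}\left(B_\alpha'-\bar\iota' B_\theta\right).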

        If we choose the forms of B, p, and ι, this pins down the form of Bα. Now, looking back to (50), every term on the right-hand side is quasisymmetric. Therefore, Bψ must

        also be quasisymmetric if it is to satisfy the force-balance

        equation. Note that we have already recognized this constraining requirement on the form of Bψ as the origin

        of the Garren-Boozer overdetermination problem14. The

        Newcomb condition on this equation can be recognized

        as the condition to avoid Pfirsch-Schl¨ uter current singularities.

        The simplifications due to ideal static MHD lead to

        the vanishing of (30). Therefore, in this limit weak QS

        is identical to strong QS.

        One can further show using (29), (15), (23) and (50),

        j = −(1/ῑ)∂ψ(Bα − ῑBθ)B − (1/ῑ)p′(ψ)u, (52)

        where the gauge choice for u has been made Φ′= ¯ι. The

        expression for j ought to be u-gauge independent, as it is

        a physical quantity. The B piece as written is gauge independent, but
        the u term is not. The ῑ factor in the latter is to be interpreted as Φ′. This equation had been obtained previously
        11,12 using coordinate free, differential

        forms (see Appendix C). Two special cases of ideal MHD,

        a) vacuum (j = 0) and b) force-free ( j = λ(ψ,α)B), are

        worth pointing to for their importance in plasma physics.

        For both these cases p′(ψ) = 0. From (52) we see that for

        the magnetic field to be curl-free (vacuum) and quasisymmetric, C′(ψ) = 0; i.e., C(ψ) must be a constant. For quasisymmetric force-free fields, we must have λ = −C′(ψ).

        Note that in strong QS these conclusions follow directly

        from the equation j ×u+∇C = 0 with j = 0 and j = λB.

        V. CONCLUSIONS

        In this paper, we have presented, defined, and discussed a straight field
        line coordinate system which is

        natural for the representation of general-equilibria quasisymmetric
        magnetic fields: generalized Boozer coordinates. We proved the existence
        of the said coordinate system for the subset of fields for which ∮ j·∇ψ dl/B = 0, to which quasisymmetric fields belong. These coordinates reduce to Boozer coordinates when j·∇ψ = 0.

        The explicit form of the symmetry in this coordinate

        representation enables a simple formulation of the quasisymmetric
        problem. We explicitly construct the governing equations setting clearly
        the foundation for future investigations, including expansion 14,16 and
        global

        approaches. Exploiting GBC, we explicitly show the essential differences
        between the weak and strong formulations of QS and between quasisymmetry
        and axisymmetry. Weak QS generally lies far from axisymmetry, for

        which many functions describing the field and symmetry

        need to be symmetric.

        We also included a set of simple magnetic differential

        equations that fully describe equilibrium with an arbitrary macroscopic
        force to complete the treatment. The

        property of QS, together with the force-balance structure, imposes
        requirements on the forcing terms in the

        form of Newcomb conditions. In addition, the equations

        establish clear connections between QS, forcing, and departures from
        axisymmetry.

        ACKNOWLEDGEMENTS

        We are grateful to J. Burby, N. Duigan, J. Meiss,

        and D. Ginsburg for stimulating discussions. This research is supported by grants from the Simons Foundation/SFARI (560651, AB) and DoE Contract No. DE-AC02-09CH11466.

        DATA AVAILABILITY

        Data sharing is not applicable to this article as no new

        data were created or analyzed in this study.

        Appendix A: Off-surface current and QS

        In this appendix we directly show how the triple vector

        formulation of QS in Eq. (1) imposes a constraint on the

        off-surface component of the current at every magnetic

        surface.

        Let us start by noting that B ·∇B = f(ψ,B) is a

        consequence of (1). In that sense, we can reshape the

        triple vector condition in the convenient form,

        ∇ψ×∇B·∇(B²/(B·∇B)) = 0. (A1)

        (For now we shall not worry about B·∇B = 0.) We may

        express this as a divergence, since ∇·(∇ψ×∇B) = 0. Separating the argument
        of the divergence into a component

        parallel to the magnetic field and perpendicular to it, we

        may rewrite (A1),

        ∇·[ (∇ψ×∇B·B/(B·∇B)) B + ∇ψ×B ] = 0. (A2)

        It is customary to define C = ∇ψ×∇B·B/(B ·∇B).

        This can therefore be rewritten, using ∇·B = 0,

        B ·∇C = j ·∇ψ. (A3)
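
        As an added intermediate step (LaTeX notation; it simply expands the divergence in (A2) using ∇·B = 0 and j = ∇×B):

        0=\nabla\cdot\left[C\,\mathbf{B}+\nabla\psi\times\mathbf{B}\right]=\mathbf{B}\cdot\nabla C+\mathbf{B}\cdot\nabla\times\nabla\psi-\nabla\psi\cdot\nabla\times\mathbf{B}=\mathbf{B}\cdot\nabla C-\mathbf{j}\cdot\nabla\psi.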

        It is then clear that the Newcomb condition,20

        ∮ j·∇ψ dl/B = 0, (A4)

        must hold,

        with the line integral taken along closed magnetic field

        lines. For irrational field lines, this amounts to ⟨j·∇ψ⟩=

        0, which is true of all magnetic fields with flux surfaces

        B ·∇ψ = 0. The condition (A4) is a more constraining

        condition than the flux surface average one 20.

        The division by B·∇B seems to be ill-defined at the extrema of the magnetic field along the field lines. However, the above holds, first, arbitrarily close to these extrema, and thus we would expect it to hold by continuity. The fact

        that the condition holds can be seen by looking at the

        fundamental formulation of QS. 3 As for the behavior of

        C, this is a physical quantity related to the width of banana orbits of
        bouncing particles. If deeply trapped and

        barely trapped particles are to be kept, then physically

        C must be finite at the extrema. Once we express everything in terms of
        GBC, the apparent divergence of C

        disappears, as both the numerator and denominator are proportional to ∂χB, which vanishes when B·∇B = 0.

        Appendix B: Gauge freedom in weak u

        As noted, the definition of u by the weak quasisymmetry equations
        (19)-(21) is invariant to the relabelling of

        Φ. A rescaling of the flux surface label together with u

        leaves the equations invariant, and should therefore have

        no physical implication in the description of quasisymmetry. Keeping the
        general gauge label Φ(ψ) of (20),

        Eq. (23) reads,

        u = ∇Φ×∇χ/(B·∇χ). (B1)

        Then it follows that

        C = (Φ′/ῑ)(Bα − ῑBθ), (B2)

        and

        u·∇ = (Φ′/ῑ) ∂φ. (B3)

        In Sec. III A we then went on to evaluate how the symmetry vector u as
        defined by weak QS compares to strong

        QS. To do so, we needed to evaluate (30), the third of

        the strong QS conditions. The first thing to note is

        that this equation is not gauge-invariant. This is, of

        course, an added complication to the problem. In a sense,

        this gauge-symmetry breaking leaves no unique form for

        j ×u + ∇C, but rather a whole family.

        This can be seen explicitly if instead of restricting ourselves to the
        special case Φ′ = ῑ, we keep Φ general.

        Then, (30) can be written in the form,

        j×u + ∇C = [u·∇Bψ − C(Φ′/ῑ)∂ψ(ῑ/Φ′)]∇ψ + (u·∇Bθ)(∇χ − ῑ∇φ). (B4)

        So the question may be reformulated: what are the conditions for this
        expression to vanish? As before one piece

        is u ·∇Bθ = 0. Now, the other component has the form

        of a differential equation along the direction of symmetry.

        We may write it explicitly as,

        ∂φBψ − (Bα − ῑBθ) ∂ψ ln(ῑ/Φ′) = 0. (B5)
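
        As a brief check (an added step in LaTeX notation, using (B2) and u·∇ = (Φ′/ῑ)∂φ): multiplying the ∇ψ component of (B4) by ῑ/Φ′ gives

        \frac{\bar\iota}{\Phi'}\left[\mathbf{u}\cdot\nabla B_\psi-C\,\frac{\Phi'}{\bar\iota}\,\partial_\psi\!\left(\frac{\bar\iota}{\Phi'}\right)\right]=\partial_\phi B_\psi-(B_\alpha-\bar\iota B_\theta)\,\frac{\Phi'}{\bar\iota}\,\partial_\psi\!\left(\frac{\bar\iota}{\Phi'}\right)=\partial_\phi B_\psi-(B_\alpha-\bar\iota B_\theta)\,\partial_\psi\ln\!\left(\frac{\bar\iota}{\Phi'}\right)=0,

        which is (B5).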

        For the equation to be consistent it must then hold that

        either ῑ = aΦ′, where a is a constant, or ∮ (Bα − ῑBθ) dφ = 0. The latter is a condition on the physically relevant

        part of the problem. Thus, it is minimal to choose Φ′= ¯ι

        to satisfy this condition. Then we are simply left with

        u ·∇Bψ = 0, which is the choice made in the main text.

        Appendix C: Coordinate independent form of

        quasisymmetric current

        In this appendix we present a brief derivation of (52)

        without using coordinates. Start from,

        B ×u = ∇Φ (C1)

        with Φ some single-valued function. Magnetic flux surfaces are labelled by Φ, to which B as well as the vector field u are tangent. From this condition, one may also

        write the symmetry vector field in the following form,

        u = (1/|B|²)(CB − B×∇Φ). (C2)

        Here C = B ·u is some arbitrary function of space, as

        the constraint equation only describes the perpendicular

        part of u.

        Because of magnetostatic equilibrium, we have

        j ×B = ∇p (C3)

        where j = ∇×B is the plasma current density. Because

        B·∇p = 0, it follows that p = p(ψ), again assuming that the field lines are irrational. By continuity, we argue that this

        must be true for the rational cases as well (provided the

        rotational transform is not constant ι′̸= 0).

        Because the plasma pressure is also a flux function, the

        current density j also lives on magnetic flux surfaces. Furthermore, requiring ∇p to be nowhere vanishing (except on

        the axis), the magnetic field and the currents are guaranteed to be
        non-collinear. Therefore, at every point on

        a flux surface, we may define a tangent plane spanned by

        j and B.

        Because u also exists in this subspace, we can express

        it in the chosen basis as,

        u = a j + β̃B

        Taking the cross product of this form of u with B, and

        making use of (C1) and (C3), it is clear that a = a(ψ) = −Φ′/p′, where the prime denotes a derivative with respect to the flux ψ.

        Let us now apply (21), namely ∇·u = 0, on our form

        of u,

        ∇·u = B·∇β̃ = 0 ⟹ β̃ = β̃(ψ)

        so that,

        j = −(p′(ψ)/Φ′) u + β(ψ)B. (C4)

        Using the form (C2), we can then rewrite the expression

        for j in the form,

        j = (β(ψ) − p′(ψ)C/(B²Φ′)) B + (p′(ψ)/B²) B×∇ψ. (C5)

        The gauge-independent combination C/Φ′ = (B·∇ψ×∇B)/(B·∇B) here is the same as appears in (25), and

        it is a flux function. There we found a closed form for

        this combination in terms of the covariant representation

        of B in GBC. Equation (C5), together with ∇· B = 0

        and B ·∇ψ = 0 constitute the whole set of governing

        equations for the QS, MHS field. The problem in this

        case can be formulated by providing β, C/Φ′ and p as

        input flux functions and solving for B.

        Assuming that we have strong QS, β = −C′/Φ′, and we obtain an expression equivalent to that obtained in (52).

        The assumption of strong QS was, however, not necessary when working
        with the coordinate representation.

        The adoption of some form of covariant representation

        of B seems to be necessary to obtain such a relation.

        To that end, GBC is convenient, though not necessary.

        In fact, [11] employs the B♭ form (the generalisation of

        the covariant representation of the magnetic field), without specifying
        coordinates, to prove (52). Note, however,

        that the difference in the form of β is merely a matter of

        how the flux function is defined. The problem has β or

        Φ′as free flux function inputs.

        1. J. Nührenberg and R. Zille, Physics Letters A 129, 113 (1988).

        2. A. H. Boozer, The Physics of Fluids 24, 1999 (1981).

        3. E. Rodríguez, P. Helander, and A. Bhattacharjee, Physics of Plasmas 27, 062501 (2020).

        4. M. Tessarotto, J. L. Johnson, and L. J. Zheng, Physics of Plasmas 2, 4499 (1995).

        5. M. Tessarotto, J. L. Johnson, R. B. White, and L. Zheng, Physics of Plasmas 3, 2653 (1996).

        6. A. H. Boozer, Physics of Fluids 26, 496 (1983).

        7. D. A. Garren and A. H. Boozer, Physics of Fluids B: Plasma Physics 3, 2822 (1991).

        8. M. Landreman and W. Sengupta, Journal of Plasma Physics 84, 905840616 (2018).

        9. A. Bader, M. Drevlak, D. T. Anderson, B. J. Faber, C. C. Hegna, K. M. Likin, J. C. Schmitt, and J. N. Talmadge, Journal of Plasma Physics 85, 905850508 (2019).

        10. S. Henneberg, M. Drevlak, C. Nührenberg, C. Beidler, Y. Turkin, J. Loizu, and P. Helander, Nuclear Fusion 59, 026014 (2019).

        11. J. W. Burby, N. Kallinikos, and R. S. MacKay, Journal of Physics A: Mathematical and Theoretical 54, 125202 (2021).

        12. J. W. Burby, N. Kallinikos, and R. S. MacKay, Journal of Mathematical Physics 61, 093503 (2020).

        13. J. W. Burby and H. Qin, Physics of Plasmas 20, 012511 (2013).

        14. E. Rodríguez and A. Bhattacharjee, Physics of Plasmas 28, 012508 (2021).

        15. D. A. Garren and A. H. Boozer, Physics of Fluids B: Plasma Physics 3, 2805 (1991).

        16. E. Rodríguez and A. Bhattacharjee, Physics of Plasmas 28, 012509 (2021).

        17. M. D. Kruskal and R. M. Kulsrud, The Physics of Fluids 1, 265 (1958).

        18. P. Helander, Reports on Progress in Physics 77, 087001 (2014).

        19. A more general form J′ = J′(ψ, B) could have been demanded. However, one may show in that case that the system may always be cast into the form here employed.

        20. W. A. Newcomb, The Physics of Fluids 2, 362 (1959).

        21. W. Sengupta, E. J. Paul, H. Weitzner, and A. Bhattacharjee, Journal of Plasma Physics 87, 905870205 (2021).

        22. E. J. Paul, T. Antonsen, M. Landreman, and W. A. Cooper, Journal of Plasma Physics 86, 905860103 (2020).

        23. P. Constantin, T. D. Drivas, and D. Ginsberg, Journal of Plasma Physics 87, 905870111 (2021).
      - >-
        [ABSTRACT]

        Real-world time-series datasets are often multivariate with complex
        dynamics. To capture

        this complexity, high capacity architectures like recurrent- or
        attention-based sequential

        deep learning models have become popular. However, recent work
        demonstrates that simple

        univariate linear models can outperform such deep learning models on
        several commonly

        used academic benchmarks. Extending them, in this paper, we investigate
        the capabilities of

        linear models for time-series forecasting and present Time-Series Mixer
        (TSMixer), a novel

        architecture designed by stacking multi-layer perceptrons (MLPs).
        TSMixer is based on

        mixing operations along both the time and feature dimensions to extract information efficiently.

        On popular academic benchmarks, the simple-to-implement TSMixer is
        comparable to

        specialized state-of-the-art models that leverage the inductive biases
        of specific benchmarks.

        On the challenging and large scale M5 benchmark, a real-world retail
        dataset, TSMixer

        demonstrates superior performance compared to the state-of-the-art
        alternatives. Our results

        underline the importance of efficiently utilizing cross-variate and
        auxiliary information for

        improving the performance of time series forecasting. We present various
        analyses to shed

        light into the capabilities of TSMixer. The design paradigms utilized in
        TSMixer are expected

        to open new horizons for deep learning-based time series forecasting.
        The implementation

        is available at: https://github.com/google-research/google-research/tree/master/tsmixer.


        [1 Introduction]

        Time series forecasting is a prevalent problem in numerous real-world
        use cases, such as for forecasting of

        demand of products (Böse et al., 2017; Courty & Li, 1999), pandemic
        spread (Zhang & Nawata, 2018), and

        inflation rates (Capistrán et al., 2010). The forecastability of time
        series data often originates from three

        major aspects:



        Figure 1: TSMixer for multivariate time series forecasting. The columns of the input represent different features/variates and the rows are time steps. The fully-connected operations are row-wise. TSMixer contains interleaving time-mixing and feature-mixing MLPs to aggregate information. The number of mixer layers is denoted as N. The time-mixing MLPs are shared across all features and the feature-mixing MLPs are shared across all of the time steps. The design allows TSMixer to automatically adapt the use of both temporal and cross-variate information with a limited number of parameters for superior generalization. The extension with auxiliary information is also explored in this paper.

        - Persistent temporal patterns: encompassing trends and seasonal patterns, e.g., long-term inflation, day-of-week effects;

        - Cross-variate information: correlations between different variables, e.g., an increase in blood pressure associated with a rise in body weight;

        - Auxiliary features: comprising static features and future information, e.g., product categories and promotional events.



        Traditional models, such as ARIMA (Box et al., 1970), are designed for
        univariate time series, where only

        temporal information is available. Therefore, they face limitations when
        dealing with challenging real-world

        data, which often contains complex cross-variate information and
        auxiliary features. In contrast, numerous

        deep learning models, particularly Transformer-based models, have been
        proposed due to their capacity to

        capture both complex temporal patterns and cross-variate dependencies
        (Gamboa, 2017; Li et al., 2019; Wen

        et al., 2017; Zhou et al., 2021; Wu et al., 2021; Lim & Zohren, 2021;
        Liu et al., 2022a; Zhou et al., 2022b; Liu

        et al., 2022b; Zhou et al., 2022a) .

        The natural intuition is that multivariate models, such as those based
        on Transformer architectures, should be

        more effective than univariate models due to their ability to leverage
        cross-variate information. However, Zeng

        et al. (2023) revealed that this is not always the case: Transformer-based models can indeed be significantly

        worse than simple univariate temporal linear models on many commonly
        used forecasting benchmarks. The

        multivariate models seem to suffer from overfitting especially when the
        target time series is not correlated

        with other covariates. This surprising finding has raised two essential
        questions:

        1. Does cross-variate information truly provide a benefit for time
        series forecasting?

        2. When cross-variate information is not beneficial, can multivariate
        models still perform as well as

        univariate models?

        To address these questions, we begin by analyzing the effectiveness of
        temporal linear models. Our findings

        indicate that their time-step-dependent characteristics render temporal
        linear models great candidates for

        learning temporal patterns under common assumptions. Consequently, we
        gradually increase the capacity of

        linear models by

        1. stacking temporal linear models with non-linearities (TMix-Only),

        2. introducing cross-variate feed-forward layers (TSMixer).

        The resulting TSMixer alternately applies MLPs across time and feature dimensions, conceptually corresponding to time-mixing and feature-mixing operations, efficiently capturing both temporal patterns and

        cross-variate information, as illustrated in Fig. 1. The residual
        designs ensure that TSMixer retains the

        capacity of temporal linear models while still being able to exploit
        cross-variate information.

        We evaluate TSMixer on commonly used long-term forecasting datasets (Wu
        et al., 2021) where univariate

        models have outperformed multivariate models. Our ablation study
        demonstrates the effectiveness of stacking

        temporal linear models and validates that cross-variate information is
        less beneficial on these popular datasets,

        explaining the superior performance of univariate models. Even so,
        TSMixer is on par with state-of-the-art

        univariate models and significantly outperforms other multivariate
        models.

        To demonstrate the benefit of multivariate models, we further evaluate
        TSMixer on the challenging M5

        benchmark, a large-scale retail dataset used in the M-competition
        (Makridakis et al., 2022). M5 contains crucial

        cross-variate interactions such as sell prices (Makridakis et al.,
        2022). The results show that cross-variate

        information indeed brings significant improvement, and TSMixer can
        effectively leverage this information.

        Furthermore, we propose a principled design to extend TSMixer to handle auxiliary information such as static features and future time-varying features. It aligns the different types of features into the same shape and then applies mixer layers on the concatenated features to leverage the interactions between them. In this more

        practical and challenging setting, TSMixer outperforms models that are
        popular in industrial applications,

        including DeepAR (Salinas et al. 2020, Amazon SageMaker) and TFT (Lim et
        al. 2021, Google Cloud Vertex),

        demonstrating its strong potential for real world impact.

        We summarize our contributions as below:

        - We analyze the effectiveness of state-of-the-art linear models and indicate that their time-step-dependent characteristics make them great candidates for learning temporal patterns under common assumptions.



        Table 1: Recent works in time series forecasting. Category I is
        univariate time series forecasting; Category II

        is multivariate time series forecasting, and Category III is time series
        forecasting with auxiliary information.

        In this work, we propose TSMixer for Category II. We also extend TSMixer
        to leverage auxiliary information

        including static and future time-varying features for Category III.

        Category I (extrapolating temporal patterns): ARIMA (Box et al., 1970); N-BEATS (Oreshkin et al., 2020); LTSF-Linear (Zeng et al., 2023); PatchTST (Nie et al., 2023).

        Category II (additionally considering cross-variate information, i.e. multivariateness): Informer (Zhou et al., 2021); Autoformer (Wu et al., 2021); Pyraformer (Liu et al., 2022a); FEDformer (Zhou et al., 2022b); NS-Transformer (Liu et al., 2022b); FiLM (Zhou et al., 2022a); TSMixer (this work).

        Category III (additionally considering auxiliary features): MQRNN (Wen et al., 2017); DSSM (Rangapuram et al., 2018); DeepAR (Salinas et al., 2020); TFT (Lim et al., 2021); TSMixer-Ext (this work).

        - We propose TSMixer, an innovative architecture which retains the capacity of linear models to capture temporal patterns while still being able to exploit cross-variate information.

        - We point out the potential risk of evaluating multivariate models on common long-term forecasting benchmarks.

        - Our empirical studies demonstrate that TSMixer is the first multivariate model which is on par with univariate models on common benchmarks and achieves state-of-the-art on a large-scale industrial application where cross-variate information is crucial.


        [4 Tsmixer Architecture]

        Expanding upon our finding that linear models can serve as strong
        candidates for capturing time dependencies,

        we initially propose a natural enhancement by stacking linear models
        with non-linearities to form multi-layer

        perceptrons (MLPs). Common deep learning techniques, such as
        normalization and residual connections, are

        applied to facilitate efficient learning. However, this architecture
        does not take cross-variate information into

        account.

        To better leverage cross-variate information, we propose the application
        of MLPs in the time-domain and

        the feature-domain in an alternating manner. The time-domain MLPs are
        shared across all of the features,

        while the feature-domain MLPs are shared across all of the time steps.
        This resulting model is akin to the

        MLP-Mixer architecture from computer vision (Tolstikhin et al., 2021),
        with time-domain and feature-domain



        [Figure 2: Illustrations of time-step-dependent and data-dependent models within a single forecasting time step. Time-step-dependent: x_{t+1} = ∑_{i=1}^{t} w_i x_i. Data-dependent: x_{t+1} = ∑_{i=1}^{t} f_i(x) x_i.]
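
        As a toy illustration of the two formulations in Figure 2 (our own example, not code from the paper; the weights and the stand-in gating function are arbitrary):

        ```python
        import numpy as np

        rng = np.random.default_rng(1)
        t = 16
        x = rng.normal(size=t)                 # univariate history x_1, ..., x_t

        # Time-step-dependent: one fixed weight per lag, independent of the values.
        w = rng.normal(size=t)
        x_next_linear = w @ x                  # x_{t+1} = sum_i w_i * x_i

        # Data-dependent: the weights are themselves a function of the input,
        # as produced e.g. by attention; tanh is only a placeholder for a learned f.
        def f(x):
            return np.tanh(x)                  # stand-in for a learned f_i(x)

        x_next_data_dependent = f(x) @ x       # x_{t+1} = sum_i f_i(x) * x_i
        ```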

        Figure 3: The architecture of TMix-Only. It is similar to TSMixer but
        only applies time-mixing.

        operations representing time-mixing and feature-mixing operations,
        respectively. Consequently, we name our

        proposed architecture Time-Series Mixer (TSMixer).

        The interleaving design between these two operations efficiently
        utilizes both temporal dependencies and

        cross-variate information while limiting computational complexity and
        model size. It allows TSMixer to use a

        long lookback window (see Sec. 3), while maintaining the parameter
        growth in only O(L + C) instead of O(LC) if fully-connected MLPs were used. To better understand the
        utility of cross-variate information and

        feature-mixing, we also consider a simplified variant of TSMixer that
        only employs time-mixing, referred to

        as TMix-Only, which consists of a residual MLP shared across each
        variate, as illustrated in Fig. 3. We also

        present the extension of TSMixer to scenarios where auxiliary
        information about the time series is available.
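
        To make the interleaving design concrete, the following is a minimal NumPy sketch of one mixer layer, assuming an input of shape (L, C) with lookback length L and C variates; the single hidden width and the omission of normalization and dropout are our own simplifications, not the reference implementation.

        ```python
        import numpy as np

        def relu(x):
            return np.maximum(x, 0.0)

        def time_mixing(x, W_t):
            # Shared across features: mix along the time axis (columns are variates).
            # x: (L, C), W_t: (L, L); the residual keeps the linear-model capacity.
            return x + relu(W_t @ x)

        def feature_mixing(x, W1, W2):
            # Shared across time steps: a per-time-step MLP over the C variates.
            # x: (L, C), W1: (C, H), W2: (H, C)
            return x + relu(x @ W1) @ W2

        def mixer_layer(x, W_t, W1, W2):
            # One TSMixer-style block: time-mixing followed by feature-mixing, so
            # layer widths scale with L and C separately rather than with L*C.
            return feature_mixing(time_mixing(x, W_t), W1, W2)

        rng = np.random.default_rng(0)
        L, C, H = 32, 4, 8                      # lookback length, variates, hidden width
        x = rng.normal(size=(L, C))             # toy multivariate history
        out = mixer_layer(x,
                          rng.normal(size=(L, L)) * 0.1,
                          rng.normal(size=(C, H)) * 0.1,
                          rng.normal(size=(H, C)) * 0.1)
        print(out.shape)                        # (32, 4)
        ```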


        [6 Conclusions]

        We propose TSMixer, a novel architecture for time series forecasting
        that is designed using MLPs instead of

        commonly used RNNs and attention mechanisms to obtain superior
        generalization with a simple architecture.

        Our results on a wide range of real-world time series forecasting tasks
        demonstrate that TSMixer is highly

        effective in both long-term forecasting benchmarks for multivariate
        time-series, and real-world large-scale

        retail demand forecasting tasks. Notably, TSMixer is the only
        multivariate model that is able to achieve

        similar performance to univariate models in long-term time series
        forecasting benchmarks. The TSMixer

        architecture has significant potential for further improvement and we
        believe it will be useful in a wide

        range of time series forecasting tasks. Some of the potential future
        works include further exploring the

        interpretability of TSMixer, as well as its scalability to even larger
        datasets. We hope this work will pave the

        way for more innovative architectures for time series forecasting.


      - >-
        [ABSTRACT]

        Low-dimensional embeddings of nodes in large graphs have proved
        extremely

        useful in a variety of prediction tasks, from content recommendation to
        identifying

        protein functions. However, most existing approaches require that all
        nodes in the

        graph are present during training of the embeddings; these previous
        approaches are

        inherently transductive and do not naturally generalize to unseen nodes.
        Here we

        present GraphSAGE, a general inductive framework that leverages node
        feature

        information (e.g., text attributes) to efficiently generate node
        embeddings for

        previously unseen data. Instead of training individual embeddings for
        each node,

        we learn a function that generates embeddings by sampling and
        aggregating features

        from a node’s local neighborhood. Our algorithm outperforms strong
        baselines

        on three inductive node-classification benchmarks: we classify the
        category of

        unseen nodes in evolving information graphs based on citation and Reddit
        post

        data, and we show that our algorithm generalizes to completely unseen
        graphs

        using a multi-graph dataset of protein-protein interactions.


        [1 Introduction]

        Low-dimensional vector embeddings of nodes in large graphs 1 have proved
        extremely useful as

        feature inputs for a wide variety of prediction and graph analysis tasks
        [5, 11, 28, 35, 36]. The basic

        idea behind node embedding approaches is to use dimensionality reduction
        techniques to distill the

        high-dimensional information about a node’s graph neighborhood into a
        dense vector embedding.

        These node embeddings can then be fed to downstream machine learning
        systems and aid in tasks

        such as node classification, clustering, and link prediction [11, 28,
        35].

        However, previous works have focused on embedding nodes from a single
        fixed graph, and many

        real-world applications require embeddings to be quickly generated for
        unseen nodes, or entirely new

        (sub)graphs. This inductive capability is essential for high-throughput,
        production machine learning

        systems, which operate on evolving graphs and constantly encounter
        unseen nodes (e.g., posts on

        Reddit, users and videos on Youtube). An inductive approach to
        generating node embeddings also

        facilitates generalization across graphs with the same form of features:
        for example, one could train

        an embedding generator on protein-protein interaction graphs derived
        from a model organism, and

        then easily produce node embeddings for data collected on new organisms
        using the trained model.

        The inductive node embedding problem is especially difficult, compared to
        the transductive setting,

        because generalizing to unseen nodes requires “aligning” newly observed
        subgraphs to the node

        embeddings that the algorithm has already optimized on. An inductive
        framework must learn to

        ∗The two first authors made equal contributions.

        1While it is common to refer to these data structures as social or
        biological networks, we use the term graph

        to avoid ambiguity with neural network terminology.


        Figure 1: Visual illustration of the GraphSAGE sample and aggregate
        approach.

        recognize structural properties of a node’s neighborhood that reveal
        both the node’s local role in the

        graph, as well as its global position.

        Most existing approaches to generating node embeddings are inherently
        transductive. The majority

        of these approaches directly optimize the embeddings for each node using
        matrix-factorization-based

        objectives, and do not naturally generalize to unseen data, since they
        make predictions on nodes in a

        single, fixed graph [5, 11, 23, 28, 35, 36, 37, 39]. These approaches can
        be modified to operate in an

        inductive setting (e.g., [28]), but these modifications tend to be
        computationally expensive, requiring

        additional rounds of gradient descent before new predictions can be
        made. There are also recent

        approaches to learning over graph structures using convolution operators
        that offer promise as an

        embedding methodology [17]. So far, graph convolutional networks (GCNs)
        have only been applied

        in the transductive setting with fixed graphs [17, 18]. In this work we
        both extend GCNs to the task

        of inductive unsupervised learning and propose a framework that
        generalizes the GCN approach to

        use trainable aggregation functions (beyond simple convolutions).

        Present work. We propose a general framework, called GraphSAGE (SAmple
        and aggreGatE), for

        inductive node embedding. Unlike embedding approaches that are based on
        matrix factorization,

        we leverage node features (e.g., text attributes, node profile
        information, node degrees) in order to

        learn an embedding function that generalizes to unseen nodes. By
        incorporating node features in the

        learning algorithm, we simultaneously learn the topological structure of
        each node’s neighborhood

        as well as the distribution of node features in the neighborhood. While
        we focus on feature-rich

        graphs (e.g., citation data with text attributes, biological data with
        functional/molecular markers), our

        approach can also make use of structural features that are present in
        all graphs (e.g., node degrees).

        Thus, our algorithm can also be applied to graphs without node features.

        Instead of training a distinct embedding vector for each node, we train
        a set of aggregator functions

        that learn to aggregate feature information from a node’s local
        neighborhood (Figure 1). Each

        aggregator function aggregates information from a different number of
        hops, or search depth, away

        from a given node. At test, or inference time, we use our trained system
        to generate embeddings for

        entirely unseen nodes by applying the learned aggregation functions.
        Following previous work on

        generating node embeddings, we design an unsupervised loss function that
        allows GraphSAGE to be

        trained without task-specific supervision. We also show that GraphSAGE
        can be trained in a fully

        supervised manner.

        We evaluate our algorithm on three node-classification benchmarks, which
        test GraphSAGE’s ability

        to generate useful embeddings on unseen data. We use two evolving
        document graphs based on

        citation data and Reddit post data (predicting paper and post
        categories, respectively), and a multigraph generalization experiment
        based on a dataset of protein-protein interactions (predicting protein

        functions). Using these benchmarks, we show that our approach is able to
        effectively generate

        representations for unseen nodes and outperform relevant baselines by a
        significant margin: across

        domains, our supervised approach improves classification F1-scores by an
        average of 51% compared

        to using node features alone and GraphSAGE consistently outperforms a
        strong, transductive baseline

        [28], despite this baseline taking ∼100×longer to run on unseen nodes.
        We also show that the new

        aggregator architectures we propose provide significant gains (7.4% on
        average) compared to an

        aggregator inspired by graph convolutional networks [17]. Lastly, we
        probe the expressive capability

        of our approach and show, through theoretical analysis, that GraphSAGE
        is capable of learning

        structural information about a node’s role in a graph, despite the fact
        that it is inherently based on

        features (Section 5).


        [3.3 Aggregator Architectures]

        Unlike machine learning over N-D lattices (e.g., sentences, images, or
        3-D volumes), a node’s

        neighbors have no natural ordering; thus, the aggregator functions in
        Algorithm 1 must operate over

        an unordered set of vectors. Ideally, an aggregator function would be
        symmetric (i.e., invariant to

        permutations of its inputs) while still being trainable and maintaining
        high representational capacity.

        The symmetry property of the aggregation function ensures that our
        neural network model can

        be trained and applied to arbitrarily ordered node neighborhood feature
        sets. We examined three

        candidate aggregator functions:

        Mean aggregator. Our first candidate aggregator function is the mean
        operator, where we simply

        take the elementwise mean of the vectors in {h_u^{k−1}, ∀u ∈ N(v)}. The mean aggregator is nearly equivalent to the convolutional propagation rule used in the transductive GCN framework [17]. In particular, we can derive an inductive variant of the GCN approach by replacing lines 4 and 5 in Algorithm 1 with the following:4

        h_v^k ← σ(W · MEAN({h_v^{k−1}} ∪ {h_u^{k−1}, ∀u ∈ N(v)})). (2)

        We call this modified mean-based aggregator convolutional since it is a rough, linear approximation of a localized spectral convolution [17]. An important distinction between this convolutional aggregator and our other proposed aggregators is that it does not perform the concatenation operation in line 5 of Algorithm 1, i.e., the convolutional aggregator does not concatenate the node's previous layer representation h_v^{k−1} with the aggregated neighborhood vector h_{N(v)}^k. This concatenation can be

        viewed as a simple form of a “skip connection” [13] between the
        different “search depths”, or “layers”

        of the GraphSAGE algorithm, and it leads to significant gains in
        performance (Section 4).

        LSTM aggregator. We also examined a more complex aggregator based on an
        LSTM architecture

        [14]. Compared to the mean aggregator, LSTMs have the advantage of
        larger expressive capability.

        However, it is important to note that LSTMs are not inherently symmetric
        (i.e., they are not permutation invariant), since they process their
        inputs in a sequential manner. We adapt LSTMs to operate on

        an unordered set by simply applying the LSTMs to a random permutation of
        the node’s neighbors.

        3Exploring non-uniform samplers is an important direction for future
        work.

        4Note that this differs from Kipf et al’s exact equation by a minor
        normalization constant [17].


        Pooling aggregator. The final aggregator we examine is both symmetric and
        trainable. In this

        pooling approach, each neighbor’s vector is independently fed through a
        fully-connected neural

        network; following this transformation, an elementwise max-pooling
        operation is applied to aggregate

        information across the neighbor set:

        AGGREGATE_k^pool = max({σ(W_pool h_{u_i}^k + b), ∀u_i ∈ N(v)}), (3)

        where max denotes the element-wise max operator and σ is a nonlinear
        activation function. In

        principle, the function applied before the max pooling can be an
        arbitrarily deep multi-layer perceptron, but we focus on simple
        single-layer architectures in this work. This approach is inspired by

        recent advancements in applying neural network architectures to learn
        over general point sets [29].

        Intuitively, the multi-layer perceptron can be thought of as a set of
        functions that compute features for

        each of the node representations in the neighbor set. By applying the
        max-pooling operator to each of

        the computed features, the model effectively captures different aspects
        of the neighborhood set. Note

        also that, in principle, any symmetric vector function could be used in
        place of the max operator

        (e.g., an element-wise mean). We found no significant difference between
        max- and mean-pooling in

        development tests and thus focused on max-pooling for the rest of our experiments.
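
        As a rough sketch of the mean aggregator (Eq. 2) and the pooling aggregator (Eq. 3) described above, assuming neighbor features are stacked in a NumPy array and with ReLU standing in for the nonlinearity σ (illustrative code, not the authors' implementation):

        ```python
        import numpy as np

        def relu(x):
            return np.maximum(x, 0.0)

        def mean_aggregate(h_self, h_neigh, W):
            # Inductive GCN-style mean aggregator (Eq. 2): average the node's own
            # representation with its sampled neighbors, then apply a shared weight matrix.
            # h_self: (d,), h_neigh: (num_neighbors, d), W: (d, d_out)
            stacked = np.vstack([h_self[None, :], h_neigh])
            return relu(stacked.mean(axis=0) @ W)

        def pool_aggregate(h_neigh, W_pool, b):
            # Pooling aggregator (Eq. 3): transform each neighbor independently,
            # then take an element-wise max over the neighbor set (permutation invariant).
            # h_neigh: (num_neighbors, d), W_pool: (d, d_hidden), b: (d_hidden,)
            return relu(h_neigh @ W_pool + b).max(axis=0)

        rng = np.random.default_rng(0)
        d, d_out, n = 8, 16, 5
        h_v = rng.normal(size=d)
        h_N = rng.normal(size=(n, d))
        print(mean_aggregate(h_v, h_N, rng.normal(size=(d, d_out))).shape)        # (16,)
        print(pool_aggregate(h_N, rng.normal(size=(d, d_out)), np.zeros(d_out)).shape)  # (16,)
        ```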


        [6 Conclusion]

        We introduced a novel approach that allows embeddings to be efficiently
        generated for unseen nodes.

        GraphSAGE consistently outperforms state-of-the-art baselines,
        effectively trades off performance

        and runtime by sampling node neighborhoods, and our theoretical analysis
        provides insight into

        how our approach can learn about local graph structures. A number of
        extensions and potential

        improvements are possible, such as extending GraphSAGE to incorporate
        directed or multi-modal

        graphs. A particularly interesting direction for future work is
        exploring non-uniform neighborhood

        sampling functions, and perhaps even learning these functions as part of
        the GraphSAGE optimization.

        Acknowledgments

        The authors thank Austin Benson, Aditya Grover, Bryan He, Dan Jurafsky,
        Alex Ratner, Marinka

        Zitnik, and Daniel Selsam for their helpful discussions and comments on
        early drafts. The authors

        would also like to thank Ben Johnson for his many useful questions and
        comments on our code and

        Nikhil Mehta and Yuhui Ding for catching some minor errors in a previous
        version of the appendix.

        This research has been supported in part by NSF IIS-1149837, DARPA
        SIMPLEX, Stanford Data

        Science Initiative, Huawei, and Chan Zuckerberg Biohub. WLH was also
        supported by the SAP

        Stanford Graduate Fellowship and an NSERC PGS-D grant. The views and
        conclusions expressed

        in this material are those of the authors and should not be interpreted
        as necessarily representing

        the official policies or endorsements, either expressed or implied, of
        the above funding agencies,

        corporations, or the U.S. and Canadian governments.
  - source_sentence: How can I use transformer models for detailed image analysis?
    sentences:
      - >-
        [ABSTRACT]

        We consider matrix completion for recommender systems from the point of
        view of

        link prediction on graphs. Interaction data

        such as movie ratings can be represented by a

        bipartite user-item graph with labeled edges

        denoting observed ratings. Building on recent

        progress in deep learning on graph-structured

        data, we propose a graph auto-encoder framework based on differentiable
        message passing

        on the bipartite interaction graph. Our model

        shows competitive performance on standard

        collaborative filtering benchmarks. In settings

        where complementary feature information or

        structured data such as a social network is

        available, our framework outperforms recent

        state-of-the-art methods.


        [1 Introduction]

        With the explosive growth of e-commerce and social

        media platforms, recommendation algorithms have become indispensable
        tools for many businesses. Two

        main branches of recommender algorithms are often

        distinguished: content-based recommender systems [24]

        and collaborative filtering models [9]. Content-based

        recommender systems use content information of users

        and items, such as their respective occupation and

        genre, to predict the next purchase of a user or rating of an item.
        Collaborative filtering models solve

        the matrix completion task by taking into account the

        collective interaction data to predict future ratings or

        purchases.

        In this work, we view matrix completion as a link prediction problem on
        graphs: the interaction data in

        collaborative filtering can be represented by a bipartite graph between
        user and item nodes, with observed

        ratings/purchases represented by links. Content information can
        naturally be included in this framework


        in the form of node features. Predicting ratings then

        reduces to predicting labeled links in the bipartite user-item graph.

        We propose graph convolutional matrix completion

        (GC-MC): a graph-based auto-encoder framework for

        matrix completion, which builds on recent progress

        in deep learning on graphs [2, 6, 19, 5, 15, 30, 14].

        The auto-encoder produces latent features of user and

        item nodes through a form of message passing on the

        bipartite interaction graph. These latent user and item

        representations are used to reconstruct the rating links

        through a bilinear decoder.

        The benefit of formulating matrix completion as a link

        prediction task on a bipartite graph becomes especially

        apparent when recommender graphs are accompanied

        with structured external information such as social

        networks. Combining such external information with

        interaction data can alleviate performance bottlenecks

        related to the cold start problem. We demonstrate that

        our graph auto-encoder model efficiently combines interaction data with
        side information, without resorting

        to recurrent frameworks as in [22].

        The paper is structured as follows: in Section 2 we

        introduce our graph auto-encoder model for matrix

        completion. Section 3 discusses related work. Experimental results are
        shown in Section 4, and conclusion

        and future research directions are discussed in Section

        5.


        [2.4 Model Training]

        Loss function During model training, we minimize

        the following negative log likelihood of the predicted

        ratings M̌_ij:

        L = −∑_{i,j: Ω_ij = 1} ∑_{r=1}^{R} I[r = M_ij] log p(M̌_ij = r), (6)

        with I[k = l] = 1 when k = l and zero otherwise. The matrix Ω ∈ {0,1}^{N_u×N_i} serves as a mask for

        unobserved ratings, such that ones occur for elements

        corresponding to observed ratings in M, and zeros

        for unobserved ratings. Hence, we only optimize over

        observed ratings.
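
        A minimal sketch of the masked negative log-likelihood in Eq. (6), assuming the model outputs per-rating-level probabilities of shape (N_u, N_i, R); the names and the toy inputs are ours, not the paper's code:

        ```python
        import numpy as np

        def masked_nll(p, M, mask, eps=1e-12):
            # p: predicted probabilities over R rating levels, shape (N_u, N_i, R)
            # M: observed integer ratings in {1, ..., R}, shape (N_u, N_i)
            # mask: Omega in {0, 1}^{N_u x N_i}, 1 where a rating is observed
            R = p.shape[-1]
            one_hot = np.eye(R)[M - 1]                     # I[r = M_ij] as a one-hot tensor
            log_lik = (one_hot * np.log(p + eps)).sum(-1)  # log p(M_check_ij = M_ij)
            return -(mask * log_lik).sum()                 # only observed entries contribute

        rng = np.random.default_rng(0)
        Nu, Ni, R = 3, 4, 5
        p = rng.dirichlet(np.ones(R), size=(Nu, Ni))       # toy predicted distributions
        M = rng.integers(1, R + 1, size=(Nu, Ni))          # toy rating matrix
        mask = (rng.random((Nu, Ni)) < 0.5).astype(float)
        print(masked_nll(p, M, mask))
        ```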

        Node dropout In order for the model to generalize

        well to unobserved ratings, it is trained in a denoising

        setup by randomly dropping out all outgoing messages

        of a particular node, with a probability p_dropout, which we will refer to as node dropout. Messages are rescaled

        after dropout as in [28]. In initial experiments we found

        that node dropout was more efficient in regularizing

        than message dropout. In the latter case individual

        outgoing messages are dropped out independently, making embeddings more
        robust against the presence or

        absence of single edges. In contrast, node dropout also

        causes embeddings to be more independent of particular user or item
        influences. We furthermore also apply

        regular dropout [28] to the hidden layer units (3).
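
        A toy sketch of node dropout as described above, dropping all outgoing messages of sampled sender nodes and rescaling the survivors (names and shapes are our own assumptions):

        ```python
        import numpy as np

        def node_dropout(messages, p_dropout, rng):
            # messages: (num_nodes, d), one row of outgoing messages per sender node.
            # Drop every message of a sampled node, then rescale as in standard dropout.
            keep = (rng.random(messages.shape[0]) >= p_dropout).astype(messages.dtype)
            return messages * keep[:, None] / (1.0 - p_dropout)

        rng = np.random.default_rng(0)
        msgs = rng.normal(size=(6, 4))
        print(node_dropout(msgs, p_dropout=0.5, rng=rng).shape)  # (6, 4)
        ```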

        Mini-batching We introduce mini-batching by sampling contributions to
        the loss function in Eq.(6) from

        different observed ratings. That is, we sample only a

        fixed number of contributions from the sum over user

        and item pairs. By only considering a fixed number

        of contributions to the loss function, we can remove

        respective rows of users and items in M_1, ..., M_R in Eq. (7) that do not appear in the current batch. This

        serves both as an effective means of regularization, and

        reduces the memory requirement to train the model,

        which is necessary to fit the full model for MovieLens-10M into GPU memory. We experimentally verified

        that training with mini-batches and full batches leads

        to comparable results for the MovieLens-1M dataset

        while adjusting for regularization parameters. For all

        datasets except for the MovieLens-10M, we opt for full-batch training since it leads to faster convergence than

        training with mini-batches in this particular setting.


        [5 Conclusions]

        In this work, we have introduced graph convolutional

        matrix completion (GC-MC): a graph auto-encoder

        framework for the matrix completion task in recommender systems. The
        encoder contains a graph convolution layer that constructs user and item
        embeddings

        through message passing on the bipartite user-item

        interaction graph. Combined with a bilinear decoder,

        new ratings are predicted in the form of labeled edges.

        The graph auto-encoder framework naturally generalizes to include side
        information for both users and

        items. In this setting, our proposed model outperforms recent related
        methods by a large margin, as

        demonstrated on a number of benchmark datasets with

        feature- and graph-based side information. We further

        show that our model can be trained on larger scale

        datasets through stochastic mini-batching. In this setting, our model
        achieves results that are competitive

        with recent state-of-the-art collaborative filtering.

        In future work, we wish to extend this model to large-scale multi-modal
        data (comprised of text, images, and

        other graph-based information), such as present in

        many realistic recommendation platforms. In such

        a setting, the GC-MC model can be combined with

        recurrent (for text) or convolutional neural networks

        (for images). To address scalability, it is necessary to

        develop efficient approximate schemes, such as subsampling local
        neighborhoods [10]. Finally, attention

        mechanisms [1] provide a promising future avenue for

        extending this class of models.

        Acknowledgments

        We would like to thank Jakub Tomczak, Christos

        Louizos, Karen Ullrich and Peter Bloem for helpful

        discussions and comments. This project is supported

        by the SAP Innovation Center Network.
      - >-
        [ABSTRACT]

        Fully Convolutional Neural Networks (FCNNs) with contracting and
        expanding paths have shown prominence for the

        majority of medical image segmentation applications since

        the past decade. In FCNNs, the encoder plays an integral

        role by learning both global and local features and contextual

        representations which can be utilized for semantic output

        prediction by the decoder. Despite their success, the locality

        of convolutional layers in FCNNs, limits the capability of

        learning long-range spatial dependencies. Inspired by the

        recent success of transformers for Natural Language Processing (NLP) in
        long-range sequence learning, we reformulate

        the task of volumetric (3D) medical image segmentation as

        a sequence-to-sequence prediction problem. We introduce a

        novel architecture, dubbed as UNEt TRansformers (UNETR),

        that utilizes a transformer as the encoder to learn sequence

        representations of the input volume and effectively capture

        the global multi-scale information, while also following the

        successful “U-shaped” network design for the encoder and

        decoder. The transformer encoder is directly connected to

        a decoder via skip connections at different resolutions to

        compute the final semantic segmentation output. We have

        validated the performance of our method on the Multi Atlas

        Labeling Beyond The Cranial Vault (BTCV) dataset for multiorgan
        segmentation and the Medical Segmentation Decathlon

        (MSD) dataset for brain tumor and spleen segmentation tasks.

        Our benchmarks demonstrate new state-of-the-art performance on the BTCV
        leaderboard.

        Code: https://monai.io/research/unetr


        [1. Introduction]

        Image segmentation plays an integral role in quantitative

        medical image analysis as it is often the first step for analysis

        of anatomical structures [33].

        [Figure 1. Overview of UNETR. Our proposed model consists of a transformer encoder that directly utilizes 3D patches and is connected to a CNN-based decoder via skip connection.]

        Since the advent of deep learning, FCNNs and in particular "U-shaped" encoder-decoder architectures [22, 23, 21] have achieved state-of-the-art results
        chitectures [22, 23, 21] have achieved state-of-the-art results

        in various medical semantic segmentation tasks [2, 38, 19]. In

        a typical U-Net [36] architecture, the encoder is responsible

        for learning global contextual representations by gradually

        downsampling the extracted features, while the decoder upsamples the
        extracted representations to the input resolution

        for pixel/voxel-wise semantic prediction. In addition, skip

        connections merge the output of the encoder with decoder

        at different resolutions, hence allowing for recovering spatial

        information that is lost during downsampling.

        Although such FCNN-based approaches have powerful

        representation learning capabilities, their performance in

        learning long-range dependencies is limited to their localized

        receptive fields [ 20, 35]. As a result, such a deficiency

        in capturing multi-scale information leads to sub-optimal

        segmentation of structures with variable shapes and scales

        (e.g. brain lesions with different sizes). Several efforts have

        used atrous convolutional layers [9, 27, 18] to enlarge the



        receptive fields. However, locality of the receptive fields in

        convolutional layers still limits their learning capabilities to

        relatively small regions. Combining self-attention modules

        with convolutional layers [45, 50, 16] has been proposed to

        improve the non-local modeling capability.

        In Natural Language Processing (NLP), transformer-based

        models [ 42, 13] achieve state-of-the-art benchmarks in

        various tasks. The self-attention mechanism of transformers

        allows to dynamically highlight the important features of

        word sequences. Additionally, in computer vision, using

        transformers as a backbone encoder is beneficial due to their

        great capability of modeling long-range dependencies and

        capturing global context [14, 4]. Specifically, unlike the local

        formulation of convolutions, transformers encode images as

        a sequence of 1D patch embeddings and utilize self-attention

        modules to learn a weighted sum of values that are calculated

        from hidden layers. As a result, this flexible formulation

        allows to effectively learn the long-range information.

        Furthermore, Vision Transformer (ViT) [14] and its variants

        have shown excellent capabilities in learning pre-text tasks

        that can be transferred to down-stream applications [40, 6, 3].

        In this work, we propose to leverage the power of

        transformers for volumetric medical image segmentation and

        introduce a novel architecture dubbed as UNEt TRansformers

        (UNETR). In particular, we reformulate the task of 3D segmentation as a
        1D sequence-to-sequence prediction problem

        and use a transformer as the encoder to learn contextual

        information from the embedded input patches. The extracted

        representations from the transformer encoder are merged

        with the CNN-based decoder via skip connections at multiple

        resolutions to predict the segmentation outputs. Instead of

        using transformers in the decoder, our proposed framework

        uses a CNN-based decoder. This is due to the fact that transformers are
        unable to properly capture localized information,

        despite their great capability of learning global information.

        We validate the effectiveness of our method on 3D CT

        and MRI segmentation tasks using Beyond the Cranial

        Vault (BTCV) [ 26] and Medical Segmentation Decathlon

        (MSD) [ 38] datasets. In BTCV dataset, UNETR achieves

        new state-of-the-art performance on both Standard and

        Free Competition sections on its leaderboard. UNETR

        outperforms the state-of-the-art methodologies on both brain

        tumor and spleen segmentation tasks in MSD dataset.

        Our main contributions of this work are as follows:

        - We propose a novel transformer-based model for volumetric medical image segmentation.

        - To this end, we propose a novel architecture in which (1) a transformer encoder directly utilizes the embedded 3D volumes to effectively capture long-range dependencies; (2) a skip-connected decoder combines the extracted representations at different resolutions and predicts the segmentation output.

        - We validate the effectiveness of our proposed model for different volumetric segmentation tasks on two public datasets: BTCV [26] and MSD [38]. UNETR achieves new state-of-the-art performance on the leaderboard of the BTCV dataset and outperforms competing approaches on the MSD dataset.


        [3.1. Architecture]

        We have presented an overview of the proposed model

        in Fig. 2. UNETR utilizes a contracting-expanding pattern consisting of
        a stack of transformers as the encoder

        which is connected to a decoder via skip connections. As

        commonly used in NLP, the transformers operate on 1D

        sequence of input embeddings. Similarly, we create a 1D

        sequence of a 3D input volume x ∈ R^{H×W×D×C} with resolution (H, W, D) and C input channels by dividing it into flattened uniform non-overlapping patches x_v ∈ R^{N×(P^3·C)}, where (P, P, P) denotes the resolution of each patch and N = (H×W×D)/P^3 is the length of the sequence. Subsequently, we use a linear layer to project the patches into a K dimensional embedding space, which remains constant throughout the transformer layers. In order to preserve the spatial information of the extracted patches, we add a 1D learnable positional embedding E_pos ∈ R^{N×K} to the projected patch embedding E ∈ R^{(P^3·C)×K} according to

        z_0 = [x_v^1 E; x_v^2 E; ...; x_v^N E] + E_pos, (1)
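
        As a quick sketch of the patch-sequence construction and the projection in Eq. (1), with random matrices standing in for the learned linear layer and positional embedding (our own toy example, not the released UNETR code):

        ```python
        import numpy as np

        def patchify(x, P):
            # x: (H, W, D, C) volume -> (N, P^3 * C) flattened non-overlapping patches,
            # with N = (H * W * D) / P^3.
            H, W, D, C = x.shape
            x = x.reshape(H // P, P, W // P, P, D // P, P, C)
            x = x.transpose(0, 2, 4, 1, 3, 5, 6)
            return x.reshape(-1, P ** 3 * C)

        rng = np.random.default_rng(0)
        H = W = D = 32
        C, P, K = 4, 16, 768
        x_v = patchify(rng.normal(size=(H, W, D, C)), P)    # (N, P^3 * C)
        E = rng.normal(size=(P ** 3 * C, K)) * 0.01         # stand-in for the learned projection
        E_pos = rng.normal(size=(x_v.shape[0], K)) * 0.01   # learnable positional embedding
        z0 = x_v @ E + E_pos                                # Eq. (1)
        print(z0.shape)                                     # (8, 768): N = 32^3 / 16^3 = 8
        ```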

        Note that the learnable [class] token is not added to

        the sequence of embeddings since our transformer backbone

        is designed for semantic segmentation. After the embedding

        layer, we utilize a stack of transformer blocks [42, 14] comprising multi-head self-attention (MSA) and multilayer perceptron (MLP) sublayers according to

        z′_i = MSA(Norm(z_{i−1})) + z_{i−1}, i = 1...L, (2)

        z_i = MLP(Norm(z′_i)) + z′_i, i = 1...L, (3)

        where Norm() denotes layer normalization [1], MLP comprises two linear layers with GELU activation functions, i is the intermediate block identifier, and L is the number of transformer layers.

        An MSA sublayer comprises n parallel self-attention (SA) heads. Specifically, the SA block is a parameterized function that learns the mapping between a query (q) and the corresponding key (k) and value (v) representations in a sequence z ∈ R^{N×K}. The attention weights (A) are computed by measuring the similarity between two elements in z and their key-value pairs according to

        A = Softmax(qk^⊤ / √K_h), (4)

        where K_h = K/n is a scaling factor for maintaining the number of parameters to a constant value with different values of the key k. Using the computed attention weights, the output of SA for values v in the sequence z is computed as

        SA(z) = A·v, (5)

        Here, v denotes the values in the input sequence and K_h = K/n is a scaling factor. Furthermore, the output of MSA is defined as

        MSA(z) = [SA_1(z); SA_2(z); ...; SA_n(z)] W_msa, (6)

        where W_msa ∈ R^{n·K_h×K} represents the multi-headed trainable parameter weights.
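
        The following is a small NumPy sketch of Eqs. (4)-(6), with randomly initialized projection matrices standing in for the learned parameters (illustrative only, not the released UNETR code):

        ```python
        import numpy as np

        def softmax(x, axis=-1):
            e = np.exp(x - x.max(axis=axis, keepdims=True))
            return e / e.sum(axis=axis, keepdims=True)

        def self_attention(z, Wq, Wk, Wv):
            # One SA head over a sequence z of shape (N, K), projected to width K_h.
            q, k, v = z @ Wq, z @ Wk, z @ Wv
            A = softmax(q @ k.T / np.sqrt(k.shape[-1]))    # Eq. (4)
            return A @ v                                    # Eq. (5)

        def multi_head_self_attention(z, heads, W_msa):
            # Concatenate n SA heads and project back to K dimensions (Eq. 6).
            out = np.concatenate([self_attention(z, *h) for h in heads], axis=-1)
            return out @ W_msa

        rng = np.random.default_rng(0)
        N, K, n = 10, 16, 4
        Kh = K // n
        z = rng.normal(size=(N, K))
        heads = [tuple(rng.normal(size=(K, Kh)) for _ in range(3)) for _ in range(n)]
        W_msa = rng.normal(size=(n * Kh, K))
        print(multi_head_self_attention(z, heads, W_msa).shape)  # (10, 16)
        ```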

        Inspired by architectures that are similar to U-Net [ 36],

        where features from multiple resolutions of the encoder are

        merged with the decoder, we extract a sequence representation z_i (i ∈ {3, 6, 9, 12}), with size (H×W×D)/P^3 × K, from the transformer and reshape them into a (H/P) × (W/P) × (D/P) × K tensor.

        A representation in our definition is in the embedding space

        after it has been reshaped as an output of the transformer

        with feature size of K (i.e. transformer’s embedding size).

        Furthermore, as shown in Fig. 2, at each resolution we project

        the reshaped tensors from the embedding space into the input

        space by utilizing consecutive 3×3×3 convolutional layers

        that are followed by normalization layers.

        At the bottleneck of our encoder (i.e. output of transformer’s last
        layer), we apply a deconvolutional layer to the

        transformed feature map to increase its resolution by a factor

        of 2. We then concatenate the resized feature map with the


        [Figure 2 diagram: embedded patches pass through ×12 transformer blocks (Norm, Multi-Head Attention, residual add, Norm, MLP, residual add); representations z3, z6, z9, z12 feed a decoder built from Conv 3×3×3 + BN + ReLU blocks, Deconv 2×2×2 upsampling, and a final Conv 1×1×1.]

        Figure 2. Overview of UNETR architecture. A 3D input volume (e.g. C = 4 channels for MRI images) is divided into a sequence of uniform non-overlapping patches and projected into an embedding space using a linear layer. The sequence is added with a position embedding and used as an input to a transformer model. The encoded representations of different layers in the transformer are extracted and merged with a decoder via skip connections to predict the final segmentation. Output sizes are given for patch resolution P = 16 and embedding size K = 768.

        feature map of the previous transformer output (e.g. z9), and feed them into consecutive 3×3×3 convolutional layers

        and upsample the output using a deconvolutional layer. This

        process is repeated for all the other subsequent layers up

        to the original input resolution where the final output is fed

        into a 1×1×1 convolutional layer with a softmax activation

        function to generate voxel-wise semantic predictions.


        [7. Conclusion]

        This paper introduces a novel transformer-based architecture, dubbed as
        UNETR, for semantic segmentation of

        volumetric medical images by reformulating this task as a 1D

        sequence-to-sequence prediction problem. We proposed to

        use a transformer encoder to increase the model’s capability

        for learning long-range dependencies and effectively

        capturing global contextual representation at multiple scales.

        We validated the effectiveness of UNETR on different

        volumetric segmentation tasks in CT and MRI modalities.

        UNETR achieves new state-of-the-art performance in both

        Standard and Free Competitions on the BTCV leaderboard

        for the multi-organ segmentation and outperforms competing

        approaches for brain tumor and spleen segmentation on the

        MSD dataset. In conclusion, UNETR has shown the potential

        to effectively learn the critical anatomical relationships

        represented in medical images. The proposed method could

        be the foundation for a new class of transformer-based

        segmentation models in medical image analysis.
      - >-
        [ABSTRACT]

        While the Transformer architecture has become the de-facto standard for
        natural

        language processing tasks, its applications to computer vision remain
        limited. In

        vision, attention is either applied in conjunction with convolutional
        networks, or

        used to replace certain components of convolutional networks while
        keeping their

        overall structure in place. We show that this reliance on CNNs is not
        necessary

        and a pure transformer applied directly to sequences of image patches
        can perform

        very well on image classification tasks. When pre-trained on large
        amounts of

        data and transferred to multiple mid-sized or small image recognition
        benchmarks

        (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains
        excellent

        results compared to state-of-the-art convolutional networks while
        requiring substantially fewer computational resources to train.1
  - source_sentence: >-
      Are there any frameworks that adapt to different types of image
      segmentation tasks?
    sentences:
      - "[ABSTRACT]\nDeeper neural networks are more difficult to train. We\npresent a residual learning framework to ease the training\nof networks that are substantially deeper than those used\npreviously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual\nnetworks are easier to optimize, and can gain accuracy from\nconsiderably increased depth. On the ImageNet dataset we\nevaluate residual nets with a depth of up to 152 layers—8 ×\ndeeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error\non the ImageNettest set. This result won the 1st place on the\nILSVRC 2015 classification task. We also present analysis\non CIFAR-10 with 100 and 1000 layers.\nThe depth of representations is of central importance\nfor many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep\nresidual nets are foundations of our submissions to ILSVRC\n& COCO 2015 competitions 1, where we also won the 1st\nplaces on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.\n\n[1. Introduction]\nDeep convolutional neural networks [22, 21] have led\nto a series of breakthroughs for image classification [21,\n50, 40]. Deep networks naturally integrate low/mid/highlevel features [50] and classifiers in an end-to-end multilayer fashion, and the “levels” of features can be enriched\nby the number of stacked layers (depth). Recent evidence\n[41, 44] reveals that network depth is of crucial importance,\nand the leading results [41, 44, 13, 16] on the challenging\nImageNet dataset [36] all exploit “very deep” [41] models,\nwith a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks [8, 12, 7, 32, 27] have also\n1http://image-net.org/challenges/LSVRC/2015/ and\nhttp://mscoco.org/dataset/#detections-challenge2015.\n0 1 2 3 4 5 60 \n\niter. (1e4)\ntraining error (%)\n \n \n0 1 2 3 4 5 60\n\niter. (1e4)\ntest error (%)\n \n \n56-layer\n20-layer\n56-layer\n20-layer\nFigure 1. Training error (left) and test error (right) on CIFAR-10\nwith 20-layer and 56-layer “plain” networks. The deeper network\nhas higher training error, and thus test error. Similar phenomena\non ImageNet is presented in Fig. 4.\ngreatly benefited from very deep models.\nDriven by the significance of depth, a question arises: Is\nlearning better networks as easy as stacking more layers?\nAn obstacle to answering this question was the notorious\nproblem of vanishing/exploding gradients [1, 9], which\nhamper convergence from the beginning. This problem,\nhowever, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers\n[16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].\nWhen deeper networks are able to start converging, a\ndegradation problem has been exposed: with the network\ndepth increasing, accuracy gets saturated (which might be\nunsurprising) and then degrades rapidly. Unexpectedly,\nsuch degradation is not caused by overfitting , and adding\nmore layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by\nour experiments. Fig. 
1 shows a typical example.\nThe degradation (of training accuracy) indicates that not\nall systems are similarly easy to optimize. Let us consider a\nshallower architecture and its deeper counterpart that adds\nmore layers onto it. There exists a solution by construction\nto the deeper model: the added layers are identity mapping,\nand the other layers are copied from the learned shallower\nmodel. The existence of this constructed solution indicates\nthat a deeper model should produce no higher training error\nthan its shallower counterpart. But experiments show that\nour current solvers on hand are unable to find solutions that\n\narXiv:1512.03385v1  [cs.CV]  10 Dec 2015\nidentity\nweight layer\nweight layer\nrelu\nreluF(x)\x01+\x01x\nx\nF(x) x\nFigure 2. Residual learning: a building block.\nare comparably good or better than the constructed solution\n(or unable to do so in feasible time).\nIn this paper, we address the degradation problem by\nintroducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a\ndesired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired\nunderlying mapping as H(x), we let the stacked nonlinear\nlayers fit another mapping of F(x) :=H(x) −x. The original mapping is recast intoF(x)+ x. We hypothesize that it\nis easier to optimize the residual mapping than to optimize\nthe original, unreferenced mapping. To the extreme, if an\nidentity mapping were optimal, it would be easier to push\nthe residual to zero than to fit an identity mapping by a stack\nof nonlinear layers.\nThe formulation of F(x) +x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2).\nShortcut connections [2, 34, 49] are those skipping one or\nmore layers. In our case, the shortcut connections simply\nperform identity mapping, and their outputs are added to\nthe outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained\nend-to-end by SGD with backpropagation, and can be easily implemented using common libraries ( e.g., Caffe [19])\nwithout modifying the solvers.\nWe present comprehensive experiments on ImageNet\n[36] to show the degradation problem and evaluate our\nmethod. We show that: 1) Our extremely deep residual nets\nare easy to optimize, but the counterpart “plain” nets (that\nsimply stack layers) exhibit higher training error when the\ndepth increases; 2) Our deep residual nets can easily enjoy\naccuracy gains from greatly increased depth, producing results substantially better than previous networks.\nSimilar phenomena are also shown on the CIFAR-10 set\n[20], suggesting that the optimization difficulties and the\neffects of our method are not just akin to a particular dataset.\nWe present successfully trained models on this dataset with\nover 100 layers, and explore models with over 1000 layers.\nOn the ImageNet classification dataset [36], we obtain\nexcellent results by extremely deep residual nets. Our 152layer residual net is the deepest network ever presented on\nImageNet, while still having lower complexity than VGG\nnets [41]. Our ensemble has 3.57% top-5 error on the\nImageNet test set, and won the 1st place in the ILSVRC\n2015 classification competition . 
The extremely deep representations also have excellent generalization performance\non other recognition tasks, and lead us to further win the\n1st places on: ImageNet detection, ImageNet localization,\nCOCO detection, and COCO segmentation in ILSVRC &\nCOCO 2015 competitions. This strong evidence shows that\nthe residual learning principle is generic, and we expect that\nit is applicable in other vision and non-vision problems.\n\n[3.3. Network Architectures]\nWe have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.\nPlain Network. Our plain baselines (Fig. 3, middle) are\nmainly inspired by the philosophy of VGG nets [41] (Fig. 3,\nleft). The convolutional layers mostly have 3 ×3 filters and\nfollow two simple design rules: (i) for the same output\nfeature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by\nconvolutional layers that have a stride of 2. The network\nends with a global average pooling layer and a 1000-way\nfully-connected layer with softmax. The total number of\nweighted layers is 34 in Fig. 3 (middle).\nIt is worth noticing that our model has fewer filters and\nlower complexity than VGG nets [41] (Fig. 3, left). Our 34layer baseline has 3.6 billion FLOPs (multiply-adds), which\nis only 18% of VGG-19 (19.6 billion FLOPs).\n\n7x7 conv, 64, /2\npool, /2\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 128, /2\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 256, /2\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 512, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\navg pool\nfc 1000\nimage\n3x3 conv, 512\n3x3 conv, 64\n3x3 conv, 64\npool, /2\n3x3 conv, 128\n3x3 conv, 128\npool, /2\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\npool, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\npool, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\npool, /2\nfc 4096\nfc 4096\nfc 1000\nimage\noutput \nsize: 112\noutput \nsize: 224\noutput \nsize: 56\noutput \nsize: 28\noutput \nsize: 14\noutput \nsize: 7\noutput \nsize: 1\nVGG-19 34-layer plain\n7x7 conv, 64, /2\npool, /2\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 128, /2\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 256, /2\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 512, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\navg pool\nfc 1000\nimage\n34-layer residual\nFigure 3. Example network architectures for ImageNet. Left: the\nVGG-19 model [41] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs).\nRight: a residual network with 34 parameter layers (3.6 billion\nFLOPs). The dotted shortcuts increase dimensions. Table 1shows\nmore details and other variants.\nResidual Network. 
Based on the above plain network, we\ninsert shortcut connections (Fig. 3, right) which turn the\nnetwork into its counterpart residual version. The identity\nshortcuts (Eqn.(1)) can be directly used when the input and\noutput are of the same dimensions (solid line shortcuts in\nFig. 3). When the dimensions increase (dotted line shortcuts\nin Fig. 3), we consider two options: (A) The shortcut still\nperforms identity mapping, with extra zero entries padded\nfor increasing dimensions. This option introduces no extra\nparameter; (B) The projection shortcut in Eqn.(2) is used to\nmatch dimensions (done by 1 ×1 convolutions). For both\noptions, when the shortcuts go across feature maps of two\nsizes, they are performed with a stride of 2."
      - "[ABSTRACT]\nDeeper neural networks are more difficult to train. We\npresent a residual learning framework to ease the training\nof networks that are substantially deeper than those used\npreviously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual\nnetworks are easier to optimize, and can gain accuracy from\nconsiderably increased depth. On the ImageNet dataset we\nevaluate residual nets with a depth of up to 152 layers—8 ×\ndeeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error\non the ImageNettest set. This result won the 1st place on the\nILSVRC 2015 classification task. We also present analysis\non CIFAR-10 with 100 and 1000 layers.\nThe depth of representations is of central importance\nfor many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep\nresidual nets are foundations of our submissions to ILSVRC\n& COCO 2015 competitions 1, where we also won the 1st\nplaces on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.\n\n[1. Introduction]\nDeep convolutional neural networks [22, 21] have led\nto a series of breakthroughs for image classification [21,\n50, 40]. Deep networks naturally integrate low/mid/highlevel features [50] and classifiers in an end-to-end multilayer fashion, and the “levels” of features can be enriched\nby the number of stacked layers (depth). Recent evidence\n[41, 44] reveals that network depth is of crucial importance,\nand the leading results [41, 44, 13, 16] on the challenging\nImageNet dataset [36] all exploit “very deep” [41] models,\nwith a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks [8, 12, 7, 32, 27] have also\n1http://image-net.org/challenges/LSVRC/2015/ and\nhttp://mscoco.org/dataset/#detections-challenge2015.\n0 1 2 3 4 5 60 \n\niter. (1e4)\ntraining error (%)\n \n \n0 1 2 3 4 5 60\n\niter. (1e4)\ntest error (%)\n \n \n56-layer\n20-layer\n56-layer\n20-layer\nFigure 1. Training error (left) and test error (right) on CIFAR-10\nwith 20-layer and 56-layer “plain” networks. The deeper network\nhas higher training error, and thus test error. Similar phenomena\non ImageNet is presented in Fig. 4.\ngreatly benefited from very deep models.\nDriven by the significance of depth, a question arises: Is\nlearning better networks as easy as stacking more layers?\nAn obstacle to answering this question was the notorious\nproblem of vanishing/exploding gradients [1, 9], which\nhamper convergence from the beginning. This problem,\nhowever, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers\n[16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].\nWhen deeper networks are able to start converging, a\ndegradation problem has been exposed: with the network\ndepth increasing, accuracy gets saturated (which might be\nunsurprising) and then degrades rapidly. Unexpectedly,\nsuch degradation is not caused by overfitting , and adding\nmore layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by\nour experiments. Fig. 
1 shows a typical example.\nThe degradation (of training accuracy) indicates that not\nall systems are similarly easy to optimize. Let us consider a\nshallower architecture and its deeper counterpart that adds\nmore layers onto it. There exists a solution by construction\nto the deeper model: the added layers are identity mapping,\nand the other layers are copied from the learned shallower\nmodel. The existence of this constructed solution indicates\nthat a deeper model should produce no higher training error\nthan its shallower counterpart. But experiments show that\nour current solvers on hand are unable to find solutions that\n\narXiv:1512.03385v1  [cs.CV]  10 Dec 2015\nidentity\nweight layer\nweight layer\nrelu\nreluF(x)\x01+\x01x\nx\nF(x) x\nFigure 2. Residual learning: a building block.\nare comparably good or better than the constructed solution\n(or unable to do so in feasible time).\nIn this paper, we address the degradation problem by\nintroducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a\ndesired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired\nunderlying mapping as H(x), we let the stacked nonlinear\nlayers fit another mapping of F(x) :=H(x) −x. The original mapping is recast intoF(x)+ x. We hypothesize that it\nis easier to optimize the residual mapping than to optimize\nthe original, unreferenced mapping. To the extreme, if an\nidentity mapping were optimal, it would be easier to push\nthe residual to zero than to fit an identity mapping by a stack\nof nonlinear layers.\nThe formulation of F(x) +x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2).\nShortcut connections [2, 34, 49] are those skipping one or\nmore layers. In our case, the shortcut connections simply\nperform identity mapping, and their outputs are added to\nthe outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained\nend-to-end by SGD with backpropagation, and can be easily implemented using common libraries ( e.g., Caffe [19])\nwithout modifying the solvers.\nWe present comprehensive experiments on ImageNet\n[36] to show the degradation problem and evaluate our\nmethod. We show that: 1) Our extremely deep residual nets\nare easy to optimize, but the counterpart “plain” nets (that\nsimply stack layers) exhibit higher training error when the\ndepth increases; 2) Our deep residual nets can easily enjoy\naccuracy gains from greatly increased depth, producing results substantially better than previous networks.\nSimilar phenomena are also shown on the CIFAR-10 set\n[20], suggesting that the optimization difficulties and the\neffects of our method are not just akin to a particular dataset.\nWe present successfully trained models on this dataset with\nover 100 layers, and explore models with over 1000 layers.\nOn the ImageNet classification dataset [36], we obtain\nexcellent results by extremely deep residual nets. Our 152layer residual net is the deepest network ever presented on\nImageNet, while still having lower complexity than VGG\nnets [41]. Our ensemble has 3.57% top-5 error on the\nImageNet test set, and won the 1st place in the ILSVRC\n2015 classification competition . 
The extremely deep representations also have excellent generalization performance\non other recognition tasks, and lead us to further win the\n1st places on: ImageNet detection, ImageNet localization,\nCOCO detection, and COCO segmentation in ILSVRC &\nCOCO 2015 competitions. This strong evidence shows that\nthe residual learning principle is generic, and we expect that\nit is applicable in other vision and non-vision problems.\n\n[3.3. Network Architectures]\nWe have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.\nPlain Network. Our plain baselines (Fig. 3, middle) are\nmainly inspired by the philosophy of VGG nets [41] (Fig. 3,\nleft). The convolutional layers mostly have 3 ×3 filters and\nfollow two simple design rules: (i) for the same output\nfeature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by\nconvolutional layers that have a stride of 2. The network\nends with a global average pooling layer and a 1000-way\nfully-connected layer with softmax. The total number of\nweighted layers is 34 in Fig. 3 (middle).\nIt is worth noticing that our model has fewer filters and\nlower complexity than VGG nets [41] (Fig. 3, left). Our 34layer baseline has 3.6 billion FLOPs (multiply-adds), which\nis only 18% of VGG-19 (19.6 billion FLOPs).\n\n7x7 conv, 64, /2\npool, /2\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 128, /2\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 256, /2\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 512, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\navg pool\nfc 1000\nimage\n3x3 conv, 512\n3x3 conv, 64\n3x3 conv, 64\npool, /2\n3x3 conv, 128\n3x3 conv, 128\npool, /2\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\npool, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\npool, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\npool, /2\nfc 4096\nfc 4096\nfc 1000\nimage\noutput \nsize: 112\noutput \nsize: 224\noutput \nsize: 56\noutput \nsize: 28\noutput \nsize: 14\noutput \nsize: 7\noutput \nsize: 1\nVGG-19 34-layer plain\n7x7 conv, 64, /2\npool, /2\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 128, /2\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 256, /2\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 512, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\navg pool\nfc 1000\nimage\n34-layer residual\nFigure 3. Example network architectures for ImageNet. Left: the\nVGG-19 model [41] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs).\nRight: a residual network with 34 parameter layers (3.6 billion\nFLOPs). The dotted shortcuts increase dimensions. Table 1shows\nmore details and other variants.\nResidual Network. 
Based on the above plain network, we\ninsert shortcut connections (Fig. 3, right) which turn the\nnetwork into its counterpart residual version. The identity\nshortcuts (Eqn.(1)) can be directly used when the input and\noutput are of the same dimensions (solid line shortcuts in\nFig. 3). When the dimensions increase (dotted line shortcuts\nin Fig. 3), we consider two options: (A) The shortcut still\nperforms identity mapping, with extra zero entries padded\nfor increasing dimensions. This option introduces no extra\nparameter; (B) The projection shortcut in Eqn.(2) is used to\nmatch dimensions (done by 1 ×1 convolutions). For both\noptions, when the shortcuts go across feature maps of two\nsizes, they are performed with a stride of 2."
      - >-
        [Preamble]

        nnU-Net: Self-adapting Framework

        for U-Net-Based Medical Image Segmentation

        Fabian Isensee, Jens Petersen, Andre Klein, David Zimmerer, Paul F.
        Jaeger,

        Simon Kohl, Jakob Wasserthal, Gregor Köhler, Tobias Norajitra,
        Sebastian

        Wirkert, and Klaus H. Maier-Hein

        Division of Medical Image Computing, German Cancer Research Center
        (DKFZ),

        Heidelberg, Germany

        Abstract. The U-Net was presented in 2015. With its straight-forward

        and successful architecture it quickly evolved to a commonly used
        benchmark in medical image segmentation. The adaptation of the U-Net to

        novel problems, however, comprises several degrees of freedom regarding
        the exact architecture, pre-processing, training and inference. These

        choices are not independent of each other and substantially impact the

        overall performance. The present paper introduces the nnU-Net
        ("no-new-Net"), which refers to a robust and self-adapting framework on
        the

        basis of 2D and 3D vanilla U-Nets. We argue the strong case for taking

        away superfluous bells and whistles of many proposed network designs

        and instead focus on the remaining aspects that make out the performance
        and generalizability of a method. We evaluate the nnU-Net in the

        context of the Medical Segmentation Decathlon challenge, which measures
        segmentation performance in ten disciplines comprising distinct

        entities, image modalities, image geometries and dataset sizes, with no

        manual adjustments between datasets allowed. At the time of manuscript

        submission, nnU-Net achieves the highest mean dice scores across all

        classes and seven phase 1 tasks (except class 1 in BrainTumour) in the

        online leaderboard of the challenge.

        Keywords: Semantic Segmentation, Medical Imaging, U-Net


        [1 Introduction]

        Medical Image Segmentation is currently dominated by deep convolutional
        neural networks (CNNs). However, each segmentation benchmark seems to
        require

        specialized architectures and training scheme modifications to achieve
        competitive performance [1,2,3,4,5]. This results in huge amounts of
        publications in the

        field that, alongside often limited validation on only few or even just a
        single

        dataset, make it increasingly difficult for researchers to identify
        methods that live

        up to their promised superiority beyond the limited scenarios they are
        demonstrated on. The Medical Segmentation Decathlon is intended to
        specifically address this issue: participants in this challenge are asked
        to create a segmentation

        algorithm that generalizes across 10 datasets corresponding to different
        entities

        arXiv:1809.10486v1  [cs.CV]  27 Sep 2018

        of the human body. These algorithms may dynamically adapt to the
        specifics

        of a particular dataset, but are only allowed to do so in a fully
        automatic manner. The challenge is split into two successive phases: 1)
        a development phase in

        which participants are given access to 7 datasets to optimize their
        approach on

        and, using their final and thus frozen method, must submit segmentations
        for

        the corresponding 7 held-out test sets. 2) a second phase to evaluate
        the same

        exact method on 3 previously undisclosed datasets.

        We hypothesize that some of the architectural modifications presented
        recently are in part overfitted to specific problems or could suffer from
        imperfect

        validation that results from sub-optimal reimplementations of the
        state-of-the-art. Using the U-Net as a benchmark on an in-house dataset,
        for example, requires the adaptation of the method to the novel problem.
        This spans several

        degrees of freedom. Even though the architecture itself is quite
        straight-forward,

        and even though the method is quite commonly used as a benchmark, we
        believe

        that the remaining interdependent choices regarding the exact
        architecture, preprocessing, training, inference and post-processing
        quite often cause the U-Net

        to underperform when used as a benchmark. Additionally, architectural
        tweaks

        that are intended to improve the performance of a network can rather
        easily

        be demonstrated to work if the network is not yet fully optimized for
        the task

        at hand, allowing for plenty of headroom for the tweak to improve
        results. In

        our own preliminary experiments, these tweaks however were unable to
        improve

        segmentation results in fully optimized networks and thus most likely
        unable

        to advance the state of the art. This leads us to believe that the
        influence of

        non-architectural aspects in segmentation methods is much more
        impactful, but

        at the same time also severely underestimated.

        In this paper, we present the nnU-Net ("no-new-Net") framework. It
        resides

        on a set of three comparatively simple U-Net models that contain only
        minor

        modifications to the original U-Net [6]. We omit recently proposed
        extensions

        such as for example the use of residual connections [7,8], dense
        connections [5] or

        attention mechanisms [4]. The nnU-Net automatically adapts its
        architectures

        to the given image geometry. More importantly though, the nnU-Net
        framework

        thoroughly defines all the other steps around them. These are steps where
        much

        of the nets’ performance can be gained or respectively lost:
        preprocessing (e.g.

        resampling and normalization), training (e.g. loss, optimizer setting
        and data

        augmentation), inference (e.g. patch-based strategy and ensembling
        across test-time augmentations and models) and a potential
        post-processing (e.g. enforcing

        single connected components if applicable).


        [2.1 Network Architectures]

        Medical images commonly encompass a third dimension, which is why we
        consider a pool of basic U-Net architectures consisting of a 2D U-Net, a
        3D U-Net

        and a U-Net Cascade. While the 2D and 3D U-Nets generate segmentations

        at full resolution, the cascade first generates low resolution
        segmentations and

        subsequently refines them. Our architectural modifications as compared to
        the

        U-Net’s original formulation are close to negligible and instead we
        focus our

        efforts on designing an automatic training pipeline for these models.

        The U-Net [6] is a successful encoder-decoder network that has received
        a lot

        of attention in the recent years. Its encoder part works similarly to a
        traditional

        classification CNN in that it successively aggregates semantic
        information at the

        expense of reduced spatial information. Since in segmentation, both
        semantic as

        well as spatial information are crucial for the success of a network,
        the missing

        spatial information must somehow be recovered. The U-Net does this
        through

        the decoder, which receives semantic information from the bottom of the
        ’U’

        and recombines it with higher resolution feature maps obtained directly
        from

        the encoder through skip connections. Unlike other segmentation
        networks, such

        as FCN [9] and previous iterations of DeepLab [10], this allows the U-Net
        to

        segment fine structures particularly well.

        Just like the original U-Net, we use two plain convolutional layers
        between

        poolings in the encoder and transposed convolution operations in the
        decoder.

        We deviate from the original architecture in that we replace ReLU
        activation

        functions with leaky ReLUs (neg. slope 1e−2) and use instance
        normalization [11]

        instead of the more popular batch normalization [12].

        2D U-Net Intuitively, using a 2D U-Net in the context of 3D medical
        image segmentation appears to be suboptimal because valuable information
        along

        the z-axis cannot be aggregated and taken into consideration. However,
        there

        is evidence [13] that conventional 3D segmentation methods deteriorate
        in performance if the dataset is anisotropic (cf. Prostate dataset of
        the Decathlon

        challenge).

        3D U-Net A 3D U-Net seems like the appropriate method of choice for 3D

        image data. In an ideal world, we would train such an architecture on
        the entire

        patient’s image. In reality however, we are limited by the amount of
        available

        GPU memory which allows us to train this architecture only on image
        patches.

        While this is not a problem for datasets comprised of smaller images (in
        terms

        of number of voxels per patient) such as the Brain Tumour, Hippocampus
        and

        Prostate datasets of this challenge, patch-based training, as dictated
        by datasets

        with large images such as Liver, may impede training. This is due to the
        limited

        field of view of the architecture which thus cannot collect sufficient
        contextual

        information to e.g. correctly distinguish parts of a liver from parts of
        other

        organs.

        U-Net Cascade To address this practical shortcoming of a 3D U-Net on

        datasets with large image sizes, we additionally propose a cascaded
        model. Therefore, a 3D U-Net is first trained on downsampled images
        (stage 1). The segmentation results of this U-Net are then upsampled to
        the original voxel spacing and

        passed as additional (one hot encoded) input channels to a second 3D
        U-Net,

        which is trained on patches at full resolution (stage 2). See Figure 1.


        Fig. 1. U-Net Cascade (on applicable datasets only). Stage 1 (left): a 3D
        U-Net processes downsampled data, the resulting segmentation maps are
        upsampled to the original resolution. Stage 2 (right): these
        segmentations are concatenated as one-hot encodings to the full
        resolution data and refined by a second 3D U-Net.

        Dynamic adaptation of network topologies Due to the large differences

        in image size (median shape 482 ×512 ×512 for Liver vs. 36 ×50 ×35 for

        Hippocampus) the input patch size and number of pooling operations per
        axis

        (and thus implicitly the number of convolutional layers) must be
        automatically

        adapted for each dataset to allow for adequate aggregation of spatial
        information.

        Apart from adapting to the image geometries, there are technical
        constraints like

        the available memory to account for. Our guiding principle in this
        respect is to

        dynamically trade off the batch-size versus the network capacity,
        presented in

        detail below:

        We start out with network configurations that we know to be working with

        our hardware setup. For the 2D U-Net this configuration is an input patch
        size of

        256×256, a batch size of 42 and 30 feature maps in the highest layers
        (number of

        feature maps doubles with each downsampling). We automatically adapt
        these

        parameters to the median plane size of each dataset (where we use the
        plane

        with the lowest in-plane spacing, corresponding to the highest
        resolution), so

        that the network effectively trains on entire slices. We configure the
        networks to

        pool along each axis until the feature map size for that axis is smaller
        than 8 (but

        not more than a maximum of 6 pooling operations). Just like the 2D
        U-Net, our

        3D U-Net uses 30 feature maps at the highest resolution layers. Here we
        start

        with a base configuration of input patch size 128 ×128 ×128, and a batch
        size

        of 2. Due to memory constraints, we do not increase the input patch
        volume

        beyond 128³ voxels, but instead match the aspect ratio of the input
        patch size

        to that of the median size of the dataset in voxels. If the median shape
        of the

        dataset is smaller than 128³, then we use the median shape as input
        patch size

        and increase the batch size (so that the total number of voxels
        processed is the

        same as with 128 ×128 ×128 and a batch size of 2). Just like for the 2D
        U-Net

        we pool (for a maximum of 5 times) along each axis until the feature
        maps have

        size 8.

        For any network we limit the total number of voxels processed per
        optimizer

        step (defined as the input patch volume times the batch size) to a
        maximum of

        Columns: 2D U-Net | 3D U-Net | 3D U-Net lowres

        BrainTumour: median patient shape 169x138 | 138x169x138 | n/a; input patch size 192x160 | 128x128x128 | n/a; batch size 89 | 2 | n/a; num pool per axis 5, 5 | 5, 5, 5 | n/a

        Heart: median patient shape 320x232 | 115x320x232 | 58x160x116; input patch size 320x256 | 80x192x128 | 64x160x128; batch size 33 | 2 | 2; num pool per axis 6, 6 | 4, 5, 5 | 4, 5, 5

        Liver: median patient shape 512x512 | 482x512x512 | 121x128x128; input patch size 512x512 | 128x128x128 | 128x128x128; batch size 10 | 2 | 2; num pool per axis 6, 6 | 5, 5, 5 | 5, 5, 5

        Hippocampus: median patient shape 50x35 | 36x50x35 | n/a; input patch size 56x40 | 40x56x40 | n/a; batch size 366 | 9 | n/a; num pool per axis 3, 3 | 3, 3, 3 | n/a

        Prostate: median patient shape 320x319 | 20x320x319 | n/a; input patch size 320x320 | 20x192x192 | n/a; batch size 26 | 4 | n/a; num pool per axis 6, 6 | 2, 5, 5 | n/a

        Lung: median patient shape 512x512 | 252x512x512 | 126x256x256; input patch size 512x512 | 112x128x128 | 112x128x128; batch size 10 | 2 | 2; num pool per axis 6, 6 | 4, 5, 5 | 4, 5, 5

        Pancreas: median patient shape 512x512 | 96x512x512 | 96x256x256; input patch size 512x512 | 96x160x128 | 96x160x128; batch size 10 | 2 | 2; num pool per axis 6, 6 | 4, 5, 5 | 4, 5, 5

        Table 1. Network topologies as automatically generated for the seven
        phase 1 tasks

        of the Medical Segmentation Decathlon challenge. 3D U-Net lowres refers
        to the first

        stage of the U-Net Cascade. The configuration of the second stage of the
        U-Net Cascade

        is identical to the 3D U-Net.

        5% of the dataset. For cases in excess, we reduce the batch size (with a
        lower bound of 2).

        All network topologies generated for the phase 1 datasets are presented
        in

        Table 1.
  - source_sentence: What is the role of online and target networks in the BYOL architecture?
    sentences:
      - >-
        [ABSTRACT]

        Relying entirely on an attention mechanism,

        the Transformer introduced by Vaswani et

        al. (2017) achieves state-of-the-art results for

        machine translation. In contrast to recurrent

        and convolutional neural networks, it does

        not explicitly model relative or absolute position information in its
        structure. Instead,

        it requires adding representations of absolute positions to its inputs.
        In this work

        we present an alternative approach, extending the self-attention
        mechanism to efficiently

        consider representations of the relative positions, or distances between
        sequence elements.

        On the WMT 2014 English-to-German and

        English-to-French translation tasks, this approach yields improvements
        of 1.3 BLEU and

        0.3 BLEU over absolute position representations, respectively. Notably,
        we observe that

        combining relative and absolute position representations yields no
        further improvement in

        translation quality. We describe an efficient

        implementation of our method and cast it as an

        instance of relation-aware self-attention mechanisms that can generalize
        to arbitrary graphlabeled inputs.


        [1 Introduction]

        Recent approaches to sequence to sequence learning typically leverage
        recurrence (Sutskever et al.,

        2014), convolution (Gehring et al., 2017; Kalchbrenner et al., 2016),
        attention (Vaswani et al.,

        2017), or a combination of recurrence and attention (Bahdanau et al.,
        2014; Cho et al., 2014; Luong et al., 2015; Wu et al., 2016) as basic
        building

        blocks. These approaches incorporate information

        about the sequential position of elements differently.

        Recurrent neural networks (RNNs) typically

        compute a hidden state h_t, as a function of their

        input at time t and a previous hidden state h_{t-1},

        capturing relative and absolute positions along the

        time dimension directly through their sequential

        structure. Non-recurrent models do not necessarily consider input
        elements sequentially and may

        hence require explicitly encoding position information to be able to use
        sequence order.

        One common approach is to use position encodings which are combined with
        input elements to

        expose position information to the model. These

        position encodings can be a deterministic function of position
        (Sukhbaatar et al., 2015; Vaswani

        et al., 2017) or learned representations. Convolutional neural networks
        inherently capture relative

        positions within the kernel size of each convolution. They have been
        shown to still benefit from

        position encodings (Gehring et al., 2017), however.

        For the Transformer, which employs neither

        convolution nor recurrence, incorporating explicit

        representations of position information is an especially important
        consideration since the model is

        otherwise entirely invariant to sequence ordering.

        Attention-based models have therefore used position encodings or biased
        attention weights based

        on distance (Parikh et al., 2016).

        In this work we present an efficient way of

        incorporating relative position representations in

        the self-attention mechanism of the Transformer.

        Even when entirely replacing its absolute position

        encodings, we demonstrate significant improvements in translation quality
        on two machine translation tasks.

        Our approach can be cast as a special case of extending the
        self-attention mechanism of the Transformer to considering arbitrary
        relations between

        any two elements of the input, a direction we plan

        to explore in future work on modeling labeled, directed graphs.

        arXiv:1803.02155v2  [cs.CL]  12 Apr 2018


        [5 Conclusions]

        In this paper we presented an extension to selfattention that can be
        used to incorporate relative position information for sequences, which
        improves performance for machine translation.

        For future work, we plan to extend this mechanism to consider arbitrary
        directed, labeled graph

        inputs to the Transformer. We are also interested in nonlinear
        compatibility functions to combine input representations and edge
        representations. For both of these extensions, a key consideration will
        be determining efficient implementations.
      - >-
        [ABSTRACT]

        We introduce Bootstrap Your Own Latent (BYOL), a new approach to
        self-supervised image

        representation learning. BYOL relies on two neural networks, referred to
        as online and target

        networks, that interact and learn from each other. From an augmented
        view of an image, we train

        the online network to predict the target network representation of the
        same image under a different

        augmented view. At the same time, we update the target network with a
        slow-moving average

        of the online network. While state-of-the art methods rely on negative
        pairs, BYOL achieves a

        new state of the art without them. BYOL reaches 74.3% top-1
        classification accuracy on ImageNet

        using a linear evaluation with a ResNet-50 architecture and 79.6% with a
        larger ResNet. We

        show that BYOL performs on par or better than the current state of the
        art on both transfer and

        semi-supervised benchmarks. Our implementation and pretrained models are
        given on GitHub.3


        [1 Introduction]

        [Figure 1 plot: ImageNet top-1 accuracy (%) under linear evaluation vs. number of parameters, comparing BYOL, SimCLR, MoCo/MoCov2, CMC, AMDIM, InfoMin, CPCv2-L and supervised baselines]

        Figure 1: Performance of BYOL on ImageNet (linear evaluation) using
        ResNet-50 and our best architecture ResNet-200 (2×), compared to other
        unsupervised and supervised

        (Sup.) baselines [8].

        Learning good image representations is a key challenge

        in computer vision [ 1, 2, 3] as it allows for efficient

        training on downstream tasks [ 4, 5, 6, 7]. Many different training
        approaches have been proposed to learn

        such representations, usually relying on visual pretext

        tasks. Among them, state-of-the-art contrastive methods [8, 9, 10, 11,
        12] are trained by reducing the distance between representations of
        different augmented

        views of the same image (‘positive pairs’), and increasing the distance
        between representations of augmented

        views from different images (‘negative pairs’). These

        methods need careful treatment of negative pairs [13]

        by either relying on large batch sizes [8, 12], memory

        banks [9] or customized mining strategies [14, 15] to retrieve the
        negative pairs. In addition, their performance

        critically depends on the choice of image augmentations [8, 12].

        In this paper, we introduce Bootstrap Your Own

        Latent ( BYOL), a new algorithm for self-supervised

        learning of image representations. BYOL achieves higher

        performance than state-of-the-art contrastive methods

        ∗Equal contribution; the order of first authors was randomly selected.

        3 https://github.com/deepmind/deepmind-research/tree/master/byol

        arXiv:2006.07733v3  [cs.LG]  10 Sep 2020

        without using negative pairs. It iteratively bootstraps the outputs of
        a network to serve as targets for an enhanced

        representation. Moreover, BYOL is more robust to the choice of image
        augmentations than contrastive methods; we

        suspect that not relying on negative pairs is one of the leading reasons
        for its improved robustness. While previous

        methods based on bootstrapping have used pseudo-labels [16], cluster
        indices [17] or a handful of labels [18, 19, 20],

        we propose to directly bootstrap the representations. In particular,
        BYOL uses two neural networks, referred to

        as online and target networks, that interact and learn from each other.
        Starting from an augmented view of an

        image, BYOL trains its online network to predict the target network’s
        representation of another augmented view of

        the same image. While this objective admits collapsed solutions, e.g.,
        outputting the same vector for all images,

        we empirically show that BYOL does not converge to such solutions. We
        hypothesize (see Section 3.2) that the

        combination of (i) the addition of a predictor to the online network and
        (ii) the use of a slow-moving average of

        the online parameters as the target network encourages encoding more and
        more information within the online

        projection and avoids collapsed solutions.

        We evaluate the representation learned by BYOL on ImageNet [ 21] and
        other vision benchmarks using ResNet

        architectures [22]. Under the linear evaluation protocol on ImageNet,
        consisting in training a linear classifier on

        top of the frozen representation, BYOL reaches 74.3% top-1 accuracy with
        a standard ResNet-50 and 79.6% top-1

        accuracy with a larger ResNet (Figure 1). In the semi-supervised and
        transfer settings on ImageNet, we obtain results

        on par or superior to the current state of the art. Our contributions
        are: ( i) We introduce BYOL, a self-supervised

        representation learning method (Section 3) which achieves
        state-of-the-art results under the linear evaluation

        protocol on ImageNet without using negative pairs. (ii) We show that our
        learned representation outperforms the

        state of the art on semi-supervised and transfer benchmarks (Section 4).
        (iii) We show that BYOL is more resilient to

        changes in the batch size and in the set of image augmentations compared
        to its contrastive counterparts (Section 5).

        In particular, BYOL suffers a much smaller performance drop than SimCLR,
        a strong contrastive baseline, when only

        using random crops as image augmentations.


        [3 Method]

        We start by motivating our method before explaining its details in
        Section 3.1. Many successful self-supervised

        learning approaches build upon the cross-view prediction framework
        introduced in [63]. Typically, these approaches

        learn representations by predicting different views (e.g., different
        random crops) of the same image from one

        another. Many such approaches cast the prediction problem directly in
        representation space: the representation of

        an augmented view of an image should be predictive of the representation
        of another augmented view of the same

        image. However, predicting directly in representation space can lead to
        collapsed representations: for instance, a

        representation that is constant across views is always fully predictive
        of itself. Contrastive methods circumvent

        this problem by reformulating the prediction problem into one of
        discrimination: from the representation of an

        augmented view, they learn to discriminate between the representation of
        another augmented view of the same

        image, and the representations of augmented views of different images.
        In the vast majority of cases, this prevents

        the training from finding collapsed representations. Yet, this
        discriminative approach typically requires comparing

        each representation of an augmented view with many negative examples, to
        find ones sufficiently close to make the

        discrimination task challenging. In this work, we thus tasked ourselves
        to find out whether these negative examples

        are indispensable to prevent collapsing while preserving high
        performance.

        To prevent collapse, a straightforward solution is to use a fixed
        randomly initialized network to produce the targets

        for our predictions. While avoiding collapse, it empirically does not
        result in very good representations. Nonetheless,

        it is interesting to note that the representation obtained using this
        procedure can already be much better than the

        initial fixed representation. In our ablation study (Section 5), we apply
        this procedure by predicting a fixed randomly

        initialized network and achieve 18.8% top-1 accuracy (Table 5a) on the
        linear evaluation protocol on ImageNet,

        whereas the randomly initialized network only achieves 1.4% by itself.
        This experimental finding is the core

        motivation for BYOL: from a given representation, referred to as target,
        we can train a new, potentially enhanced

        representation, referred to as online, by predicting the target
        representation. From there, we can expect to build a

        sequence of representations of increasing quality by iterating this
        procedure, using subsequent online networks as

        new target networks for further training. In practice, BYOL generalizes
        this bootstrapping procedure by iteratively

        refining its representation, but using a slowly moving exponential
        average of the online network as the target network

        instead of fixed checkpoints.


        [6 Conclusion]

        We introduced BYOL, a new algorithm for self-supervised learning of
        image representations. BYOL learns its

        representation by predicting previous versions of its outputs, without
        using negative pairs. We show that BYOL

        achieves state-of-the-art results on various benchmarks. In particular,
        under the linear evaluation protocol on

        ImageNet with a ResNet-50 (1×), BYOL achieves a new state of the art
        and bridges most of the remaining gap

        between self-supervised methods and the supervised learning baseline of
        [8]. Using a ResNet-200 (2×), BYOL

        reaches a top-1 accuracy of 79.6% which improves over the previous state
        of the art (76.8%) while using 30% fewer

        parameters.

        Nevertheless, BYOL remains dependent on existing sets of augmentations
        that are specific to vision applications.

        To generalize BYOL to other modalities (e.g., audio, video, text, . . .
        ) it is necessary to obtain similarly suitable

        augmentations for each of them. Designing such augmentations may require
        significant effort and expertise.

        Therefore, automating the search for these augmentations would be an
        important next step to generalize BYOL to

        other modalities.


        Broader impact

        The presented research should be categorized as research in the field of
        unsupervised learning. This work may

        inspire new algorithms, theoretical, and experimental investigation. The
        algorithm presented here can be used for

        many different vision applications and a particular use may have both
        positive or negative impacts, which is known

        as the dual use problem. Besides, as vision datasets could be biased,
        the representation learned by BYOL could be

        susceptible to replicate these biases.

        Acknowledgements

        The authors would like to thank the following people for their help
        throughout the process of writing this paper, in

        alphabetical order: Aaron van den Oord, Andrew Brock, Jason Ramapuram,
        Jeffrey De Fauw, Karen Simonyan,

        Katrina McKinney, Nathalie Beauguerlange, Olivier Henaff, Oriol Vinyals,
        Pauline Luc, Razvan Pascanu, Sander

        Dieleman, and the DeepMind team. We especially thank Jason Ramapuram and
        Jeffrey De Fauw, who provided the

        JAX SimCLR reproduction used throughout the paper.
      - >-
        [ABSTRACT]

        The dominant sequence transduction models are based on complex recurrent
        or

        convolutional neural networks that include an encoder and a decoder. The
        best

        performing models also connect the encoder and decoder through an
        attention

        mechanism. We propose a new simple network architecture, the
        Transformer,

        based solely on attention mechanisms, dispensing with recurrence and
        convolutions

        entirely. Experiments on two machine translation tasks show these models
        to

        be superior in quality while being more parallelizable and requiring
        significantly

        less time to train. Our model achieves 28.4 BLEU on the WMT 2014
        English-to-German translation task, improving over the existing best
        results, including

        ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation
        task,

        our model establishes a new single-model state-of-the-art BLEU score of
        41.8 after

        training for 3.5 days on eight GPUs, a small fraction of the training
        costs of the

        best models from the literature. We show that the Transformer
        generalizes well to

        other tasks by applying it successfully to English constituency parsing
        both with

        large and limited training data.

        ∗Equal contribution. Listing order is random. Jakob proposed replacing
        RNNs with self-attention and started

        the effort to evaluate this idea. Ashish, with Illia, designed and
        implemented the first Transformer models and

        has been crucially involved in every aspect of this work. Noam proposed
        scaled dot-product attention, multi-head

        attention and the parameter-free position representation and became the
        other person involved in nearly every

        detail. Niki designed, implemented, tuned and evaluated countless model
        variants in our original codebase and

        tensor2tensor. Llion also experimented with novel model variants, was
        responsible for our initial codebase, and

        efficient inference and visualizations. Lukasz and Aidan spent countless
        long days designing various parts of and

        implementing tensor2tensor, replacing our earlier codebase, greatly
        improving results and massively accelerating

        our research.

        †Work performed while at Google Brain.

        ‡Work performed while at Google Research.

        31st Conference on Neural Information Processing Systems (NIPS 2017),
        Long Beach, CA, USA.

        arXiv:1706.03762v7  [cs.CL]  2 Aug 2023


        [1 Introduction]

        Recurrent neural networks, long short-term memory [13] and gated
        recurrent [7] neural networks

        in particular, have been firmly established as state of the art
        approaches in sequence modeling and

        transduction problems such as language modeling and machine translation
        [ 35, 2, 5]. Numerous

        efforts have since continued to push the boundaries of recurrent
        language models and encoder-decoder

        architectures [38, 24, 15].

        Recurrent models typically factor computation along the symbol positions
        of the input and output

        sequences. Aligning the positions to steps in computation time, they
        generate a sequence of hidden

        states ht, as a function of the previous hidden state ht−1 and the input
        for position t. This inherently

        sequential nature precludes parallelization within training examples,
        which becomes critical at longer

        sequence lengths, as memory constraints limit batching across examples.
        Recent work has achieved

        significant improvements in computational efficiency through
        factorization tricks [21] and conditional

        computation [32], while also improving model performance in case of the
        latter. The fundamental

        constraint of sequential computation, however, remains.

        Attention mechanisms have become an integral part of compelling sequence
        modeling and transduction models in various tasks, allowing modeling of
        dependencies without regard to their distance in

        the input or output sequences [2, 19]. In all but a few cases [27],
        however, such attention mechanisms

        are used in conjunction with a recurrent network.

        In this work we propose the Transformer, a model architecture eschewing
        recurrence and instead

        relying entirely on an attention mechanism to draw global dependencies
        between input and output.

        The Transformer allows for significantly more parallelization and can
        reach a new state of the art in

        translation quality after being trained for as little as twelve hours on
        eight P100 GPUs.


        [3 Model Architecture]

        Most competitive neural sequence transduction models have an
        encoder-decoder structure [5, 2, 35].

        Here, the encoder maps an input sequence of symbol representations (x_1,
        ..., x_n) to a sequence

        of continuous representations z = (z_1, ..., z_n). Given z, the decoder
        then generates an output

        sequence (y_1, ..., y_m) of symbols one element at a time. At each step
        the model is auto-regressive

        [10], consuming the previously generated symbols as additional input
        when generating the next.


        Figure 1: The Transformer - model architecture.

        The Transformer follows this overall architecture using stacked
        self-attention and point-wise, fully

        connected layers for both the encoder and decoder, shown in the left and
        right halves of Figure 1,

        respectively.


        [3.2.3 Applications Of Attention In Our Model]

        The Transformer uses multi-head attention in three different ways:

        • In "encoder-decoder attention" layers, the queries come from the
        previous decoder layer,

        and the memory keys and values come from the output of the encoder. This
        allows every

        position in the decoder to attend over all positions in the input
        sequence. This mimics the

        typical encoder-decoder attention mechanisms in sequence-to-sequence
        models such as

        [38, 2, 9].

        • The encoder contains self-attention layers. In a self-attention layer
        all of the keys, values

        and queries come from the same place, in this case, the output of the
        previous layer in the

        encoder. Each position in the encoder can attend to all positions in the
        previous layer of the

        encoder.

        • Similarly, self-attention layers in the decoder allow each position in
        the decoder to attend to

        all positions in the decoder up to and including that position. We need
        to prevent leftward

        information flow in the decoder to preserve the auto-regressive
        property. We implement this

        inside of scaled dot-product attention by masking out (setting to −∞)
        all values in the input

        of the softmax which correspond to illegal connections. See Figure 2.


        [Model]

        Columns: BLEU (EN-DE | EN-FR), Training Cost in FLOPs (EN-DE | EN-FR)

        ByteNet [18]: BLEU 23.75 | n/a

        Deep-Att + PosUnk [39]: BLEU n/a | 39.2; cost n/a | 1.0 · 10^20

        GNMT + RL [38]: BLEU 24.6 | 39.92; cost 2.3 · 10^19 | 1.4 · 10^20

        ConvS2S [9]: BLEU 25.16 | 40.46; cost 9.6 · 10^18 | 1.5 · 10^20

        MoE [32]: BLEU 26.03 | 40.56; cost 2.0 · 10^19 | 1.2 · 10^20

        Deep-Att + PosUnk Ensemble [39]: BLEU n/a | 40.4; cost n/a | 8.0 · 10^20

        GNMT + RL Ensemble [38]: BLEU 26.30 | 41.16; cost 1.8 · 10^20 | 1.1 · 10^21

        ConvS2S Ensemble [9]: BLEU 26.36 | 41.29; cost 7.7 · 10^19 | 1.2 · 10^21

        Transformer (base model): BLEU 27.3 | 38.1; cost 3.3 · 10^18

        Transformer (big): BLEU 28.4 | 41.8; cost 2.3 · 10^19

        Residual Dropout We apply dropout [33] to the output of each sub-layer,
        before it is added to the

        sub-layer input and normalized. In addition, we apply dropout to the
        sums of the embeddings and the

        positional encodings in both the encoder and decoder stacks. For the
        base model, we use a rate of

        P_drop = 0.1.

        Label Smoothing During training, we employed label smoothing of value
        ϵ_ls = 0.1 [36]. This

        hurts perplexity, as the model learns to be more unsure, but improves
        accuracy and BLEU score.


        [7 Conclusion]

        In this work, we presented the Transformer, the first sequence
        transduction model based entirely on

        attention, replacing the recurrent layers most commonly used in
        encoder-decoder architectures with

        multi-headed self-attention.

        For translation tasks, the Transformer can be trained significantly
        faster than architectures based

        on recurrent or convolutional layers. On both WMT 2014 English-to-German
        and WMT 2014

        English-to-French translation tasks, we achieve a new state of the art.
        In the former task our best

        model outperforms even all previously reported ensembles.

        We are excited about the future of attention-based models and plan to
        apply them to other tasks. We

        plan to extend the Transformer to problems involving input and output
        modalities other than text and

        to investigate local, restricted attention mechanisms to efficiently
        handle large inputs and outputs

        such as images, audio and video. Making generation less sequential is
        another research goal of ours.

        The code we used to train and evaluate our models is available at
        https://github.com/tensorflow/tensor2tensor.

        Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws
        for their fruitful

        comments, corrections and inspiration.
  - source_sentence: >-
      Are there any frameworks that adapt to different types of image
      segmentation tasks?
    sentences:
      - >-
        [ABSTRACT]

        Academic research in recommender systems has been greatly focusing on
        the accuracy-related measures of recommendations. Even

        when non-accuracy measures such as popularity bias, diversity,

        and novelty are studied, it is often solely from the users’ perspective.
        However, many real-world recommenders are often multistakeholder
        environments in which the needs and interests of several stakeholders
        should be addressed in the recommendation process. In this paper, we
        focus on the popularity bias problem which

        is a well-known property of many recommendation algorithms

        where few popular items are over-recommended while the majority

        of other items do not get proportional attention and address its

        impact on different stakeholders. Using several recommendation

        algorithms and two publicly available datasets in music and movie

        domains, we empirically show the inherent popularity bias of the

        algorithms and how this bias impacts different stakeholders such

        as users and suppliers of the items. We also propose metrics to

        measure the exposure bias of recommendation algorithms from the

        perspective of different stakeholders.

        KEYWORDS

        Multi-sided platforms, Recommender systems, Popularity bias,
        Multistakeholder recommendation


        [1 Introduction]

        Popularity bias is a well-known phenomenon in recommender systems:
        popular items are recommended even more frequently than

        their popularity would warrant, amplifying the long-tail effect already
        present in many recommendation domains. Prior research

        has examined the impact of this bias on some properties of the

        recommenders such as aggregate diversity (aka catalog coverage)

        [4, 22]. One of the consequences of the popularity bias is disfavoring
        less popular items where the recommendations are not fair in

        terms of the amount of exposure they give to different items with

        varying degree of popularity: an exposure bias. However, as we

        discuss in [1], many recommender systems are multi-stakeholder

        environments in which the needs and interests of multiple stakeholders
        should be taken into account in the implementation and

        evaluation of such systems.

        In many multi-stakeholder recommenders as described in [1]

        two main stakeholders (or what often is being referred to as sides

        in multi-sided platforms [11] ) can be identified: consumers (aka

        users) and suppliers. For instance, in a music platform such as

        Spotify, on one side there are users who get recommendations for

        songs in which they are interested and, on the other side, there are


        artists whose songs are being recommended to different users. The

        popularity bias can be investigated from both sides’ perspective.

        Regarding the users, not everyone has the same level of interest

        in popular items. In the music domain as an example, some users

        might be interested in internationally popular artists such as Drake,

        Beyoncé, or Ed Sheeran and some might be more interested in

        artists from their own culture that might not necessarily have the

        same popularity as the aforementioned artists (such as the Iranian

        musician Kayhan Kalhor) or generally they prefer certain type

        of music that might not be popular among the majority of other

        users (such as country music). With that being said, we expect the

        personalization to handle this difference in taste but as we will see

        in section 4.1 that is certainly not the case.

        The suppliers also do not have the same level of popularity.

        In many recommendation domains including movies, music, or

        even house sharing, few suppliers have a large audience while

        the majority of others may not be as popular though they still

        might have their fair share of audience. Now the question is, do

        recommender systems let different suppliers with varying degree

        of popularity reach their desired audience? Again, the short answer

        is no, as we will see in more detail in section 4.2.

        Investigating the impact of recommendation algorithms on the

        exposure bias on both users and suppliers is the focus of this paper.

        We study several recommendation models in terms of their inherent

        popularity bias and propose metrics that can measure such impact.


        [6 Conclusion]

        Recommender systems are multi-stakeholder environments; in addition to
        the users, some other stakeholders such as the supplier

        of the items also benefit from the recommendation of their items

        and gaining a larger audience. The algorithmic popularity bias can

        negatively impact both users and suppliers on a recommender system
        platform. In this paper, we demonstrated the severity of the

        popularity bias impact on different sides of a recommender system

        using several recommendation algorithms on two datasets. We also

        proposed metrics to quantify the exposure bias from the perspective of
        both the users and suppliers. Our experiments showed that

        when the recommendations are calibrated for the users in terms of

        popularity (lower U PD), it will also benefit the suppliers of the
        recommendations by giving them proportional exposure (lower SPD ).

        We believe it is crucial for recommender systems

        researchers to consider the implications of real-world recommenders,

        where a single-stakeholder focus might not address all the complexities.
      - "[Preamble]\nImproving language models by retrieving\nfrom trillions of tokens\nSebastian Borgeaudy, Arthur Menschy, Jordan Hoffmanny, Trevor Cai, Eliza Rutherford, Katie Millican,\nGeorge van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas,\nAurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones,\nAlbin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero,\nKaren Simonyan, Jack W. Raez, Erich Elsenzand Laurent Sifrey,z\nAll authors from DeepMind,yEqual contributions,zEqual senior authorship\nWe enhance auto-regressive language models by conditioning on document chunks retrieved from a\nlarge corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our\nRetrieval-Enhanced Transformer (R/e.sc/t.sc/r.sc/o.sc) obtains comparable performance to GPT-3 and Jurassic-1\non the Pile, despite using 25\x02fewer parameters. After fine-tuning,R/e.sc/t.sc/r.sc/o.scperformance translates to\ndownstream knowledge-intensive tasks such as question answering.R/e.sc/t.sc/r.sc/o.sccombines a frozenB/e.sc/r.sc/t.sc\nretriever, adifferentiableencoderandachunkedcross-attentionmechanismtopredicttokensbasedon\nan order of magnitude more data than what is typically consumed during training. We typically train\nR/e.sc/t.sc/r.sc/o.scfrom scratch, yet can also rapidlyR/e.sc/t.sc/r.sc/o.scfit pre-trained transformers with retrieval and still\nachieve good performance. Our work opens up new avenues for improving language models through\nexplicit memory at unprecedented scale.\n\n[1. Introduction]\nLanguage modelling (LM) is an unsupervised task that consists of modelling the probability of text,\nusually by factorising it into conditional next-token predictions𝑝¹𝑥1\x94\x93\x93\x93\x94𝑥 𝑛º= Î\n𝑖 𝑝¹𝑥𝑖j𝑥\x9D𝑖º. Neural\nnetworks have proven to be powerful language models, first in the form of recurrent architectures\n(Graves, 2013; Jozefowicz et al., 2016; Mikolov et al., 2010) and more recently in the form of\nTransformers (Vaswani et al., 2017), that use attention to contextualise the past. Large performance\nimprovementshavecomefromincreasingtheamountofdata, trainingcompute, ormodelparameters.\nTransformers have been scaled from100 million parameter models in seminal work to over hundred\nbillion parameters (Brown et al., 2020; Radford et al., 2019) in the last two years which has led to\nmodels that do very well on a wide array of tasks in a zero or few-shot formulation. Increasing model\nsize predictably improves performance on a wide range of downstream tasks (Kaplan et al., 2020).\nThe benefits of increasing the number of parameters come from two factors: additional computations\nat training and inference time, and increased memorization of the training data.\nIn this work, we endeavor to decouple these, by exploring efficient means of augmenting language\nmodels with a massive-scale memory without significantly increasing computations. Specifically, we\nsuggest retrieval from a large text database as a complementary path to scaling language models.\nInstead of increasing the size of the model and training on more data, we equip models with the\nability to directly access a large database to perform predictions—a semi-parametric approach. At\na high level, our Retrieval Transformer (R/e.sc/t.sc/r.sc/o.sc) model splits the input sequence into chunks and\nretrieves text similar to the previous chunk to improve the predictions in the current chunk. 
Existing\nretrieval for language modelling work only considers small transformers (100 millions parameters)\nand databases of limited size (up to billions of tokens) (Guu et al., 2020; Khandelwal et al., 2020;\nLewisetal.,2020;Yogatamaetal.,2021). Toourknowledge, ourworkisthefirsttoshowthebenefits\nof scaling the retrieval database to trillions of tokens for large parametric language models. Our main\nCorresponding authors: {sborgeaud|amensch|jordanhoffmann|sifre}@deepmind.com\narXiv:2112.04426v3  [cs.CL]  7 Feb 2022\nImproving language models by retrieving from trillions of tokens\n200 400 800 1600 7500\nNumber of Non-Embedding Params (M)\n0.7\n0.8\n0.9\n1.0C4 Eval bits-per-byte\n172M 425M 1.5B 7.5B Baseline RETRO [OFF] RETRO [ON]\n0 1 10 100 1000 10000\nRetrieval dataset (B Tokens)\n0.7\n0.8\n0.9\n1.0\n0 1 3 5 10 30 50 100\nNumber of neighbors\n0.7\n0.8\n0.9\n1.0\nFigure 1jScaling ofR/e.sc/t.sc/r.sc/o.sc. The performance gain of our retrieval models remains constant with\nmodel scale (left), and is comparable to multiplying the parameteric model size by\x1810\x02. The gain\nincreases with the size of the retrieval database (middle) and the number of retrieved neighbours\n(right) on the C4 validation set, when using up to 40 neighbours. Past this, performance begins to\ndegrade, perhaps due to the reduced quality. At evaluationR/e.sc/t.sc/r.sc/o.sccan be used without retrieval\ndata (R/e.sc/t.sc/r.sc/o.sc[OFF]), bringing limited performance degradation compared to baseline transformers.\ncontributions are the following.\n• We introduceR/e.sc/t.sc/r.sc/o.sc, a retrieval-enhanced autoregressive language model (§2.2). We use a\nchunked cross-attention module to incorporate the retrieved text (§2.4), with time complexity\nlinear in the amount of retrieved data. We show that retrieving based on a pre-trained frozen\nB/e.sc/r.sc/t.scmodel (§2.3) works at scale, removing the need for training and updating a retriever\nnetwork.\n• We show that our method scales well with model size and database size (Fig. 1):R/e.sc/t.sc/r.sc/o.sc\nprovides a constant gain for models ranging from 150M to 7B parameters, andR/e.sc/t.sc/r.sc/o.sccan be\nimproved at evaluation time by increasing the database size and the number of retrieved neighbours. Our largest model obtains state-of-the-art results on a range of downstream evaluation\ndatasets including Wikitext103 (Merity et al., 2017) and the Pile (Gao et al., 2020) (§4). We\nshow thatR/e.sc/t.sc/r.sc/o.sccan be fine-tuned to achieve competitive performance on downstream tasks\nsuch as question answering (§4.3).\n• We propose an evaluation aware of proximity of test documents with the training set (§2.6),\naddressing the problem of test set leakage (Lee et al., 2021). This is relevant for all language\nmodels,andespeciallyforretrieval-enhancedmodelssincetheyhavedirectaccesstothetraining\ndataset during evaluation. Using this methodology, we show that the performance ofR/e.sc/t.sc/r.sc/o.sc\ncomes from both explicit neighbour copying and general knowledge extraction (§4.4).\n\n[2. Method]\nWedesignourretrieval-enhancedarchitecturetobecapableofretrievingfromadatabasewithtrillions\nof tokens. For this purpose, we retrieve at the level of contiguous tokenchunks instead of individual\ntokens which reduces storage and computation requirements by a large linear factor. Our method first\nconstructs a key-value database, where values store raw chunks of text tokens and keys are frozen\nB/e.sc/r.sc/t.scembedddings (Devlin et al., 2019). 
We use a frozen model to avoid having to periodically\nre-compute embeddings over the entire database during training. Each training sequence is then split\ninto chunks, which are augmented with their𝑘-nearest neighbour retrieved from the database. An\nencoder-decoder architecture integrates retrieval chunks into the model’s predictions. We summarize\nthe R/e.sc/t.sc/r.sc/o.scarchitecture in Fig. 2, and detail it in this section. We end the section by introducing\n\nImproving language models by retrieving from trillions of tokens\nCCA FFW\nTransformer \nEncoderRetrieval\ndataset\nFrozen kNN Retriever\nK V\nRETRO block (x L) \nNeighbours\nInput \ntokens\nChunked cross-attention (CCA)\nBERT\nBERT\nCondition\nAttending chunks\nEncoded neighbours\nCA\nCA\nATTN QEMB READ\nAttend\nEncoded neighbours\nC1\nC2\nC3\nH1\nH2\nH3\nH\nH1\n+\nH2\n+\nE1\n E2\nE1\nE2\nCA(H1\n+, E1)\nCA(H2\n+, E2)\nCCA(H, E)\nX\nFigure 2jR/e.sc/t.sc/r.sc/o.scarchitecture. Left: simplified version where a sequence of length𝑛= 12 is split\ninto𝑙 = 3 chunksofsize 𝑚 = 4. Foreachchunk, weretrieve 𝑘 = 2 neighboursof 𝑟 = 5 tokenseach. The\nretrieval pathway is shown on top.Right: Details of the interactions in theC/c.sc/a.scoperator. Causality is\nmaintained as neighbours of the first chunk only affect the last token of the first chunk and tokens\nfrom the second chunk.\na new methodology to evaluate language models when an evaluation set is partially present in the\ntraining set.\n\n[2. The Model Receives The Corresponding]\nvalues R/e.sc/t.sc¹𝐶º, ¹»𝑁1\x94𝐹1¼\x94\x93\x93\x93\x94 »𝑁𝑘\x94𝐹𝑘¼º. Both neighbour chunks and their continuations provide\nmeaningful improvements, as illustrated in our ablation study (Appendix D). We use a length64 for\nboth 𝑁𝑗 and 𝐹𝑗, thusR/e.sc/t.sc¹𝐶ºhas a shape of𝑘\x02𝑟 with 𝑟 = 128. To avoid retrieving the chunk𝐶𝑢¸1\nin the retrieval setR/e.sc/t.sc¹𝐶𝑢º, which would break causality during training, we filter out neighbours\noriginating from the same document as the training sequence𝑋.\nFor a database of𝑇 elements, we can query the approximate nearest neighbours inO¹log𝑇ºtime.\nWe use the SCaNN library (Guo et al., 2020) to achieve this. This means that we can query our\n2 trillion token database in10 ms whilst evaluating or sampling from the model; this expense is\namortized over a chunk length. Performing retrieval on-the-fly is too slow to keep up with the training\ncalculations—we leverage the frozen aspect of the embedding operatorB/e.sc/r.sc/t.scto precompute all\napproximate nearest neighbours and save the results as part of the data. In Fig. 9 in the Appendix, we\nshow results where we only retrieve neighbours within Wikipedia. We find that neighbours tend to\ncome from 2-3 links away from a given article whereas random articles are more than 5 links apart.\nTable1 jMassiveText. Thelastcolumnindicatesthesamplingweightduringtraining. Themultilingual\nsubsets include documents in 10 languages. The full breakdown is given in §A.1.\nSource Token count (M) Documents (M) Multilingual Sampling frequency\nWeb 977,563 1,208 Yes 55%\nBooks 3,423,740 20 No 25%\nNews 236,918 398 No 10%\nWikipedia 13,288 23 Yes 5%\nGitHub 374,952 143 No 5%\n\nImproving language models by retrieving from trillions of tokens\n2.4. R/e.sc/t.sc/r.sc/o.scmodel architecture\nOur model relies on an encoder-decoder transformer architecture, integrating the retrieved data\nthrough a cross-attention mechanism as introduced in Vaswani et al. (2017). 
First, the retrieved\ntokens R/e.sc/t.sc¹𝐶ºare fed into an encoder Transformer, which computes the encoded neighbours set𝐸.\nDenoting the intermediate activations by𝐻, our transformer decoder then interleavesR/e.sc/t.sc/r.sc/o.sc-blocks\nR/e.sc/t.sc/r.sc/o.sc¹𝐻\x94𝐸 ºand standard Transformer blocksLM ¹𝐻º(the hyperparameter𝑃 \x12»1\x94𝐿¼determines at\nwhich layers we use aR/e.sc/t.sc/r.sc/o.sc-block). These blocks are built from three different residual operators\nwith signatureℝ𝑛\x02𝑑 !ℝ𝑛\x02𝑑: a fully-connected layerF/f.sc/w.sc, the standard sequence-level self-attention\nlayer A/t.sc/t.sc/n.sc, and a chunked cross-attention layerC/c.sc/a.sc¹\x01\x94𝐸ºthat incorporates information from the\nretrieval encoder:\nR/e.sc/t.sc/r.sc/o.sc¹𝐻\x94𝐸 º, F/f.sc/w.sc¹C/c.sc/a.sc¹A/t.sc/t.sc/n.sc¹𝐻º\x94𝐸ºº\x94 and L/m.sc¹𝐻º, F/f.sc/w.sc¹A/t.sc/t.sc/n.sc¹𝐻ºº (2)\nSince F/f.sc/w.sc, A/t.sc/t.sc/n.scand C/c.sc/a.scare all autoregressive operators whose output at position𝑖 only\ndepends on ¹ℎ𝑗º𝑗6𝑖, any succession ofR/e.sc/t.sc/r.sc/o.scand /l.sc/m.sclayers, followed by a token classification\nhead defines an autoregressive log-likelihood(1). An overview of the model architecture is given in\nAlgorithm 1 and in Fig. 2. We next describe the retrieval encoder and the chunked cross-attention\nlayer in more detail, and explain how to sample fromR/e.sc/t.sc/r.sc/o.sc.\nEncodingretrievalneighbours. Foreachchunk 𝐶𝑢,the 𝑘retrievalneighbours R/e.sc/t.sc¹𝐶𝑢ºarefedinto\na bi-directional transformerE/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc, yielding the outputs𝐸𝑗\n𝑢 , E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º𝑗\x94𝐻𝑢º2 ℝ𝑟\x02𝑑0\n,\nwhere 𝑗 2 »1\x94𝑘¼indexes each neighbour. The retrieval encoder is a non-causal transformer. It\nis conditioned on𝐻𝑢, the activations of chunk𝐶𝑢, through cross-attention layers; this allows the\nrepresentations of the retrieval encoder to be modulated by the retrieving chunk in a differentiable\nway. More precisely, the encoding of the𝑗th neighbour of the𝑢th chunk, R/e.sc/t.sc¹𝐶𝑢º𝑗, depends on the\nattended activation 𝐻𝑢 , ¹ℎ¹𝑢\01º𝑚¸𝑖º𝑖2»1\x94𝑚¼ 2ℝ𝑚\x02𝑑 of chunk𝐶𝑢 at layermin¹𝑃º. All neighbours for\nall chunks are encoded in parallel, yielding a full encoded set𝐸 , ¹𝐸𝑗\n𝑢º𝑢2»1\x94𝑙¼\x94𝑗2»1\x94𝑘¼ 2ℝ𝑙\x02𝑘\x02𝑟\x02𝑑0\n. We\ndenote 𝐸𝑢 2ℝ𝑘\x02𝑟\x02𝑑0\nas the encoded neighbours for chunk𝑢 2»1\x94𝑙¼.\nChunked cross-attention. To perform theC/c.sc/a.scoperation, we first split a given intermediate activation 𝐻 2ℝ𝑛\x02𝑑 into 𝑙\01 attending chunks\n\x10\n𝐻¸\n𝑢 , ¹ℎ𝑢𝑚¸𝑖\01º𝑖2»1\x94𝑚¼ 2ℝ𝑚\x02𝑑\n\x11\n𝑢2»1\x94𝑙\01¼\n, as depicted on the\nright of Fig. 2.𝐻¸\n𝑢 holds the intermediary embeddings of the last token in chunk𝐶𝑢 and of the first\n𝑚\01 tokens in𝐶𝑢¸1 2. We compute the cross-attention between𝐻¸\n𝑢 and 𝐸𝑢—the encoded retrieval\nset obtained from chunk𝐶𝑢. Attention is computed across time and across neighbours simultaneously,\nas we merge the neighbour and time dimensions of𝐸𝑢 before applying cross-attention. Since there\nis a notion of alignment between data chunks and retrieval neighbours, we use relative positional\nencodings as described in §B.1.2.\nWe concatenate the𝑙\01 outputs of the per-chunk cross-attentions (each of shape𝑚\x02𝑑) across\ntime, and properly pad the result; we thus form the output activationC/c.sc/a.sc¹𝐻\x94𝐸 º2 ℝ𝑛\x02𝑑. Formally,\nfor each chunk𝐶𝑢 and for each token𝑖 2»1\x94𝑚¼we set\nC/c.sc/a.sc¹𝐻\x94𝐸 º𝑢𝑚¸𝑖\01 , C/a.sc¹ℎ𝑢𝑚¸𝑖\01\x94𝐸𝑢º\x94 (3)\n2The last token of chunk𝐶𝑢 is the first to be able to access the retrieved content𝐸𝑢 while maintaining autoregressivity\nin (1). 
Hence, there is a one token overlap between chunk𝐶𝑢 =\n\x10\n𝑥¹𝑢\01º𝑚¸𝑖\n\x11\n𝑖2»1\x94𝑚¼\nand the corresponding attending chunk\n𝐶¸\n𝑢 , ¹𝑥𝑢𝑚¸𝑖\01º𝑖2»1\x94𝑚¼.\n\nImproving language models by retrieving from trillions of tokens\nAlgorithm 1: Overview ofR/e.sc/t.sc/r.sc/o.scmodel architecture.\nHyperparam: 𝑃 and 𝑃enc, indices of layers with cross-attention in the decoder and encoder\nrespectively\nHyperparam: 𝐿and 𝐿enc, number of decoder layers and number of encoder layers.\nInput: 𝑋 2𝕍𝑛: sequence of tokens.¹R/e.sc/t.sc¹𝐶𝑢ºº16𝑢6𝑙: the retrieved neighbours\nOutput: 𝑂 2ℝ𝑛\x02j𝕍j: the output logits\ndef E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º16𝑢6𝑙\x94𝐻 º:\n¹𝐻𝑢º𝑢2»1\x94𝑙¼ S/p.sc/l.sc/i.sc/t.sc¹𝐻º\nfor 𝑗 2»1\x94𝑘¼\x94𝑢 2»1\x94𝑙¼do // Encoder shared across neighbours and chunks\n𝐸𝑗\n𝑢 = E/m.sc/b.scenc¹R/e.sc/t.sc¹𝐶𝑢º𝑗º // May be shared with the decoder E M B\nfor 𝑝02»1\x94𝐿enc¼do\n𝐸𝑗\n𝑢  A/t.sc/t.sc/n.scenc¹𝐸𝑗\n𝑢º // Bi-directional attention\nif 𝑝02𝑃enc then\n𝐸𝑗\n𝑢  C/a.scenc¹𝐸𝑗\n𝑢\x94𝐻𝑢º\n𝐸𝑗\n𝑢  F/f.sc/w.scenc¹𝐸𝑗\n𝑢º\nreturn 𝐸\n𝐻  E/m.sc/b.sc¹𝑋º\nfor 𝑝 2»1\x94𝐿¼do\n𝐻  A/t.sc/t.sc/n.sc¹𝐻º // Causal attention\nif 𝑝= min¹𝑃ºthen\n// The neighbour E N C O D E Ris conditioned with the decoder activations of\nthe last layer before the first cross-attention\n𝐸 = E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º16𝑢6𝑙\x94𝐻 º\nif 𝑝 2𝑃 then\n𝐻  C/c.sc/a.sc¹𝐻\x94𝐸 º\n𝐻  F/f.sc/w.sc¹𝐻º\n𝑂 R/e.sc/a.sc/d.sc¹𝐻º\nwhere C/a.scis the cross-attention residual operator over time-concatenated encoded neighbours. We\nrecall that this operator is defined in its simplest version by three parameter matrices𝐾 2ℝ𝑑\x02𝑐\x94𝑄 2\nℝ𝑑\x02𝑐 and𝑉 2ℝ𝑑\x02𝑑. For allℎ 2ℝ𝑑 and𝑌 2ℝ𝑇\x02𝑑, we define\nC/a.sc¹ℎ\x94𝑌º, softmax¹𝑌𝐾𝑄𝑇ℎº𝑌𝑉\x94 (4)\nwhere the softmax is performed on the second dimension and all products are matrix products. We\nuse multi-head cross-attention, and add positional encodings to the softmax(see §B.1.2).\nThe first𝑚\01 tokens cannot attend to any neighbour of a previous chunk; at these positions, we\ndefine C/c.sc/a.scas the identity, settingC/c.sc/a.sc¹𝐻\x94𝐸 º𝑗 , ℎ𝑗 for all tokens𝑗 2»1\x94𝑚 \01¼. Finally, the last token\nℎ𝑙𝑚 attends to the last retrieval set𝐸𝑙 and we setℎ𝑙𝑚 , C/a.sc¹ℎ𝑙𝑚\x94𝐸𝑙º(not shown in Fig. 2). Listing 1\ncontains a simplified implementation ofC/c.sc/a.sc. Note that chunked cross-attention is autoregressive:\nthe output ofC/c.sc/a.scat position𝑖depends on the sequence from tokens from0 to 𝑖that is input toC/c.sc/a.sc.\nWith R/e.sc/t.sc/r.sc/o.scmodels, even though eachC/c.sc/a.sccross-attention attends only to the neighbours of\nthe preceding chunkR/e.sc/t.sc¹𝐶𝑢\01º, the dependencies over previous neighbours are propagated via the\nself-attentionoperations. Theactivationsofthe 𝑖th tokeninthe 𝑢th chunkthereforepotentiallydepend\nupon the set ofallprevious neighboursR/e.sc/t.sc¹𝐶𝑢0º𝑢0\x9D𝑢, without incurring the quadratic cost of cross\nattending to that set.\n\nImproving language models by retrieving from trillions of tokens\nSampling. Whensampling,attheendofachunk 𝐶𝑢,weuseSCaNNtoretrieveneighbours R/e.sc/t.sc¹𝐶𝑢º,\nbased on the embeddingB/e.sc/r.sc/t.sc¹𝐶𝑢º. The encoded neighbours𝐸𝑢 = E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢ººare then\nused to condition the generation of the next chunk𝐶𝑢¸1, which we do incrementally: overall the\ncost of sampling is thus quadratic in the size of the sampled sequence, as when sampling from\nregular Transformers; the added cost of retrieval is linear in the number of chunks𝑙, and is negligible\ncompared to the token sampling cost in practice.\n\n[2.5. 
Baseline Transformer Architecture]\nWe use a transformer (Vaswani et al., 2017) similar to the one described in (Radford et al., 2019),\nwith some minimal changes: we replace LayerNorm with RMSNorm (Zhang and Sennrich, 2019) and\nuse relative position encodings (Dai et al., 2019). As baselines, we train retrieval-free transformers\nwith 132M, 368M, 1.3B and 7.0B parameters (embedding matrices are excluded from parameter\ncounts). The hyperparameters we used are detailed in Table 2. All retrieval models use the same\nsize encoder for the retrieval data, with𝑑0= 896 and 2 layers, which roughly adds19𝑀 parameters.\nThe encoder uses relative positional encodings. The retrieval models contain oneR/e.sc/t.sc/r.sc/o.sc-block every\n3 blocks, starting from layer 6. For our smallest model,C/c.sc/a.scis applied in layers 6, 9 and 12 of the\nmain pathway and also once for query conditioning in the encoder, which adds an additional12𝑀\nparameters. The relative number of extra parameters reduces as we increase the baseline model size.\nAll models are implemented using JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020).\n\n[5. Conclusion]\nWe present Retrieval-Enhanced Transformers (R/e.sc/t.sc/r.sc/o.sc), a method for modelling arbitrary text sequences whilst retrieving from databases with trillions of tokens—scaling the data available to models\nby an order of magnitude compared to what is typically consumed during training.R/e.sc/t.sc/r.sc/o.scmodels\n\nImproving language models by retrieving from trillions of tokens\ngains do not diminish for models with up to at least 7B parameters, and correspond to non-retrieval\nmodels with 10\x02more parameters on certain datasets. On Wikitext103 and the Pile,R/e.sc/t.sc/r.sc/o.scoutperforms previous models trained on large scale datasets. We also show thatR/e.sc/t.sc/r.sc/o.scis competitive on\nretrieval-intensive downstream tasks such as question answering.\nR/e.sc/t.sc/r.sc/o.scmodels are flexible and can be used without retrieval at evaluation and still achieve\ncomparable performance to baseline models. Conversely, baseline models can be rapidly fine-tuned\ninto R/e.sc/t.sc/r.sc/o.scmodelstoobtainnearlythesameperformanceasiftrainedfromscratch. Carefulanalysis\nshows that only a modest fraction of the gains obtained byR/e.sc/t.sc/r.sc/o.scare due to test set leakage. In\ngeneral, we caution for such leakage in large-scale language datasets and suggest further work in\nbetter understanding the role of test set leakage in the performance of large-scale language models.\nOverall, our work demonstrates at an unprecedented scale that semi-parametric approaches can\nprovide an orthogonal, more efficient approach than raw parameter scaling as we seek to build more\npowerful language models.\nAcknowledgements\nWe would like to thank Nikolai Grigorev, Marc’aurelio Ranzato, Cyprien de Masson d’Autume, Po-Sen\nHuang, JohannesWelbl, LisaAnneHendricks, EthanPerez, JeffStanway, EricNoland, GregoryWayne,\nJohn Jumper, Julian Schrittwieser, Lorrayne Bennett, Devang Agrawal, Dani Yogatama, Susannah\nYoung, Nando de Freitas, Demis Hassabis, and Koray Kavukcuoglu for their help, advice and reviews.\nAdditionally, we would like to thank Zonglin Li, David Simcha, and the ScaNN developers for their\nhelp.\n\nImproving language models by retrieving from trillions of tokens\nTable 6jSample - Beavers are interesting animals. 
TheR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]sample quickly diverges to other\nanimalswhilethe R/e.sc/t.sc/r.sc/o.sc[O/n.sc]sampletendstostayfocusedonthebeavertopicduetoneighbourconditioning.\nPrompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]Prompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/n.sc]»𝑁1𝑢\x94𝐹1𝑢¼colored by LCP with𝐶𝑢¸1 »𝑁2𝑢\x94𝐹2𝑢¼colored by LCP with𝐶𝑢¸1\ncolored by LCP withR/e.sc/t.sc¹𝐶𝑢\01º\nLCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5\nBeavers are interesting animals that Beavers are interesting animals that .Beaversbuildtheir lodges in pon naw them into smaller sectionsand d\nlive near rivers. They build live near rivers. They build ds they have createdin wooded areas rag them intothe water.Engineers\n.Like many thingsin nature, there ofthe PondBeaversare interesting\nis a connection between creaturesi animals because they changethe hab\nn the wild.Beaverponds causetree itat in which theylive.Beaversdo\ns todrown, butthe dead trees attra this by blockingup streams to creat\nct the great blue heron, which often e ponds. Then theybuildtheir homes\nreturn year after year. Over time, , calledlodges, inthese ponds. Bea\na beaverpond can attract more than vers’ bodies make them well-suited f\n50 nestsin a colony, calleda rooke or underwater building Special muscl\nry.An example of this canbe found es close offtheir noses, ears, and\nin the large pond off Bradford Road throats to keep the water out. Beave\nat Carter Fields nearthe Boxford l rs’ broadtails act like rudders for\nine.Chris Leahy, an expert withth steering. Their two very large, ora\ne Massachusetts Audubon Society who nge frontteeth are usedto gnaw dow\nwrote n trees. They begin buildingtheir d\nam\ndams to create ponds. Frogs are am their houses called beaverdams in , then they mean that youare veryb ar-like tail, andtwo protrudingtee\nphibians, so they can live in both l the riverbeds. They alsoliveon lan usy. Beaversswim easily in streams, th that are strongenoughto gnaw do\nand and water. They have great camou d.Beaversuse their strong teeth an picking uprocks and sticks to buil wn trees. Thebeaveruses trees, bra\nflage to hide from predators. The G d strong jawsto cut downtrees and d their dams. They gnaw at trees wit nches, andmudto builddams across\nolden Retriever, or Golden, is a ver branches to buildtheir homes. They h their big frontteeth to cut them riversand streams. These dams creat\ny big dog. They are very strong, and also usetheir teeth andjawsto che down. Then they useparts of the tre e a deep poolofwater in whichthe\nhave a thick coat to help them live w up food. Beaversuse their big, fl es tobuildtheir houses.Beavers ar beaverbuilds its home. A beaverhom\nin at tails toswim. They use e clever builders. They know exactly e is calleda lodge. A babybeavero\nwhat they need to buildtheir beave r “kit” remains inthe family lodge\nr dams. They use mud fromthe stream until the age oftwo.Beaverfur, kn\nto maketheir dams stay together. T own as pelt, was once highly popular\nhey use their tails to pat downthe asa trim for hatsand coats. How m\nmud.Beaversput a snug roomat the ight the popularity of beaverfur co\ntop of theirdams fortheir babies. ntributedto thecolonizationofNew\nThey store their food underwater. Be York?wwwWw\naverseat the bark fromthe\ncold areas. A kangaroo is an anima their sharp teeth to gnaw atrocks , thenthey mean that you are very b w attrees with their big frontteet\nl that is very large. It can jump ve and cut trees. Theyuse their flat t usy. 
Beaversswim easily in streams, h to cutthem down. Then they use pa\nry high. They can run very fast. A ails to steer whenthey areswimming picking up rocks and sticks to buil rts of the trees tobuildtheir hous\nmeerkat is a small animal that lives .In the winter, beavershaveto bui d their dams. They gnaw attrees wit es.Beavers are clever builders. The\nin Africa. They live in colonies of ld a house calleda beaverlodge. Th h their big frontteeth to cutthem y know exactly whatthey need to bui\nup to a hundred of them. They can c ey build ahouse that is very strong down. Then they use parts of the tre ld their beaver dams. They use mud f\nlimb trees very easily. . The wallsare made oftwigs. The r es tobuildtheir houses.Beavers ar rom the stream to maketheir dams st\noofis made e clever builders. They know exactly ay together. They use their tails to\nwhat they need to buildtheir beave pat downthe mud.Beavers put a snu\nr dams. They use mud fromthe stream g room atthe top oftheir dams for\nto maketheir dams stay together. T their babies. They store their food\nhey use their tails to pat downthe underwater. Beavers eatthe bark fro\nmud.Beavers put a snug room atthe m the treesthat they cutdown!1. W\ntop oftheir dams fortheir babies. hat isthe main ideaof thefirst pa\nThey store their food underwater. Be ragraph?.2. What isthe main ideao\navers eatthe bark fromthe f thesecond paragraph?\nA mouse is a small mammal that lives ofbranches and other treeparts. T\non land. It is a very good climber hey also use their strong jawsto cu\nand it can run very fast. Penguins t trees. They bring them to theirho\nare birds that live on Antarctica. T use. They alsouse their sharp teeth\nhey have a thick coat to keep them w to chew up thetree parts. They use\narm. Rabbits are small animals that their flat tails to swim to thetop\nlive in the ground. They oftheir house. Then they use their\nteeth andjawsto chew up thetree\n\nImproving language models by retrieving from trillions of tokens\nTable 7jSample - Hamlet, Act 1, Scene 1.The R/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]sample has correct syntax but is hallucinated,\nand ends with repetition of one character (FRANCISCO Approach me not). TheR/e.sc/t.sc/r.sc/o.sc[O/n.sc]sample is the\ncorrect continuation of the original text, and is robust to formatting differences between our prompt and the\nretrieved data.\nPrompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]Prompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/n.sc]»𝑁1𝑢\x94𝐹1𝑢¼colored by LCP with𝐶𝑢¸1 »𝑁2𝑢\x94𝐹2𝑢¼colored by LCP with𝐶𝑢¸1\ncolored by LCP withR/e.sc/t.sc¹𝐶𝑢\01º\nLCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5\nACT I SCENE I. Elsinore. A pla ACT I SCENE I. Elsinore. A pla ACTI SCENE I.Elsinore. A pla ><TEXT>ACTI SCENE I.Elsin\ntform before the castle. FRANC tform before the castle. FRANC tform beforethe castle. (FRAN ore. A platform beforethe cas\nISCO at his post. Enter to him ISCO at his post. Enter to him CISCO athis post. Enter to hi tle. FRANCISCO at his post. E\nBERNARDO BERNARDO Who’s there BERNARDO BERNARDO Who’s there m BERNARDO) BERNARDO Who’s the nter to him BERNARDO BERNARDO\n? FRANCISCO Nay, answer me: st ? FRANCISCO Nay, answer me: st re? FRANCISCO Nay, answer me: Who’s there? FRANCISCO Nay, an\nand, and unfold yourself. BERN and, and unfold yourself. BERN stand, and unfold yourself. BE swer me: stand, and unfold you\nARDO ARDO RNARDO Long livethe king! FRA rself. BERNARDO Long livethe\nNCISCO Bernardo? BERNARDO He. king! FRANCISCO Bernardo? 
BERN\nFRANCISCO You come most carefu ARDO He. FRANCISCO You come mo\nlly upon your hour. BERNARDO ’ st carefully upon your hour. B\nTis now struck twelve; get the ERNARDO ’Tis now struck twelve\ne to bed, Francisco. FRANCISCO ; get thee to bed, Francisco.\nFor this relief much thanks: FRANCISCO For this relief much\n’tis bitter cold, And I am sic thanks: ’tis bitter cold, And\nk at heart. BERNARDO Have you I am sick at heart.\nWho calls ? FRANCISCO I am th Long livethe king! FRANCISCO Long live the king! FRANCISCO live the king! FRANCISCO Bern\ne lord here; I, Francisco, tha Bernardo? BERNARDO He. FRANCI Bernardo? BERNARDO He. FRANCI ardo? BERNARDO He. FRANCISCO Y\nt am sick of grief. [ Aside. B SCO You come most carefully up SCO You come most carefully up ou come most carefully upon yo\nERNARDO The king ! FRANCISCO I on your hour. BERNARDO ’Tis no on your hour. BERNARDO ’Tis no ur hour. BERNARDO ’Tis now str\nam sick of that also. BERNARD w struck twelve; get thee to b w struck twelve; get thee to b uck twelve: get thee to bed, F\nO My lord ? FRANCISCO Do not a ed, Francisco. FRANCISCO For t ed, Francisco. FRANCISCO For t rancisco. FRANCISCO For this r\npproach me. BERNARDO his relief much thanks: ’tis b his relief much thanks: ’tis b elief much thanks: ’tis bitter\nitter cold, And I am sick at h itter cold, And I am sick at h cold, And I am sick at heart.\neart. B eart.</TEXT></DOC><DOC><DO BERNARDO Haveyou had quiet g\nCNO>romeo</DOCNO><TEXT>ACTI uard? FRANCISCO Not a mouse st\nPROLOGUE Two households, bo irring. BERNARDO Well, good ni\nth alike in dignity, In fair V ght. Ifyoudo meet Horatio and\nerona, where we lay our scene, Marcellus, The rivals2 ofmy\nFrom ancient grudge break to watch, bid them make haste. FR\nnew mutiny, ANCISCO I think I hear them.—\nStand, ho! who is there?EN\nFrancisco, I would speak with ERNARDO Haveyou had quiet gua had quiet guard? FRANCISCO No ARDO Haveyouhad quiet guard?\nyou. FRANCISCO Approach me not rd? FRANCISCO Not a mouse stir t a mouse stirring. BERNARDO W FRANCISCO Not a mouse stirrin\n, but speak. BERNARDO Your han ring. BERNARDO Well, good nigh ell, goodnight. If youdo mee g. BERNARDO Well, goodnight.\nd, your voice FRANCISCO I will t. Ifyou domeet Horatio and t Horatioand Marcellus, The r Ifyou do meet Horatioand Marc\nnot hear thee speak. BERNARDO Marcellus, The rivalsofmy wa ivals ofmy watch, bid them ma ellus, The rivals2 ofmy watch\nFrancisco, your hand, I entre tch, bid them make haste. FRAN ke haste. FRANCISCO I think I , bid them make haste. FRANCIS\nat thee. FRANCISCO Approach me CISCO I think I hear them. Sta hear them. Stand, ho! Who’s th CO I think I hear them.— Stand\nnot. BERNARDO Francisco FRANC nd, ho! who is there?Enter ere? (Enter HORATIO and MARCEL , ho! who is there? ENTER HORA\nLUS) HORATIO Friends to this g TIO AND MARCELLUS. HORATIO Fri\nround. MARCELLUS And liegemen ends to this ground. MARCELLUS\nto the Dane. FRANCISCO Give yo And liegemen to the Dane.3 FR\nu good night. MARCELLUS O, far ANCISCO Giveyougood night. M\newell, honest soldier: Who hat ARCELLUS O, farewell, honest s\nh relieved you? FRANCISCO Bern oldier: Who hath relieved you?\nardo hasmy place. Giveyou go FRANCISCO Bernardo hath my pl\nod night.(Exit ace. Give you good night\nISCO Approach me not. BERNARDO HORATIO and MARCELLUSHORATIO\nI have a letter FRANCISCO App Friends to this ground. MARCE\nroach me not. BERNARDO For the LLUS And liegemen to the Dane.\nking. FRANCISCO Approach me n FRANCISCO Give you good night\not. BERNARDO There’s no treaso . 
MARCELLUS O, farewell, hones\nn in’t. FRANCISCO Approach me t soldier: Who hath relieved y\nnot. BERNARDO I will ou? FRANCISCO Bernardo hath my\nplace. Give you good night.\n\nImproving language models by retrieving from trillions of tokens"
      - >-
        [ABSTRACT]

        We propose a novel attention gate (AG) model for medical imaging that
        automatically learns to focus on target structures of varying shapes and
        sizes. Models

        trained with AGs implicitly learn to suppress irrelevant regions in an
        input image

        while highlighting salient features useful for a specific task. This
        enables us to

        eliminate the necessity of using explicit external tissue/organ
        localisation modules

        of cascaded convolutional neural networks (CNNs). AGs can be easily
        integrated

        into standard CNN architectures such as the U-Net model with minimal
        computational overhead while increasing the model sensitivity and
        prediction accuracy.

        The proposed Attention U-Net architecture is evaluated on two large CT
        abdominal

        datasets for multi-class image segmentation. Experimental results show
        that AGs

        consistently improve the prediction performance of U-Net across
        different datasets

        and training sizes while preserving computational efficiency. The source
        code for

        the proposed architecture is publicly available.


        [1 Introduction]

        Automated medical image segmentation has been extensively studied in the
        image analysis community

        due to the fact that manual, dense labelling of large amounts of medical
        images is a tedious and

        error-prone task. Accurate and reliable solutions are desired to
        increase clinical work flow efficiency

        and support decision making through fast and automatic extraction of
        quantitative measurements.

        With the advent of convolutional neural networks (CNNs),
        near-radiologist level performance can

        be achieved in automated medical image analysis tasks including cardiac
        MR segmentation [3] and

        cancerous lung nodule detection [17]. High representation power, fast
        inference, and filter sharing

        properties have made CNNs the de facto standard for image segmentation.
        Fully convolutional

        networks (FCNs) [ 18] and the U-Net [ 24] are two commonly used
        architectures. Despite their

        good representational power, these architectures rely on multi-stage
        cascaded CNNs when the target

        organs show large inter-patient variation in terms of shape and size.
        Cascaded frameworks extract a

        region of interest (ROI) and make dense predictions on that particular
        ROI. The application areas

        include cardiac MRI [ 14], cardiac CT [ 23], abdominal CT [ 26, 27]
        segmentation, and lung CT

        nodule detection [17]. However, this approach leads to excessive and
        redundant use of computational

        resources and model parameters; for instance, similar low-level features
        are repeatedly extracted by

        all models within the cascade. To address this general problem, we
        propose a simple and yet effective

        solution, namely attention gates(AGs). CNN models with AGs can be
        trained from scratch in a

        standard way similar to the training of a FCN model, and AGs
        automatically learn to focus on target

        1st Conference on Medical Imaging with Deep Learning (MIDL 2018),
        Amsterdam, The Netherlands.

        arXiv:1804.03999v3  [cs.CV]  20 May 2018

        structures without additional supervision. At test time, these gates
        generate soft region proposals

        implicitly on-the-fly and highlight salient features useful for a specific
        task. Moreover, they do not

        introduce significant computational overhead and do not require a large
        number of model parameters

        as in the case of multi-model frameworks. In return, the proposed AGs
        improve model sensitivity and

        accuracy for dense label predictions by suppressing feature activations
        in irrelevant regions. In this

        way, the necessity of using an external organ localisation model can be
        eliminated while maintaining

        the high prediction accuracy. Similar attention mechanisms have been
        proposed for natural image

        classification [11] and captioning [1] to perform adaptive feature
        pooling, where model predictions

        are conditioned only on a subset of selected image regions. In this
        paper, we generalise this design

        and propose image-grid based gating that allows attention coefficients to
        be specific to local regions.

        Moreover, our approach can be used for attention-based dense
        predictions.

        We demonstrate the implementation of AG in a standard U-Net architecture
        (Attention U-Net) and

        apply it to medical images. We choose the challenging CT pancreas
        segmentation problem to provide

        experimental evidence for our proposed contributions. This problem
        constitutes a difficult task due to

        low tissue contrast and large variability in organ shape and size. We
        evaluate our implementation on

        two commonly used benchmarks: TCIA Pancreas CT-82 [25] and multi-class
        abdominal CT-150.

        The results show that AGs consistently improve prediction accuracy across
        different datasets and

        training sizes while achieving state-of-the-art performance without
        requiring multiple CNN models.


        [2 Methodology]

        Fully Convolutional Network (FCN):Convolutional neural networks (CNNs)
        outperform traditional approaches in medical image analysis on public
        benchmark datasets [14, 17] while being an

        order of magnitude faster than, e.g., graph-cut and multi-atlas
        segmentation techniques [34]. This

        is mainly attributed to the fact that (I) domain specific image features
        are learnt using stochastic

        gradient descent (SGD) optimisation, (II) learnt kernels are shared
        across all pixels, and (III) image

        convolution operations exploit the structural information in medical
        images well. In particular, fully

        convolutional networks (FCN) [ 18] such as U-Net [ 24], DeepMedic [ 13]
        and holistically nested

        networks [16, 35] have been shown to achieve robust and accurate
        performance in various tasks

        including cardiac MR [3], brain tumours [12] and abdominal CT [26, 27]
        image segmentation tasks.

        Convolutional layers progressively extract higher dimensional image
        representations ( xl) by processing local information layer by layer.
        Eventually, this separates pixels in a high dimensional

        space according to their semantics. Through this sequential process,
        model predictions are conditioned on information collected from a large
        receptive field. Hence, feature-map xl is obtained

        at the output of layer l by sequentially applying a linear transformation
        followed by a non-linear

        activation function. It is often chosen as the rectified linear unit:
        σ1(x^l_{i,c}) = max(0, x^l_{i,c}), where i and

        c denote spatial and channel dimensions respectively. Feature
        activations can be formulated as:

        x^l_c = σ1( Σ_{c′∈F_l} x^{l−1}_{c′} ∗ k_{c′,c} )

        where ∗ denotes the convolution operation, and the spatial subscript
        (i) is omitted in the formulation for notational clarity. The function
        f(x^l; Φ^l) = x^{l+1} applied in

        convolution layer l is characterised by trainable kernel parameters Φ^l.
        The parameters are learnt by



        Figure 2: Schematic of the proposed additive attention gate (AG). Input
        features ( xl) are scaled

        with attention coefficients (α) computed in AG. Spatial regions are
        selected by analysing both the

        activations and contextual information provided by the gating signal (g)
        which is collected from a

        coarser scale. Grid resampling of attention coefficients is done using
        trilinear interpolation.

        minimising a training objective, e.g. cross-entropy loss, using
        stochastic gradient descent (SGD).

        In this paper, we build our attention model on top of a standard U-Net
        architecture. U-Nets are

        commonly used for image segmentation tasks because of their good
        performance and efficient use

        of GPU memory. The latter advantage is mainly linked to extraction of
        image features at multiple

        image scales. Coarse feature-maps capture contextual information and
        highlight the category and

        location of foreground objects. Feature-maps extracted at multiple
        scales are later merged through

        skip connections to combine coarse- and fine-level dense predictions as
        shown in Figure 1.

        Attention Gates for Image Analysis:To capture a sufficiently large
        receptive field and thus, semantic contextual information, the
        feature-map grid is gradually downsampled in standard CNN

        architectures. In this way, features on the coarse spatial grid level
        model location and relationship

        between tissues at global scale. However, it remains difficult to reduce
        false-positive predictions for

        small objects that show large shape variability. In order to improve the
        accuracy, current segmentation

        frameworks [14, 26, 27] rely on additional preceding object localisation
        models to simplify the task

        into separate localisation and subsequent segmentation steps. Here, we
        demonstrate that the same

        objective can be achieved by integrating attention gates (AGs) in a
        standard CNN model. This does

        not require the training of multiple models and a large number of extra
        model parameters. In contrast

        to the localisation model in multi-stage CNNs, AGs progressively
        suppress feature responses in

        irrelevant background regions without the requirement to crop a ROI
        between networks.

        Attention coefficients, αi ∈[0,1], identify salient image regions and
        prune feature responses to

        preserve only the activations relevant to the specific task as shown in
        Figure 3a. The output of AGs is

        the element-wise multiplication of input feature-maps and attention
        coefficients: x̂^l_{i,c} = x^l_{i,c} · α^l_i. In

        a default setting, a single scalar attention value is computed for each
        pixel vector x^l_i ∈ R^{F_l}, where

        F_l corresponds to the number of feature-maps in layer l. In case of
        multiple semantic classes, we

        propose to learn multi-dimensional attention coefficients. This is
        inspired by [ 29], where multidimensional attention coefficients are used
        to learn sentence embeddings. Thus, each AG learns to

        focus on a subset of target structures. As shown in Figure 2, a gating
        vector gi ∈RFg is used for

        each pixel ito determine focus regions. The gating vector contains
        contextual information to prune

        lower-level feature responses as suggested in [32], which uses AGs for
        natural image classification.

        We use additive attention [2] to obtain the gating coefficient. Although
        this is computationally more

        expensive, it has experimentally shown to achieve higher accuracy than
        multiplicative attention [19].

        Additive attention is formulated as follows:

        q^l_att = ψ^T ( σ1( W_x^T x^l_i + W_g^T g_i + b_g ) ) + b_ψ   (1)

        α^l_i = σ2( q^l_att(x^l_i, g_i; Θ_att) )   (2)

        where σ2(x_{i,c}) = 1 / (1 + exp(−x_{i,c})) corresponds to the sigmoid
        activation function. AG is characterised

        by a set of parameters Θ_att containing: linear transformations
        W_x ∈ R^{F_l×F_int}, W_g ∈ R^{F_g×F_int},

        ψ ∈ R^{F_int×1} and bias terms b_ψ ∈ R, b_g ∈ R^{F_int}. The linear transformations

        are computed using
        channel-wise 1x1x1 convolutions for the input tensors. In other contexts
        [33], this is referred to as

        vector concatenation-based attention, where the concatenated features xl
        and gare linearly mapped

        to a RFint dimensional intermediate space. In image captioning [1] and
        classification [11] tasks, the


        Figure 3(a): From left to right (a-e, f-j): Axial and sagittal views of
        a

        3D abdominal CT scan, attention coefficients, feature activations of

        a skip connection before and after gating. Similarly, (k-n) visualise

        the gating on a coarse scale skip connection. The filtered feature

        activations (d-e, i-j) are collected from multiple AGs, where a subset

        of organs is selected by each gate. Activations shown in (d-e, i-j)

        consistently correspond to specific structures across different scans.

        Figure 3(b): The ground-truth pancreas

        segmentation (a) is highlighted in blue

        (b). Similarly, U-Net model prediction

        (c) and the predictions obtained with Attention U-Net (d) are shown. The
        missed

        dense predictions by U-Net are highlighted with red arrows.

        softmax activation function is used to normalise the attention
        coefficients (σ2); however, sequential

        use of softmax yields sparser activations at the output. For this
        reason, we choose a sigmoid activation

        function. This results experimentally in better training convergence for
        the AG parameters. In contrast

        to [11] we propose a grid-attention technique. In this case, gating
        signal is not a global single vector

        for all image pixels but a grid signal conditioned to image spatial
        information. More importantly, the

        gating signal for each skip connection aggregates information from
        multiple imaging scales, as shown

        in Figure 1, which increases the grid-resolution of the query signal and
        achieves better performance.

        Lastly, we would like to note that AG parameters can be trained with the
        standard back-propagation

        updates without a need for sampling based update methods used in
        hard-attention [21].
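
        To make the gating concrete, a minimal PyTorch sketch of an additive
        attention gate along the lines of Eqs. (1)-(2) follows; the layer names,
        channel sizes and the stride-2 downsampling are illustrative assumptions
        rather than the authors' released implementation:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class AttentionGate(nn.Module):
            # Additive attention gate: W_x, W_g and psi are channel-wise 1x1x1
            # convolutions; x is brought to the (coarser) resolution of the gating
            # signal g, and the attention coefficients are resampled back to x's
            # grid with trilinear interpolation before the element-wise gating.
            def __init__(self, f_l, f_g, f_int):
                super().__init__()
                self.w_x = nn.Conv3d(f_l, f_int, kernel_size=1, stride=2, bias=False)
                self.w_g = nn.Conv3d(f_g, f_int, kernel_size=1)
                self.psi = nn.Conv3d(f_int, 1, kernel_size=1)

            def forward(self, x, g):
                q = F.relu(self.w_x(x) + self.w_g(g))            # cf. Eq. (1)
                alpha = torch.sigmoid(self.psi(q))               # cf. Eq. (2)
                alpha = F.interpolate(alpha, size=x.shape[2:], mode="trilinear",
                                      align_corners=False)
                return x * alpha                                 # gated skip features

        gate = AttentionGate(f_l=64, f_g=128, f_int=32)
        out = gate(torch.randn(1, 64, 16, 16, 16), torch.randn(1, 128, 8, 8, 8))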

        Attention Gates in U-Net Model:The proposed AGs are incorporated into
        the standard U-Net

        architecture to highlight salient features that are passed through the
        skip connections, see Figure

        1. Information extracted from coarse scale is used in gating to
        disambiguate irrelevant and noisy

        responses in skip connections. This is performed right before the
        concatenation operation to merge

        only relevant activations. Additionally, AGs filter the neuron
        activations during the forward pass as

        well as during the backward pass. Gradients originating from background
        regions are down weighted

        during the backward pass. This allows model parameters in shallower
        layers to be updated mostly

        based on spatial regions that are relevant to a given task. The update
        rule for convolution parameters

        in layer l−1 can be formulated as follows:

        ∂(x̂^l_i) / ∂(Φ^{l−1}) = ∂( α^l_i f(x^{l−1}_i; Φ^{l−1}) ) / ∂(Φ^{l−1})
        = α^l_i ∂( f(x^{l−1}_i; Φ^{l−1}) ) / ∂(Φ^{l−1}) + ∂(α^l_i) / ∂(Φ^{l−1}) x^l_i   (3)

        The first gradient term on the right-hand side is scaled with α^l_i. In
        case of multi-dimensional AGs, α^l_i

        corresponds to a vector at each grid scale. In each sub-AG,
        complementary information is extracted

        and fused to define the output of skip connection. To reduce the number
        of trainable parameters

        and computational complexity of AGs, the linear transformations are
        performed without any spatial

        support (1x1x1 convolutions) and input feature-maps are downsampled to
        the resolution of gating

        signal, similar to non-local blocks [ 33]. The corresponding linear
        transformations decouple the

        feature-maps and map them to lower dimensional space for the gating
        operation. As suggested in

        [11], low-level feature-maps, i.e. the first skip connections, are not
        used in the gating function since

        they do not represent the input data in a high dimensional space. We use
        deep-supervision [16] to

        force the intermediate feature-maps to be semantically discriminative at
        each image scale. This helps

        to ensure that attention units, at different scales, have an ability to
        influence the responses to a large

        range of image foreground content. We therefore prevent dense
        predictions from being reconstructed

        from small subsets of skip connections.


        [4 Discussion And Conclusion]

        In this paper, we presented a novel attention gate model applied to
        medical image segmentation. Our

        approach eliminates the necessity of applying an external object
        localisation model. The proposed

        approach is generic and modular as such it can be easily applied to
        image classification and regression

        problems as in the examples of natural image analysis and machine
        translation. Experimental

        results demonstrate that the proposed AGs are highly beneficial for
        tissue/organ identification and

        localisation. This is particularly true for variable small size organs
        such as the pancreas, and similar

        behaviour is expected for global classification tasks.

        Training behaviour of the AGs can benefit from transfer learning and
        multi-stage training schemes.

        For instance, pre-trained U-Net weights can be used to initialise the
        attention network, and gates can

        be trained accordingly in the fine-tuning stage. Similarly, there is a
        vast body of literature in machine

        learning exploring different gating architectures. For example, highway
        networks [7] make use of

        residual connections around the gate block to allow better gradient
        backpropagation and slightly

        softer attention mechanisms. Although our experiments with residual
        connections have not provided

        any significant performance improvement, future research will focus on
        this aspect to obtain a better

        training behaviour. Lastly, we note that with the advent of improved GPU
        computation power and

        memory, larger capacity 3D models can be trained with larger batch sizes
        without the need for image

        downsampling. In this way, we would not need to utilise ad-hoc
        post-processing techniques to further

        improve the state-of-the-art results. Similarly, the performance of
        Attention U-Net can be further

        enhanced by utilising fine resolution input batches without additional
        heuristics. Lastly, we would

        like to thank to Salim Arslan and Dan Busbridge for their helpful
        comments on this work.
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model finetuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-MiniLM-L6-v2
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
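
If helpful, these properties can also be checked programmatically after loading the model; the snippet below is a small illustrative check, assuming the model id shown in the Usage section further down:

from sentence_transformers import SentenceTransformer

# Illustrative check of the properties listed above (model id taken from the Usage section).
model = SentenceTransformer("Pravallika6/cross-domain-embeddings")
print(model.max_seq_length)                      # 256; longer inputs are truncated
print(model.get_sentence_embedding_dimension())  # 384
print(model.similarity_fn_name)                  # "cosine"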

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
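
For reference, here is a rough, unofficial sketch of what these three modules compute, using the underlying transformers checkpoint directly: a transformer forward pass, mean pooling over token embeddings, then L2 normalization. The SentenceTransformer class performs these steps automatically; this assumes the repository exposes the underlying BertModel weights, as sentence-transformers checkpoints normally do:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Pravallika6/cross-domain-embeddings")
bert = AutoModel.from_pretrained("Pravallika6/cross-domain-embeddings")

sentences = ["An example sentence", "Another example sentence"]
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    token_embeddings = bert(**encoded).last_hidden_state  # (batch, seq_len, 384)

# Pooling: mean over tokens, ignoring padding positions via the attention mask.
mask = encoded["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Normalize: unit-length vectors, so cosine similarity reduces to a dot product.
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # torch.Size([2, 384])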

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Pravallika6/cross-domain-embeddings")
# Run inference
sentences = [
    'Are there any frameworks that adapt to different types of image segmentation tasks?',
    '[ABSTRACT]\nWe propose a novel attention gate (AG) model for medical imaging that automatically learns to focus on target structures of varying shapes and sizes. Models\ntrained with AGs implicitly learn to suppress irrelevant regions in an input image\nwhile highlighting salient features useful for a specific task. This enables us to\neliminate the necessity of using explicit external tissue/organ localisation modules\nof cascaded convolutional neural networks (CNNs). AGs can be easily integrated\ninto standard CNN architectures such as the U-Net model with minimal computational overhead while increasing the model sensitivity and prediction accuracy.\nThe proposed Attention U-Net architecture is evaluated on two large CT abdominal\ndatasets for multi-class image segmentation. Experimental results show that AGs\nconsistently improve the prediction performance of U-Net across different datasets\nand training sizes while preserving computational efficiency. The source code for\nthe proposed architecture is publicly available.\n\n[1 Introduction]\nAutomated medical image segmentation has been extensively studied in the image analysis community\ndue to the fact that manual, dense labelling of large amounts of medical images is a tedious and\nerror-prone task. Accurate and reliable solutions are desired to increase clinical work flow efficiency\nand support decision making through fast and automatic extraction of quantitative measurements.\nWith the advent of convolutional neural networks (CNNs), near-radiologist level performance can\nbe achieved in automated medical image analysis tasks including cardiac MR segmentation [3] and\ncancerous lung nodule detection [17]. High representation power, fast inference, and filter sharing\nproperties have made CNNs the de facto standard for image segmentation. Fully convolutional\nnetworks (FCNs) [ 18] and the U-Net [ 24] are two commonly used architectures. Despite their\ngood representational power, these architectures rely on multi-stage cascaded CNNs when the target\norgans show large inter-patient variation in terms of shape and size. Cascaded frameworks extract a\nregion of interest (ROI) and make dense predictions on that particular ROI. The application areas\ninclude cardiac MRI [ 14], cardiac CT [ 23], abdominal CT [ 26, 27] segmentation, and lung CT\nnodule detection [17]. However, this approach leads to excessive and redundant use of computational\nresources and model parameters; for instance, similar low-level features are repeatedly extracted by\nall models within the cascade. To address this general problem, we propose a simple and yet effective\nsolution, namely attention gates(AGs). CNN models with AGs can be trained from scratch in a\nstandard way similar to the training of a FCN model, and AGs automatically learn to focus on target\n1st Conference on Medical Imaging with Deep Learning (MIDL 2018), Amsterdam, The Netherlands.\narXiv:1804.03999v3  [cs.CV]  20 May 2018\nstructures without additional supervision. At test time, these gates generate soft region proposals\nimplicitly on-the-fly and highlight salient features useful for a specific task. Moreover, they do not\nintroduce significant computational overhead and do not require a large number of model parameters\nas in the case of multi-model frameworks. In return, the proposed AGs improve model sensitivity and\naccuracy for dense label predictions by suppressing feature activations in irrelevant regions. 
In this\nway, the necessity of using an external organ localisation model can be eliminated while maintaining\nthe high prediction accuracy. Similar attention mechanisms have been proposed for natural image\nclassification [11] and captioning [1] to perform adaptive feature pooling, where model predictions\nare conditioned only on a subset of selected image regions. In this paper, we generalise this design\nand propose image-grid based gating that allows attention coefficients to be specific to local regions.\nMoreover, our approach can be used for attention-based dense predictions.\nWe demonstrate the implementation of AG in a standard U-Net architecture (Attention U-Net) and\napply it to medical images. We choose the challenging CT pancreas segmentation problem to provide\nexperimental evidence for our proposed contributions. This problem constitutes a difficult task due to\nlow tissue contrast and large variability in organ shape and size. We evaluate our implementation on\ntwo commonly used benchmarks: TCIA Pancreas CT-82 [25] and multi-class abdominal CT-150.\nThe results show that AGs consistenly improve prediction accuracy across different datasets and\ntraining sizes while achieving state-of-the-art performance without requiring multiple CNN models.\n\n[2 Methodology]\nFully Convolutional Network (FCN):Convolutional neural networks (CNNs) outperform traditional approaches in medical image analysis on public benchmark datasets [14, 17] while being an\norder of magnitude faster than, e.g., graph-cut and multi-atlas segmentation techniques [34]. This\nis mainly attributed to the fact that (I) domain specific image features are learnt using stochastic\ngradient descent (SGD) optimisation, (II) learnt kernels are shared across all pixels, and (III) image\nconvolution operations exploit the structural information in medical images well. In particular, fully\nconvolutional networks (FCN) [ 18] such as U-Net [ 24], DeepMedic [ 13] and holistically nested\nnetworks [16, 35] have been shown to achieve robust and accurate performance in various tasks\nincluding cardiac MR [3], brain tumours [12] and abdominal CT [26, 27] image segmentation tasks.\nConvolutional layers progressively extract higher dimensional image representations ( xl) by processing local information layer by layer. Eventually, this separates pixels in a high dimensional\nspace according to their semantics. Through this sequential process, model predictions are conditioned on information collected from a large receptive field. Hence, feature-map xl is obtained\nat the output of layer lby sequentially applying a linear transformation followed by a non-linear\nactivation function. It is often chosen as rectified linear unit: σ1 ( xl\ni,c) = max(0 ,xl\ni,c) where iand\nc denote spatial and channel dimensions respectively. Feature activations can be formulated as:\nxl\nc = σ1\n(∑\nc′∈Fl\nxl−1\nc′ ∗kc′,c\n)\nwhere ∗denotes the convolution operation, and the spatial subscript\n(i) is omitted in the formulation for notational clarity. The function f( xl; Φ l) = x(l+1) applied in\nconvolution layer lis characterised by trainable kernel parameters Φ l. The parameters are learnt by\n\nAttention GateLorem ipsum dolor sit amet,consectetur adipisicing elit, seddo eiusmod tempor incididunt utlabore et dolore magna aliqua.\n x x x \n x x x \nReLU \nx \n \nSigmoid Resampler\nx x \nFigure 2: Schematic of the proposed additive attention gate (AG). Input features ( xl) are scaled\nwith attention coefficients (α) computed in AG. 
Spatial regions are selected by analysing both the\nactivations and contextual information provided by the gating signal (g) which is collected from a\ncoarser scale. Grid resampling of attention coefficients is done using trilinear interpolation.\nminimising a training objective, e.g. cross-entropy loss, using stochastic gradient descent (SGD).\nIn this paper, we build our attention model on top of a standard U-Net architecture. U-Nets are\ncommonly used for image segmentation tasks because of their good performance and efficient use\nof GPU memory. The latter advantage is mainly linked to extraction of image features at multiple\nimage scales. Coarse feature-maps capture contextual information and highlight the category and\nlocation of foreground objects. Feature-maps extracted at multiple scales are later merged through\nskip connections to combine coarse- and fine-level dense predictions as shown in Figure 1.\nAttention Gates for Image Analysis:To capture a sufficiently large receptive field and thus, semantic contextual information, the feature-map grid is gradually downsampled in standard CNN\narchitectures. In this way, features on the coarse spatial grid level model location and relationship\nbetween tissues at global scale. However, it remains difficult to reduce false-positive predictions for\nsmall objects that show large shape variability. In order to improve the accuracy, current segmentation\nframeworks [14, 26, 27] rely on additional preceding object localisation models to simplify the task\ninto separate localisation and subsequent segmentation steps. Here, we demonstrate that the same\nobjective can be achieved by integrating attention gates (AGs) in a standard CNN model. This does\nnot require the training of multiple models and a large number of extra model parameters. In contrast\nto the localisation model in multi-stage CNNs, AGs progressively suppress feature responses in\nirrelevant background regions without the requirement to crop a ROI between networks.\nAttention coefficients, αi ∈[0,1], identify salient image regions and prune feature responses to\npreserve only the activations relevant to the specific task as shown in Figure 3a. The output of AGs is\nthe element-wise multiplication of input feature-maps and attention coefficients: ˆxl\ni,c = xl\ni,c ·αl\ni. In\na default setting, a single scalar attention value is computed for each pixel vector xl\ni ∈RFl where\nFl corresponds to the number of feature-maps in layer l. In case of multiple semantic classes, we\npropose to learn multi-dimensional attention coefficients. This is inspired by [ 29], where multidimensional attention coefficients are used to learn sentence embeddings. Thus, each AG learns to\nfocus on a subset of target structures. As shown in Figure 2, a gating vector gi ∈RFg is used for\neach pixel ito determine focus regions. The gating vector contains contextual information to prune\nlower-level feature responses as suggested in [32], which uses AGs for natural image classification.\nWe use additive attention [2] to obtain the gating coefficient. Although this is computationally more\nexpensive, it has experimentally shown to achieve higher accuracy than multiplicative attention [19].\nAdditive attention is formulated as follows:\nql\natt = ψT (\nσ1 ( WT\nx xl\ni + WT\ng gi + bg)\n)\n+ bψ (1)\nαl\ni = σ2( ql\natt(xl\ni, gi; Θatt) ), (2)\nwhere σ2(xi,c) = 1\n1+exp(−xi,c) correspond to sigmoid activation function. 
AG is characterised\nby a set of parameters Θatt containing: linear transformations Wx ∈RFl×Fint, Wg ∈RFg×Fint,\nψ ∈RFint×1 and bias terms bψ ∈R , bg ∈RFint. The linear transformations are computed using\nchannel-wise 1x1x1 convolutions for the input tensors. In other contexts [33], this is referred to as\nvector concatenation-based attention, where the concatenated features xl and gare linearly mapped\nto a RFint dimensional intermediate space. In image captioning [1] and classification [11] tasks, the\n\nFigure 3(a): From left to right (a-e, f-j): Axial and sagittal views of a\n3D abdominal CT scan, attention coefficients, feature activations of\na skip connection before and after gating. Similarly, (k-n) visualise\nthe gating on a coarse scale skip connection. The filtered feature\nactivations (d-e, i-j) are collected from multiple AGs, where a subset\nof organs is selected by each gate. Activations shown in (d-e, i-j)\nconsistently correspond to specific structures across different scans.\nFigure 3(b): The ground-truth pancreas\nsegmentation (a) is highlighted in blue\n(b). Similarly, U-Net model prediction\n(c) and the predictions obtained with Attention U-Net (d) are shown. The missed\ndense predictions by U-Net are highlighted with red arrows.\nsoftmax activation function is used to normalise the attention coefficients (σ2); however, sequential\nuse of softmax yields sparser activations at the output. For this reason, we choose a sigmoid activation\nfunction. This results experimentally in better training convergence for the AG parameters. In contrast\nto [11] we propose a grid-attention technique. In this case, gating signal is not a global single vector\nfor all image pixels but a grid signal conditioned to image spatial information. More importantly, the\ngating signal for each skip connection aggregates information from multiple imaging scales, as shown\nin Figure 1, which increases the grid-resolution of the query signal and achieve better performance.\nLastly, we would like to note that AG parameters can be trained with the standard back-propagation\nupdates without a need for sampling based update methods used in hard-attention [21].\nAttention Gates in U-Net Model:The proposed AGs are incorporated into the standard U-Net\narchitecture to highlight salient features that are passed through the skip connections, see Figure\n1. Information extracted from coarse scale is used in gating to disambiguate irrelevant and noisy\nresponses in skip connections. This is performed right before the concatenation operation to merge\nonly relevant activations. Additionally, AGs filter the neuron activations during the forward pass as\nwell as during the backward pass. Gradients originating from background regions are down weighted\nduring the backward pass. This allows model parameters in shallower layers to be updated mostly\nbased on spatial regions that are relevant to a given task. The update rule for convolution parameters\nin layer l−1 can be formulated as follows:\n∂(ˆxl\ni)\n∂(Φl−1) = ∂\n(\nαl\nif(xl−1\ni ; Φl−1)\n)\n∂(Φl−1) = αl\ni\n∂(f(xl−1\ni ; Φl−1))\n∂(Φl−1) + ∂(αl\ni)\n∂(Φl−1) xl\ni (3)\nThe first gradient term on the right-hand side is scaled with αl\ni. In case of multi-dimensional AGs, αl\ni\ncorresponds to a vector at each grid scale. In each sub-AG, complementary information is extracted\nand fused to define the output of skip connection. 
To reduce the number of trainable parameters\nand computational complexity of AGs, the linear transformations are performed without any spatial\nsupport (1x1x1 convolutions) and input feature-maps are downsampled to the resolution of gating\nsignal, similar to non-local blocks [ 33]. The corresponding linear transformations decouple the\nfeature-maps and map them to lower dimensional space for the gating operation. As suggested in\n[11], low-level feature-maps, i.e. the first skip connections, are not used in the gating function since\nthey do not represent the input data in a high dimensional space. We use deep-supervision [16] to\nforce the intermediate feature-maps to be semantically discriminative at each image scale. This helps\nto ensure that attention units, at different scales, have an ability to influence the responses to a large\nrange of image foreground content. We therefore prevent dense predictions from being reconstructed\nfrom small subsets of skip connections.\n\n[4 Discussion And Conclusion]\nIn this paper, we presented a novel attention gate model applied to medical image segmentation. Our\napproach eliminates the necessity of applying an external object localisation model. The proposed\napproach is generic and modular as such it can be easily applied to image classification and regression\nproblems as in the examples of natural image analysis and machine translation. Experimental\nresults demonstrate that the proposed AGs are highly beneficial for tissue/organ identification and\nlocalisation. This is particularly true for variable small size organs such as the pancreas, and similar\nbehaviour is expected for global classification tasks.\nTraining behaviour of the AGs can benefit from transfer learning and multi-stage training schemes.\nFor instance, pre-trained U-Net weights can be used to initialise the attention network, and gates can\nbe trained accordingly in the fine-tuning stage. Similarly, there is a vast body of literature in machine\nlearning exploring different gating architectures. For example, highway networks [7] make use of\nresidual connections around the gate block to allow better gradient backpropagation and slightly\nsofter attention mechanisms. Although our experiments with residual connections have not provided\nany significant performance improvement, future research will focus on this aspect to obtain a better\ntraining behaviour. Lastly, we note that with the advent of improved GPU computation power and\nmemory, larger capacity 3D models can be trained with larger batch sizes without the need for image\ndownsampling. In this way, we would not need to utilise ad-hoc post-processing techniques to further\nimprove the state-of-the-art results. Similarly, the performance of Attention U-Net can be further\nenhanced by utilising fine resolution input batches without additional heuristics. Lastly, we would\nlike to thank to Salim Arslan and Dan Busbridge for their helpful comments on this work.',
    '[Preamble]\nImproving language models by retrieving\nfrom trillions of tokens\nSebastian Borgeaudy, Arthur Menschy, Jordan Hoffmanny, Trevor Cai, Eliza Rutherford, Katie Millican,\nGeorge van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas,\nAurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones,\nAlbin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero,\nKaren Simonyan, Jack W. Raez, Erich Elsenzand Laurent Sifrey,z\nAll authors from DeepMind,yEqual contributions,zEqual senior authorship\nWe enhance auto-regressive language models by conditioning on document chunks retrieved from a\nlarge corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our\nRetrieval-Enhanced Transformer (R/e.sc/t.sc/r.sc/o.sc) obtains comparable performance to GPT-3 and Jurassic-1\non the Pile, despite using 25\x02fewer parameters. After fine-tuning,R/e.sc/t.sc/r.sc/o.scperformance translates to\ndownstream knowledge-intensive tasks such as question answering.R/e.sc/t.sc/r.sc/o.sccombines a frozenB/e.sc/r.sc/t.sc\nretriever, adifferentiableencoderandachunkedcross-attentionmechanismtopredicttokensbasedon\nan order of magnitude more data than what is typically consumed during training. We typically train\nR/e.sc/t.sc/r.sc/o.scfrom scratch, yet can also rapidlyR/e.sc/t.sc/r.sc/o.scfit pre-trained transformers with retrieval and still\nachieve good performance. Our work opens up new avenues for improving language models through\nexplicit memory at unprecedented scale.\n\n[1. Introduction]\nLanguage modelling (LM) is an unsupervised task that consists of modelling the probability of text,\nusually by factorising it into conditional next-token predictions𝑝¹𝑥1\x94\x93\x93\x93\x94𝑥 𝑛º= Î\n𝑖 𝑝¹𝑥𝑖j𝑥\x9d𝑖º. Neural\nnetworks have proven to be powerful language models, first in the form of recurrent architectures\n(Graves, 2013; Jozefowicz et al., 2016; Mikolov et al., 2010) and more recently in the form of\nTransformers (Vaswani et al., 2017), that use attention to contextualise the past. Large performance\nimprovementshavecomefromincreasingtheamountofdata, trainingcompute, ormodelparameters.\nTransformers have been scaled from100 million parameter models in seminal work to over hundred\nbillion parameters (Brown et al., 2020; Radford et al., 2019) in the last two years which has led to\nmodels that do very well on a wide array of tasks in a zero or few-shot formulation. Increasing model\nsize predictably improves performance on a wide range of downstream tasks (Kaplan et al., 2020).\nThe benefits of increasing the number of parameters come from two factors: additional computations\nat training and inference time, and increased memorization of the training data.\nIn this work, we endeavor to decouple these, by exploring efficient means of augmenting language\nmodels with a massive-scale memory without significantly increasing computations. Specifically, we\nsuggest retrieval from a large text database as a complementary path to scaling language models.\nInstead of increasing the size of the model and training on more data, we equip models with the\nability to directly access a large database to perform predictions—a semi-parametric approach. At\na high level, our Retrieval Transformer (R/e.sc/t.sc/r.sc/o.sc) model splits the input sequence into chunks and\nretrieves text similar to the previous chunk to improve the predictions in the current chunk. 
Existing\nretrieval for language modelling work only considers small transformers (100 millions parameters)\nand databases of limited size (up to billions of tokens) (Guu et al., 2020; Khandelwal et al., 2020;\nLewisetal.,2020;Yogatamaetal.,2021). Toourknowledge, ourworkisthefirsttoshowthebenefits\nof scaling the retrieval database to trillions of tokens for large parametric language models. Our main\nCorresponding authors: {sborgeaud|amensch|jordanhoffmann|sifre}@deepmind.com\narXiv:2112.04426v3  [cs.CL]  7 Feb 2022\nImproving language models by retrieving from trillions of tokens\n200 400 800 1600 7500\nNumber of Non-Embedding Params (M)\n0.7\n0.8\n0.9\n1.0C4 Eval bits-per-byte\n172M 425M 1.5B 7.5B Baseline RETRO [OFF] RETRO [ON]\n0 1 10 100 1000 10000\nRetrieval dataset (B Tokens)\n0.7\n0.8\n0.9\n1.0\n0 1 3 5 10 30 50 100\nNumber of neighbors\n0.7\n0.8\n0.9\n1.0\nFigure 1jScaling ofR/e.sc/t.sc/r.sc/o.sc. The performance gain of our retrieval models remains constant with\nmodel scale (left), and is comparable to multiplying the parameteric model size by\x1810\x02. The gain\nincreases with the size of the retrieval database (middle) and the number of retrieved neighbours\n(right) on the C4 validation set, when using up to 40 neighbours. Past this, performance begins to\ndegrade, perhaps due to the reduced quality. At evaluationR/e.sc/t.sc/r.sc/o.sccan be used without retrieval\ndata (R/e.sc/t.sc/r.sc/o.sc[OFF]), bringing limited performance degradation compared to baseline transformers.\ncontributions are the following.\n• We introduceR/e.sc/t.sc/r.sc/o.sc, a retrieval-enhanced autoregressive language model (§2.2). We use a\nchunked cross-attention module to incorporate the retrieved text (§2.4), with time complexity\nlinear in the amount of retrieved data. We show that retrieving based on a pre-trained frozen\nB/e.sc/r.sc/t.scmodel (§2.3) works at scale, removing the need for training and updating a retriever\nnetwork.\n• We show that our method scales well with model size and database size (Fig. 1):R/e.sc/t.sc/r.sc/o.sc\nprovides a constant gain for models ranging from 150M to 7B parameters, andR/e.sc/t.sc/r.sc/o.sccan be\nimproved at evaluation time by increasing the database size and the number of retrieved neighbours. Our largest model obtains state-of-the-art results on a range of downstream evaluation\ndatasets including Wikitext103 (Merity et al., 2017) and the Pile (Gao et al., 2020) (§4). We\nshow thatR/e.sc/t.sc/r.sc/o.sccan be fine-tuned to achieve competitive performance on downstream tasks\nsuch as question answering (§4.3).\n• We propose an evaluation aware of proximity of test documents with the training set (§2.6),\naddressing the problem of test set leakage (Lee et al., 2021). This is relevant for all language\nmodels,andespeciallyforretrieval-enhancedmodelssincetheyhavedirectaccesstothetraining\ndataset during evaluation. Using this methodology, we show that the performance ofR/e.sc/t.sc/r.sc/o.sc\ncomes from both explicit neighbour copying and general knowledge extraction (§4.4).\n\n[2. Method]\nWedesignourretrieval-enhancedarchitecturetobecapableofretrievingfromadatabasewithtrillions\nof tokens. For this purpose, we retrieve at the level of contiguous tokenchunks instead of individual\ntokens which reduces storage and computation requirements by a large linear factor. Our method first\nconstructs a key-value database, where values store raw chunks of text tokens and keys are frozen\nB/e.sc/r.sc/t.scembedddings (Devlin et al., 2019). 
We use a frozen model to avoid having to periodically\nre-compute embeddings over the entire database during training. Each training sequence is then split\ninto chunks, which are augmented with their𝑘-nearest neighbour retrieved from the database. An\nencoder-decoder architecture integrates retrieval chunks into the model’s predictions. We summarize\nthe R/e.sc/t.sc/r.sc/o.scarchitecture in Fig. 2, and detail it in this section. We end the section by introducing\n\nImproving language models by retrieving from trillions of tokens\nCCA FFW\nTransformer \nEncoderRetrieval\ndataset\nFrozen kNN Retriever\nK V\nRETRO block (x L) \nNeighbours\nInput \ntokens\nChunked cross-attention (CCA)\nBERT\nBERT\nCondition\nAttending chunks\nEncoded neighbours\nCA\nCA\nATTN QEMB READ\nAttend\nEncoded neighbours\nC1\nC2\nC3\nH1\nH2\nH3\nH\nH1\n+\nH2\n+\nE1\n E2\nE1\nE2\nCA(H1\n+, E1)\nCA(H2\n+, E2)\nCCA(H, E)\nX\nFigure 2jR/e.sc/t.sc/r.sc/o.scarchitecture. Left: simplified version where a sequence of length𝑛= 12 is split\ninto𝑙 = 3 chunksofsize 𝑚 = 4. Foreachchunk, weretrieve 𝑘 = 2 neighboursof 𝑟 = 5 tokenseach. The\nretrieval pathway is shown on top.Right: Details of the interactions in theC/c.sc/a.scoperator. Causality is\nmaintained as neighbours of the first chunk only affect the last token of the first chunk and tokens\nfrom the second chunk.\na new methodology to evaluate language models when an evaluation set is partially present in the\ntraining set.\n\n[2. The Model Receives The Corresponding]\nvalues R/e.sc/t.sc¹𝐶º, ¹»𝑁1\x94𝐹1¼\x94\x93\x93\x93\x94 »𝑁𝑘\x94𝐹𝑘¼º. Both neighbour chunks and their continuations provide\nmeaningful improvements, as illustrated in our ablation study (Appendix D). We use a length64 for\nboth 𝑁𝑗 and 𝐹𝑗, thusR/e.sc/t.sc¹𝐶ºhas a shape of𝑘\x02𝑟 with 𝑟 = 128. To avoid retrieving the chunk𝐶𝑢¸1\nin the retrieval setR/e.sc/t.sc¹𝐶𝑢º, which would break causality during training, we filter out neighbours\noriginating from the same document as the training sequence𝑋.\nFor a database of𝑇 elements, we can query the approximate nearest neighbours inO¹log𝑇ºtime.\nWe use the SCaNN library (Guo et al., 2020) to achieve this. This means that we can query our\n2 trillion token database in10 ms whilst evaluating or sampling from the model; this expense is\namortized over a chunk length. Performing retrieval on-the-fly is too slow to keep up with the training\ncalculations—we leverage the frozen aspect of the embedding operatorB/e.sc/r.sc/t.scto precompute all\napproximate nearest neighbours and save the results as part of the data. In Fig. 9 in the Appendix, we\nshow results where we only retrieve neighbours within Wikipedia. We find that neighbours tend to\ncome from 2-3 links away from a given article whereas random articles are more than 5 links apart.\nTable1 jMassiveText. Thelastcolumnindicatesthesamplingweightduringtraining. Themultilingual\nsubsets include documents in 10 languages. The full breakdown is given in §A.1.\nSource Token count (M) Documents (M) Multilingual Sampling frequency\nWeb 977,563 1,208 Yes 55%\nBooks 3,423,740 20 No 25%\nNews 236,918 398 No 10%\nWikipedia 13,288 23 Yes 5%\nGitHub 374,952 143 No 5%\n\nImproving language models by retrieving from trillions of tokens\n2.4. R/e.sc/t.sc/r.sc/o.scmodel architecture\nOur model relies on an encoder-decoder transformer architecture, integrating the retrieved data\nthrough a cross-attention mechanism as introduced in Vaswani et al. (2017). 
First, the retrieved\ntokens R/e.sc/t.sc¹𝐶ºare fed into an encoder Transformer, which computes the encoded neighbours set𝐸.\nDenoting the intermediate activations by𝐻, our transformer decoder then interleavesR/e.sc/t.sc/r.sc/o.sc-blocks\nR/e.sc/t.sc/r.sc/o.sc¹𝐻\x94𝐸 ºand standard Transformer blocksLM ¹𝐻º(the hyperparameter𝑃 \x12»1\x94𝐿¼determines at\nwhich layers we use aR/e.sc/t.sc/r.sc/o.sc-block). These blocks are built from three different residual operators\nwith signatureℝ𝑛\x02𝑑 !ℝ𝑛\x02𝑑: a fully-connected layerF/f.sc/w.sc, the standard sequence-level self-attention\nlayer A/t.sc/t.sc/n.sc, and a chunked cross-attention layerC/c.sc/a.sc¹\x01\x94𝐸ºthat incorporates information from the\nretrieval encoder:\nR/e.sc/t.sc/r.sc/o.sc¹𝐻\x94𝐸 º, F/f.sc/w.sc¹C/c.sc/a.sc¹A/t.sc/t.sc/n.sc¹𝐻º\x94𝐸ºº\x94 and L/m.sc¹𝐻º, F/f.sc/w.sc¹A/t.sc/t.sc/n.sc¹𝐻ºº (2)\nSince F/f.sc/w.sc, A/t.sc/t.sc/n.scand C/c.sc/a.scare all autoregressive operators whose output at position𝑖 only\ndepends on ¹ℎ𝑗º𝑗6𝑖, any succession ofR/e.sc/t.sc/r.sc/o.scand /l.sc/m.sclayers, followed by a token classification\nhead defines an autoregressive log-likelihood(1). An overview of the model architecture is given in\nAlgorithm 1 and in Fig. 2. We next describe the retrieval encoder and the chunked cross-attention\nlayer in more detail, and explain how to sample fromR/e.sc/t.sc/r.sc/o.sc.\nEncodingretrievalneighbours. Foreachchunk 𝐶𝑢,the 𝑘retrievalneighbours R/e.sc/t.sc¹𝐶𝑢ºarefedinto\na bi-directional transformerE/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc, yielding the outputs𝐸𝑗\n𝑢 , E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º𝑗\x94𝐻𝑢º2 ℝ𝑟\x02𝑑0\n,\nwhere 𝑗 2 »1\x94𝑘¼indexes each neighbour. The retrieval encoder is a non-causal transformer. It\nis conditioned on𝐻𝑢, the activations of chunk𝐶𝑢, through cross-attention layers; this allows the\nrepresentations of the retrieval encoder to be modulated by the retrieving chunk in a differentiable\nway. More precisely, the encoding of the𝑗th neighbour of the𝑢th chunk, R/e.sc/t.sc¹𝐶𝑢º𝑗, depends on the\nattended activation 𝐻𝑢 , ¹ℎ¹𝑢\x001º𝑚¸𝑖º𝑖2»1\x94𝑚¼ 2ℝ𝑚\x02𝑑 of chunk𝐶𝑢 at layermin¹𝑃º. All neighbours for\nall chunks are encoded in parallel, yielding a full encoded set𝐸 , ¹𝐸𝑗\n𝑢º𝑢2»1\x94𝑙¼\x94𝑗2»1\x94𝑘¼ 2ℝ𝑙\x02𝑘\x02𝑟\x02𝑑0\n. We\ndenote 𝐸𝑢 2ℝ𝑘\x02𝑟\x02𝑑0\nas the encoded neighbours for chunk𝑢 2»1\x94𝑙¼.\nChunked cross-attention. To perform theC/c.sc/a.scoperation, we first split a given intermediate activation 𝐻 2ℝ𝑛\x02𝑑 into 𝑙\x001 attending chunks\n\x10\n𝐻¸\n𝑢 , ¹ℎ𝑢𝑚¸𝑖\x001º𝑖2»1\x94𝑚¼ 2ℝ𝑚\x02𝑑\n\x11\n𝑢2»1\x94𝑙\x001¼\n, as depicted on the\nright of Fig. 2.𝐻¸\n𝑢 holds the intermediary embeddings of the last token in chunk𝐶𝑢 and of the first\n𝑚\x001 tokens in𝐶𝑢¸1 2. We compute the cross-attention between𝐻¸\n𝑢 and 𝐸𝑢—the encoded retrieval\nset obtained from chunk𝐶𝑢. Attention is computed across time and across neighbours simultaneously,\nas we merge the neighbour and time dimensions of𝐸𝑢 before applying cross-attention. Since there\nis a notion of alignment between data chunks and retrieval neighbours, we use relative positional\nencodings as described in §B.1.2.\nWe concatenate the𝑙\x001 outputs of the per-chunk cross-attentions (each of shape𝑚\x02𝑑) across\ntime, and properly pad the result; we thus form the output activationC/c.sc/a.sc¹𝐻\x94𝐸 º2 ℝ𝑛\x02𝑑. 
Formally,\nfor each chunk𝐶𝑢 and for each token𝑖 2»1\x94𝑚¼we set\nC/c.sc/a.sc¹𝐻\x94𝐸 º𝑢𝑚¸𝑖\x001 , C/a.sc¹ℎ𝑢𝑚¸𝑖\x001\x94𝐸𝑢º\x94 (3)\n2The last token of chunk𝐶𝑢 is the first to be able to access the retrieved content𝐸𝑢 while maintaining autoregressivity\nin (1). Hence, there is a one token overlap between chunk𝐶𝑢 =\n\x10\n𝑥¹𝑢\x001º𝑚¸𝑖\n\x11\n𝑖2»1\x94𝑚¼\nand the corresponding attending chunk\n𝐶¸\n𝑢 , ¹𝑥𝑢𝑚¸𝑖\x001º𝑖2»1\x94𝑚¼.\n\nImproving language models by retrieving from trillions of tokens\nAlgorithm 1: Overview ofR/e.sc/t.sc/r.sc/o.scmodel architecture.\nHyperparam: 𝑃 and 𝑃enc, indices of layers with cross-attention in the decoder and encoder\nrespectively\nHyperparam: 𝐿and 𝐿enc, number of decoder layers and number of encoder layers.\nInput: 𝑋 2𝕍𝑛: sequence of tokens.¹R/e.sc/t.sc¹𝐶𝑢ºº16𝑢6𝑙: the retrieved neighbours\nOutput: 𝑂 2ℝ𝑛\x02j𝕍j: the output logits\ndef E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º16𝑢6𝑙\x94𝐻 º:\n¹𝐻𝑢º𝑢2»1\x94𝑙¼ S/p.sc/l.sc/i.sc/t.sc¹𝐻º\nfor 𝑗 2»1\x94𝑘¼\x94𝑢 2»1\x94𝑙¼do // Encoder shared across neighbours and chunks\n𝐸𝑗\n𝑢 = E/m.sc/b.scenc¹R/e.sc/t.sc¹𝐶𝑢º𝑗º // May be shared with the decoder E M B\nfor 𝑝02»1\x94𝐿enc¼do\n𝐸𝑗\n𝑢  A/t.sc/t.sc/n.scenc¹𝐸𝑗\n𝑢º // Bi-directional attention\nif 𝑝02𝑃enc then\n𝐸𝑗\n𝑢  C/a.scenc¹𝐸𝑗\n𝑢\x94𝐻𝑢º\n𝐸𝑗\n𝑢  F/f.sc/w.scenc¹𝐸𝑗\n𝑢º\nreturn 𝐸\n𝐻  E/m.sc/b.sc¹𝑋º\nfor 𝑝 2»1\x94𝐿¼do\n𝐻  A/t.sc/t.sc/n.sc¹𝐻º // Causal attention\nif 𝑝= min¹𝑃ºthen\n// The neighbour E N C O D E Ris conditioned with the decoder activations of\nthe last layer before the first cross-attention\n𝐸 = E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º16𝑢6𝑙\x94𝐻 º\nif 𝑝 2𝑃 then\n𝐻  C/c.sc/a.sc¹𝐻\x94𝐸 º\n𝐻  F/f.sc/w.sc¹𝐻º\n𝑂 R/e.sc/a.sc/d.sc¹𝐻º\nwhere C/a.scis the cross-attention residual operator over time-concatenated encoded neighbours. We\nrecall that this operator is defined in its simplest version by three parameter matrices𝐾 2ℝ𝑑\x02𝑐\x94𝑄 2\nℝ𝑑\x02𝑐 and𝑉 2ℝ𝑑\x02𝑑. For allℎ 2ℝ𝑑 and𝑌 2ℝ𝑇\x02𝑑, we define\nC/a.sc¹ℎ\x94𝑌º, softmax¹𝑌𝐾𝑄𝑇ℎº𝑌𝑉\x94 (4)\nwhere the softmax is performed on the second dimension and all products are matrix products. We\nuse multi-head cross-attention, and add positional encodings to the softmax(see §B.1.2).\nThe first𝑚\x001 tokens cannot attend to any neighbour of a previous chunk; at these positions, we\ndefine C/c.sc/a.scas the identity, settingC/c.sc/a.sc¹𝐻\x94𝐸 º𝑗 , ℎ𝑗 for all tokens𝑗 2»1\x94𝑚 \x001¼. Finally, the last token\nℎ𝑙𝑚 attends to the last retrieval set𝐸𝑙 and we setℎ𝑙𝑚 , C/a.sc¹ℎ𝑙𝑚\x94𝐸𝑙º(not shown in Fig. 2). Listing 1\ncontains a simplified implementation ofC/c.sc/a.sc. Note that chunked cross-attention is autoregressive:\nthe output ofC/c.sc/a.scat position𝑖depends on the sequence from tokens from0 to 𝑖that is input toC/c.sc/a.sc.\nWith R/e.sc/t.sc/r.sc/o.scmodels, even though eachC/c.sc/a.sccross-attention attends only to the neighbours of\nthe preceding chunkR/e.sc/t.sc¹𝐶𝑢\x001º, the dependencies over previous neighbours are propagated via the\nself-attentionoperations. Theactivationsofthe 𝑖th tokeninthe 𝑢th chunkthereforepotentiallydepend\nupon the set ofallprevious neighboursR/e.sc/t.sc¹𝐶𝑢0º𝑢0\x9d𝑢, without incurring the quadratic cost of cross\nattending to that set.\n\nImproving language models by retrieving from trillions of tokens\nSampling. Whensampling,attheendofachunk 𝐶𝑢,weuseSCaNNtoretrieveneighbours R/e.sc/t.sc¹𝐶𝑢º,\nbased on the embeddingB/e.sc/r.sc/t.sc¹𝐶𝑢º. 
The encoded neighbours𝐸𝑢 = E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢ººare then\nused to condition the generation of the next chunk𝐶𝑢¸1, which we do incrementally: overall the\ncost of sampling is thus quadratic in the size of the sampled sequence, as when sampling from\nregular Transformers; the added cost of retrieval is linear in the number of chunks𝑙, and is negligible\ncompared to the token sampling cost in practice.\n\n[2.5. Baseline Transformer Architecture]\nWe use a transformer (Vaswani et al., 2017) similar to the one described in (Radford et al., 2019),\nwith some minimal changes: we replace LayerNorm with RMSNorm (Zhang and Sennrich, 2019) and\nuse relative position encodings (Dai et al., 2019). As baselines, we train retrieval-free transformers\nwith 132M, 368M, 1.3B and 7.0B parameters (embedding matrices are excluded from parameter\ncounts). The hyperparameters we used are detailed in Table 2. All retrieval models use the same\nsize encoder for the retrieval data, with𝑑0= 896 and 2 layers, which roughly adds19𝑀 parameters.\nThe encoder uses relative positional encodings. The retrieval models contain oneR/e.sc/t.sc/r.sc/o.sc-block every\n3 blocks, starting from layer 6. For our smallest model,C/c.sc/a.scis applied in layers 6, 9 and 12 of the\nmain pathway and also once for query conditioning in the encoder, which adds an additional12𝑀\nparameters. The relative number of extra parameters reduces as we increase the baseline model size.\nAll models are implemented using JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020).\n\n[5. Conclusion]\nWe present Retrieval-Enhanced Transformers (R/e.sc/t.sc/r.sc/o.sc), a method for modelling arbitrary text sequences whilst retrieving from databases with trillions of tokens—scaling the data available to models\nby an order of magnitude compared to what is typically consumed during training.R/e.sc/t.sc/r.sc/o.scmodels\n\nImproving language models by retrieving from trillions of tokens\ngains do not diminish for models with up to at least 7B parameters, and correspond to non-retrieval\nmodels with 10\x02more parameters on certain datasets. On Wikitext103 and the Pile,R/e.sc/t.sc/r.sc/o.scoutperforms previous models trained on large scale datasets. We also show thatR/e.sc/t.sc/r.sc/o.scis competitive on\nretrieval-intensive downstream tasks such as question answering.\nR/e.sc/t.sc/r.sc/o.scmodels are flexible and can be used without retrieval at evaluation and still achieve\ncomparable performance to baseline models. Conversely, baseline models can be rapidly fine-tuned\ninto R/e.sc/t.sc/r.sc/o.scmodelstoobtainnearlythesameperformanceasiftrainedfromscratch. Carefulanalysis\nshows that only a modest fraction of the gains obtained byR/e.sc/t.sc/r.sc/o.scare due to test set leakage. 
In\ngeneral, we caution for such leakage in large-scale language datasets and suggest further work in\nbetter understanding the role of test set leakage in the performance of large-scale language models.\nOverall, our work demonstrates at an unprecedented scale that semi-parametric approaches can\nprovide an orthogonal, more efficient approach than raw parameter scaling as we seek to build more\npowerful language models.\nAcknowledgements\nWe would like to thank Nikolai Grigorev, Marc’aurelio Ranzato, Cyprien de Masson d’Autume, Po-Sen\nHuang, JohannesWelbl, LisaAnneHendricks, EthanPerez, JeffStanway, EricNoland, GregoryWayne,\nJohn Jumper, Julian Schrittwieser, Lorrayne Bennett, Devang Agrawal, Dani Yogatama, Susannah\nYoung, Nando de Freitas, Demis Hassabis, and Koray Kavukcuoglu for their help, advice and reviews.\nAdditionally, we would like to thank Zonglin Li, David Simcha, and the ScaNN developers for their\nhelp.\n\nImproving language models by retrieving from trillions of tokens\nTable 6jSample - Beavers are interesting animals. TheR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]sample quickly diverges to other\nanimalswhilethe R/e.sc/t.sc/r.sc/o.sc[O/n.sc]sampletendstostayfocusedonthebeavertopicduetoneighbourconditioning.\nPrompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]Prompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/n.sc]»𝑁1𝑢\x94𝐹1𝑢¼colored by LCP with𝐶𝑢¸1 »𝑁2𝑢\x94𝐹2𝑢¼colored by LCP with𝐶𝑢¸1\ncolored by LCP withR/e.sc/t.sc¹𝐶𝑢\x001º\nLCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5\nBeavers are interesting animals that Beavers are interesting animals that .Beaversbuildtheir lodges in pon naw them into smaller sectionsand d\nlive near rivers. They build live near rivers. They build ds they have createdin wooded areas rag them intothe water.Engineers\n.Like many thingsin nature, there ofthe PondBeaversare interesting\nis a connection between creaturesi animals because they changethe hab\nn the wild.Beaverponds causetree itat in which theylive.Beaversdo\ns todrown, butthe dead trees attra this by blockingup streams to creat\nct the great blue heron, which often e ponds. Then theybuildtheir homes\nreturn year after year. Over time, , calledlodges, inthese ponds. Bea\na beaverpond can attract more than vers’ bodies make them well-suited f\n50 nestsin a colony, calleda rooke or underwater building Special muscl\nry.An example of this canbe found es close offtheir noses, ears, and\nin the large pond off Bradford Road throats to keep the water out. Beave\nat Carter Fields nearthe Boxford l rs’ broadtails act like rudders for\nine.Chris Leahy, an expert withth steering. Their two very large, ora\ne Massachusetts Audubon Society who nge frontteeth are usedto gnaw dow\nwrote n trees. They begin buildingtheir d\nam\ndams to create ponds. Frogs are am their houses called beaverdams in , then they mean that youare veryb ar-like tail, andtwo protrudingtee\nphibians, so they can live in both l the riverbeds. They alsoliveon lan usy. Beaversswim easily in streams, th that are strongenoughto gnaw do\nand and water. They have great camou d.Beaversuse their strong teeth an picking uprocks and sticks to buil wn trees. Thebeaveruses trees, bra\nflage to hide from predators. The G d strong jawsto cut downtrees and d their dams. They gnaw at trees wit nches, andmudto builddams across\nolden Retriever, or Golden, is a ver branches to buildtheir homes. They h their big frontteeth to cut them riversand streams. These dams creat\ny big dog. 
They are very strong, and also usetheir teeth andjawsto che down. Then they useparts of the tre e a deep poolofwater in whichthe\nhave a thick coat to help them live w up food. Beaversuse their big, fl es tobuildtheir houses.Beavers ar beaverbuilds its home. A beaverhom\nin at tails toswim. They use e clever builders. They know exactly e is calleda lodge. A babybeavero\nwhat they need to buildtheir beave r “kit” remains inthe family lodge\nr dams. They use mud fromthe stream until the age oftwo.Beaverfur, kn\nto maketheir dams stay together. T own as pelt, was once highly popular\nhey use their tails to pat downthe asa trim for hatsand coats. How m\nmud.Beaversput a snug roomat the ight the popularity of beaverfur co\ntop of theirdams fortheir babies. ntributedto thecolonizationofNew\nThey store their food underwater. Be York?wwwWw\naverseat the bark fromthe\ncold areas. A kangaroo is an anima their sharp teeth to gnaw atrocks , thenthey mean that you are very b w attrees with their big frontteet\nl that is very large. It can jump ve and cut trees. Theyuse their flat t usy. Beaversswim easily in streams, h to cutthem down. Then they use pa\nry high. They can run very fast. A ails to steer whenthey areswimming picking up rocks and sticks to buil rts of the trees tobuildtheir hous\nmeerkat is a small animal that lives .In the winter, beavershaveto bui d their dams. They gnaw attrees wit es.Beavers are clever builders. The\nin Africa. They live in colonies of ld a house calleda beaverlodge. Th h their big frontteeth to cutthem y know exactly whatthey need to bui\nup to a hundred of them. They can c ey build ahouse that is very strong down. Then they use parts of the tre ld their beaver dams. They use mud f\nlimb trees very easily. . The wallsare made oftwigs. The r es tobuildtheir houses.Beavers ar rom the stream to maketheir dams st\noofis made e clever builders. They know exactly ay together. They use their tails to\nwhat they need to buildtheir beave pat downthe mud.Beavers put a snu\nr dams. They use mud fromthe stream g room atthe top oftheir dams for\nto maketheir dams stay together. T their babies. They store their food\nhey use their tails to pat downthe underwater. Beavers eatthe bark fro\nmud.Beavers put a snug room atthe m the treesthat they cutdown!1. W\ntop oftheir dams fortheir babies. hat isthe main ideaof thefirst pa\nThey store their food underwater. Be ragraph?.2. What isthe main ideao\navers eatthe bark fromthe f thesecond paragraph?\nA mouse is a small mammal that lives ofbranches and other treeparts. T\non land. It is a very good climber hey also use their strong jawsto cu\nand it can run very fast. Penguins t trees. They bring them to theirho\nare birds that live on Antarctica. T use. They alsouse their sharp teeth\nhey have a thick coat to keep them w to chew up thetree parts. They use\narm. Rabbits are small animals that their flat tails to swim to thetop\nlive in the ground. They oftheir house. Then they use their\nteeth andjawsto chew up thetree\n\nImproving language models by retrieving from trillions of tokens\nTable 7jSample - Hamlet, Act 1, Scene 1.The R/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]sample has correct syntax but is hallucinated,\nand ends with repetition of one character (FRANCISCO Approach me not). 
TheR/e.sc/t.sc/r.sc/o.sc[O/n.sc]sample is the\ncorrect continuation of the original text, and is robust to formatting differences between our prompt and the\nretrieved data.\nPrompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]Prompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/n.sc]»𝑁1𝑢\x94𝐹1𝑢¼colored by LCP with𝐶𝑢¸1 »𝑁2𝑢\x94𝐹2𝑢¼colored by LCP with𝐶𝑢¸1\ncolored by LCP withR/e.sc/t.sc¹𝐶𝑢\x001º\nLCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5\nACT I SCENE I. Elsinore. A pla ACT I SCENE I. Elsinore. A pla ACTI SCENE I.Elsinore. A pla ><TEXT>ACTI SCENE I.Elsin\ntform before the castle. FRANC tform before the castle. FRANC tform beforethe castle. (FRAN ore. A platform beforethe cas\nISCO at his post. Enter to him ISCO at his post. Enter to him CISCO athis post. Enter to hi tle. FRANCISCO at his post. E\nBERNARDO BERNARDO Who’s there BERNARDO BERNARDO Who’s there m BERNARDO) BERNARDO Who’s the nter to him BERNARDO BERNARDO\n? FRANCISCO Nay, answer me: st ? FRANCISCO Nay, answer me: st re? FRANCISCO Nay, answer me: Who’s there? FRANCISCO Nay, an\nand, and unfold yourself. BERN and, and unfold yourself. BERN stand, and unfold yourself. BE swer me: stand, and unfold you\nARDO ARDO RNARDO Long livethe king! FRA rself. BERNARDO Long livethe\nNCISCO Bernardo? BERNARDO He. king! FRANCISCO Bernardo? BERN\nFRANCISCO You come most carefu ARDO He. FRANCISCO You come mo\nlly upon your hour. BERNARDO ’ st carefully upon your hour. B\nTis now struck twelve; get the ERNARDO ’Tis now struck twelve\ne to bed, Francisco. FRANCISCO ; get thee to bed, Francisco.\nFor this relief much thanks: FRANCISCO For this relief much\n’tis bitter cold, And I am sic thanks: ’tis bitter cold, And\nk at heart. BERNARDO Have you I am sick at heart.\nWho calls ? FRANCISCO I am th Long livethe king! FRANCISCO Long live the king! FRANCISCO live the king! FRANCISCO Bern\ne lord here; I, Francisco, tha Bernardo? BERNARDO He. FRANCI Bernardo? BERNARDO He. FRANCI ardo? BERNARDO He. FRANCISCO Y\nt am sick of grief. [ Aside. B SCO You come most carefully up SCO You come most carefully up ou come most carefully upon yo\nERNARDO The king ! FRANCISCO I on your hour. BERNARDO ’Tis no on your hour. BERNARDO ’Tis no ur hour. BERNARDO ’Tis now str\nam sick of that also. BERNARD w struck twelve; get thee to b w struck twelve; get thee to b uck twelve: get thee to bed, F\nO My lord ? FRANCISCO Do not a ed, Francisco. FRANCISCO For t ed, Francisco. FRANCISCO For t rancisco. FRANCISCO For this r\npproach me. BERNARDO his relief much thanks: ’tis b his relief much thanks: ’tis b elief much thanks: ’tis bitter\nitter cold, And I am sick at h itter cold, And I am sick at h cold, And I am sick at heart.\neart. B eart.</TEXT></DOC><DOC><DO BERNARDO Haveyou had quiet g\nCNO>romeo</DOCNO><TEXT>ACTI uard? FRANCISCO Not a mouse st\nPROLOGUE Two households, bo irring. BERNARDO Well, good ni\nth alike in dignity, In fair V ght. Ifyoudo meet Horatio and\nerona, where we lay our scene, Marcellus, The rivals2 ofmy\nFrom ancient grudge break to watch, bid them make haste. FR\nnew mutiny, ANCISCO I think I hear them.—\nStand, ho! who is there?EN\nFrancisco, I would speak with ERNARDO Haveyou had quiet gua had quiet guard? FRANCISCO No ARDO Haveyouhad quiet guard?\nyou. FRANCISCO Approach me not rd? FRANCISCO Not a mouse stir t a mouse stirring. BERNARDO W FRANCISCO Not a mouse stirrin\n, but speak. BERNARDO Your han ring. BERNARDO Well, good nigh ell, goodnight. If youdo mee g. BERNARDO Well, goodnight.\nd, your voice FRANCISCO I will t. 
Ifyou domeet Horatio and t Horatioand Marcellus, The r Ifyou do meet Horatioand Marc\nnot hear thee speak. BERNARDO Marcellus, The rivalsofmy wa ivals ofmy watch, bid them ma ellus, The rivals2 ofmy watch\nFrancisco, your hand, I entre tch, bid them make haste. FRAN ke haste. FRANCISCO I think I , bid them make haste. FRANCIS\nat thee. FRANCISCO Approach me CISCO I think I hear them. Sta hear them. Stand, ho! Who’s th CO I think I hear them.— Stand\nnot. BERNARDO Francisco FRANC nd, ho! who is there?Enter ere? (Enter HORATIO and MARCEL , ho! who is there? ENTER HORA\nLUS) HORATIO Friends to this g TIO AND MARCELLUS. HORATIO Fri\nround. MARCELLUS And liegemen ends to this ground. MARCELLUS\nto the Dane. FRANCISCO Give yo And liegemen to the Dane.3 FR\nu good night. MARCELLUS O, far ANCISCO Giveyougood night. M\newell, honest soldier: Who hat ARCELLUS O, farewell, honest s\nh relieved you? FRANCISCO Bern oldier: Who hath relieved you?\nardo hasmy place. Giveyou go FRANCISCO Bernardo hath my pl\nod night.(Exit ace. Give you good night\nISCO Approach me not. BERNARDO HORATIO and MARCELLUSHORATIO\nI have a letter FRANCISCO App Friends to this ground. MARCE\nroach me not. BERNARDO For the LLUS And liegemen to the Dane.\nking. FRANCISCO Approach me n FRANCISCO Give you good night\not. BERNARDO There’s no treaso . MARCELLUS O, farewell, hones\nn in’t. FRANCISCO Approach me t soldier: Who hath relieved y\nnot. BERNARDO I will ou? FRANCISCO Bernardo hath my\nplace. Give you good night.\n\nImproving language models by retrieving from trillions of tokens',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000,  0.4898, -0.0854],
#         [ 0.4898,  1.0000, -0.0389],
#         [-0.0854, -0.0389,  1.0000]])
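
The same two calls, encode and similarity, can be reused for a simple semantic-search style ranking. The query and corpus below are made up for illustration:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Pravallika6/cross-domain-embeddings")

# Hypothetical query and corpus; rank documents by cosine similarity to the query.
query = "How can attention be used to improve medical image segmentation?"
corpus = [
    "Attention gates suppress irrelevant regions in CT scans for organ segmentation.",
    "Retrieval-augmented language models condition generation on documents fetched from a large corpus.",
    "Graph neural networks learn molecular representations for drug discovery.",
]

query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)

scores = model.similarity(query_embedding, corpus_embeddings)  # shape [1, 3]
best = int(scores[0].argmax())
print(corpus[best], float(scores[0][best]))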

Training Details

Training Dataset

Unnamed Dataset

  • Size: 2,970 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    • sentence_0 (string): min 8 tokens, mean 25.32 tokens, max 80 tokens
    • sentence_1 (string): min 170 tokens, mean 252.85 tokens, max 256 tokens
  • Samples:
    • sentence_0: What methods do language models use to predict mutation effects on proteins?
    • sentence_1: [ABSTRACT]
    Data augmentation is an important component in the robustness evaluation of models
    in natural language processing (NLP) and
    in enhancing the diversity of the data they
    are trained on. In this paper, we present
    NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features).
    We describe the framework and an initial set
    of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using
    several of its transformations to analyze the robustness of popular natural language models.
    The infrastructure, datacards and robustness
    analysis results are available publicly on the
    NL-Augmenter repository (https://github.
    com/GEM-benchmark/NL-Augmenter).

    [1 Introduction]
    Data augmentation, the act of creating new datapoints by slightly modifying copies or creating
    synthetic da...
    • sentence_0: How can I efficiently model long sequences in machine learning?
    • sentence_1: [ABSTRACT]
    Many real-world applications require the prediction of long
    sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands
    a high prediction capacity of the model, which is the ability
    to capture precise long-range dependency coupling between
    output and input efficiently. Recent studies have shown the
    potential of Transformer to increase the prediction capacity.
    However, there are several severe issues with Transformer
    that prevent it from being directly applicable to LSTF, including quadratic time complexity, high memory usage, and inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based
    model for LSTF, named Informer, with three distinctive characteristics: (i) a ProbSparse self-attention mechanism, which
    achieves O(Llog L) in time complexity and memory usage,
    and has comparable performance on sequences’ dependency
    alignment. (ii) the self-attention distill...
    • sentence_0: What methods exist for learning from interconnected datasets?
    • sentence_1: [ABSTRACT]
    How to obtain informative representations of molecules is a crucial prerequisite
    in AI-driven drug design and discovery. Recent researches abstract molecules as
    graphs and employ Graph Neural Networks (GNNs) for molecular representation
    learning. Nevertheless, two issues impede the usage of GNNs in real scenarios:
    (1) insufficient labeled molecules for supervised training; (2) poor generalization
    capability to new-synthesized molecules. To address them both, we propose a
    novel framework, GROVER, which stands for Graph Representation frOm selfsuperVised mEssage passing tRansformer. With carefully designed self-supervised
    tasks in node-, edge- and graph-level,GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data. Rather, to
    encode such complex information, GROVER integrates Message Passing Networks
    into the Transformer-style architecture to deliver a class of more expressive encoders of molecules. The flexibility of GROVER...
  • Loss: MultipleNegativesRankingLoss with these parameters (a minimal training sketch is shown after this parameter list):
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false,
        "directions": [
            "query_to_doc"
        ],
        "partition_mode": "joint",
        "hardness_mode": null,
        "hardness_strength": 0.0
    }
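
A minimal, hypothetical training sketch follows, showing how a comparable fine-tuning run could be set up with the Sentence Transformers trainer API. It assumes a pairs dataset with the sentence_0/sentence_1 columns described above and the non-default hyperparameters listed in the next section; the remaining loss options shown above are left at their library defaults:

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Hypothetical (query, abstract) pairs standing in for the 2,970-sample dataset.
train_dataset = Dataset.from_dict({
    "sentence_0": ["How can I efficiently model long sequences in machine learning?"],
    "sentence_1": ["[ABSTRACT] Many real-world applications require the prediction of long sequence time-series ..."],
})

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
loss = MultipleNegativesRankingLoss(model, scale=20.0)  # in-batch negatives, cosine similarity

args = SentenceTransformerTrainingArguments(
    output_dir="cross-domain-embeddings",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()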
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • per_device_train_batch_size: 32
  • num_train_epochs: 3
  • max_steps: -1
  • learning_rate: 5e-05
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_steps: 0
  • optim: adamw_torch_fused
  • optim_args: None
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • optim_target_modules: None
  • gradient_accumulation_steps: 1
  • average_tokens_across_devices: True
  • max_grad_norm: 1
  • label_smoothing_factor: 0.0
  • bf16: False
  • fp16: False
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • use_liger_kernel: False
  • liger_kernel_config: None
  • use_cache: False
  • neftune_noise_alpha: None
  • torch_empty_cache_steps: None
  • auto_find_batch_size: False
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • include_num_input_tokens_seen: no
  • log_level: passive
  • log_level_replica: warning
  • disable_tqdm: False
  • project: huggingface
  • trackio_space_id: trackio
  • eval_strategy: no
  • per_device_eval_batch_size: 32
  • prediction_loss_only: True
  • eval_on_start: False
  • eval_do_concat_batches: True
  • eval_use_gather_object: False
  • eval_accumulation_steps: None
  • include_for_metrics: []
  • batch_eval_metrics: False
  • save_only_model: False
  • save_on_each_node: False
  • enable_jit_checkpoint: False
  • push_to_hub: False
  • hub_private_repo: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_always_push: False
  • hub_revision: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • restore_callback_states_from_checkpoint: False
  • full_determinism: False
  • seed: 42
  • data_seed: None
  • use_cpu: False
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • dataloader_prefetch_factor: None
  • remove_unused_columns: True
  • label_names: None
  • train_sampling_strategy: random
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • ddp_backend: None
  • ddp_timeout: 1800
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • deepspeed: None
  • debug: []
  • skip_memory_metrics: True
  • do_predict: False
  • resume_from_checkpoint: None
  • warmup_ratio: None
  • local_rank: -1
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
  • router_mapping: {}
  • learning_rate_mapping: {}
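For orientation, here is a minimal sketch of how the non-default values above (batch size 32 per device, round-robin multi-dataset sampling) together with the listed defaults (3 epochs, learning rate 5e-5, seed 42) could be passed to the Sentence Transformers trainer. The output directory and the tiny in-line dataset are placeholders, not taken from this card:

    from datasets import Dataset
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import MultipleNegativesRankingLoss
    from sentence_transformers.training_args import MultiDatasetBatchSamplers

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    loss = MultipleNegativesRankingLoss(model)

    # Placeholder (question, abstract) pair; the real training set is the pair
    # dataset described earlier in this card.
    train_dataset = Dataset.from_dict({
        "anchor": ["How can I efficiently model long sequences in machine learning?"],
        "positive": ["Many real-world applications require the prediction of long sequence time-series ..."],
    })

    # Mirror the non-default hyperparameters listed above; everything else is
    # left at the library defaults shown in the expanded list.
    args = SentenceTransformerTrainingArguments(
        output_dir="output",  # hypothetical path
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
        num_train_epochs=3,
        learning_rate=5e-5,
        seed=42,
    )

    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        loss=loss,
    )
    trainer.train()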

Framework Versions

  • Python: 3.13.11
  • Sentence Transformers: 5.3.0
  • Transformers: 5.5.0
  • PyTorch: 2.11.0+cu130
  • Accelerate: 1.13.0
  • Datasets: 4.8.4
  • Tokenizers: 0.22.2
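To compare a local environment against the versions recorded above, a small check such as the following can be used (a convenience snippet, not part of the original training setup):

    import importlib.metadata as metadata

    # Print whichever versions are installed locally so they can be compared
    # with the Framework Versions listed in this card.
    for pkg in ("sentence-transformers", "transformers", "torch",
                "accelerate", "datasets", "tokenizers"):
        try:
            print(f"{pkg}: {metadata.version(pkg)}")
        except metadata.PackageNotFoundError:
            print(f"{pkg}: not installed")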

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{oord2019representationlearningcontrastivepredictive,
      title={Representation Learning with Contrastive Predictive Coding},
      author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
      year={2019},
      eprint={1807.03748},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1807.03748},
}