metadata
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:2970
- loss:MultipleNegativesRankingLoss
base_model: sentence-transformers/all-MiniLM-L6-v2
widget:
- source_sentence: Can someone explain how to model dependencies in relational information?
sentences:
- >-
[Preamble]
Generalized Boozer coordinates: a natural coordinate system for
quasisymmetry
E. Rodríguez,1,2,a) W. Sengupta,1,2 and A. Bhattacharjee1,2
1)Department of Astrophysical Sciences, Princeton University, Princeton,
NJ,
2)Princeton Plasma Physics Laboratory, Princeton, NJ, 08540
a)Author to whom correspondence should be addressed: eduardor@princeton.edu
(Dated: 22 September 2021)
We prove the existence of a straight-field-line coordinate system we call
generalized Boozer coordinates. This
coordinate system exists for magnetic fields with nested toroidal flux
surfaces provided
$\oint (\mathrm{d}l/B)\,(j\cdot\nabla\psi) = 0$,
where symbols have their usual meaning, and the integral is taken along
closed magnetic field lines. All
quasisymmetric fields, regardless of their associated form of equilibria,
must satisfy this condition. This
coordinate system presents itself as a convenient form in which to
describe general quasisymmetric configurations and their properties.
Insight can be gained analytically into the difference between strong and
weak
forms of quasisymmetry, as well as axisymmetry, and the interaction of
quasisymmetry with different forms
of equilibria.
I. INTRODUCTION
Quasisymmetric stellarators1–3 (sometimes referred to
as helically-symmetric4,5) constitute an attractive choice
for magnetic confinement fusion. Theoretically, such designs exhibit transport properties analogous to those of axisymmetric devices6 while possessing larger three-dimensional freedom. The latter allows avoiding some of the inherent limitations of tokamaks. The quasisymmetric stellarator achieves this by possessing a hidden symmetry: the magnitude of the magnetic field |B| is symmetric, while the full vector B is not.
The concept of quasisymmetry (QS) is elegant, but
it appears to have a significant theoretical limitation. It
was soon realized7 that constructing a configuration with
exact QS is not possible. Although there is no definitive
proof that shows this impossibility, work on near-axis
expansions7,8 supports this point of view. The governing
system of equations is overdetermined: there are more
constraints than degrees of freedom. This limitation does
not, however, prevent designs that exhibit behavior close
to QS in a volume. 9,10
Recently, studies that explore the concept of QS more
deeply have appeared.3,11 The main idea has been to separate the concept
of QS from that of macroscopic force
balance. In these studies, QS is defined as a property of the configuration that confers on the single-particle dynamics an approximately conserved quantity, without making any statement about the form of macroscopic equilibrium. This perspective differs significantly
from the
standard approach. Prior to [3] and [12] (and with very
few exceptions5,13), the concept of QS was framed in the
context of magnetohydrostatic (MHS) equilibria satisfying j ×B = ∇p,
where j = ∇×B is the current density
and p is the scalar pressure. As MHS is not intrinsic to
QS, it is important to define QS without reference to a
particular form of equilibrium. Separating QS from equilibrium allows us
to investigate more deeply its meaning
and limitations.
Abandoning the convenient form of MHS equilibrium,
although conceptually appropriate, comes at a cost. The
magnetic field no longer needs to satisfy the condition j ·∇p = 0 ,
implicitly assumed in most of the
literature1,4,6. Hence, one cannot guarantee the existence
of Boozer coordinates 2 as presently understood, even if
magnetic flux surfaces exist. Boozer coordinates are particularly
convenient for studying QS, as they present the symmetry in an explicit, simple form. This leads to a
search for an analogous, convenient, but more general
coordinate system for quasisymmetric configurations.
In this paper, we construct such a coordinate system,
which we call generalized Boozer coordinates (GBC).
This coordinate system was used to formulate the near-axis expansion in
[14], in which a QS equilibrium with
anisotropic pressure was shown to avoid the conventional
problem of overdetermination. The present paper is organized as follows.
First, we introduce, develop and discuss
this coordinate system systematically and rigorously. We
start by presenting a constructive proof for GBC and the
class of fields for which it exists. We then present the set
of equations describing a quasisymmetric magnetic field
in this coordinate system. This gives us the opportunity
to gain an alternative 3,12 perspective on the distinction
between the so-called weak and strong forms of quasisymmetry, as well as
a comparison to axisymmetry. We close
by summarizing the equations that link the equilibrium and the quasisymmetric field, and with concluding remarks.
II. GENERALIZED BOOZER COORDINATES
A. Explicitly symmetric formulation of quasisymmetry
Let us start this section by introducing the notion of
QS. We consider QS from the recent general perspective
of single particle motion.3,11 QS (and in particular weak QS) is the property of the field that confers on the guiding-centre motion of single particles an approximately conserved quantity.3 For the particle dynamics to exhibit this conservation it is necessary for the magnetic field to have nested flux surfaces, $B\cdot\nabla\psi = 0$, and to satisfy
$$\nabla\psi\times\nabla B\cdot\nabla(B\cdot\nabla B) = 0. \tag{1}$$
This form of quasisymmetry is most commonly referred
to as the triple vector product formulation. Although
this is not the form that comes directly from the single-particle analysis (that form is the one used in Sec. III A),
it is a succinct way to impose QS on a magnetic field.
Given that this generalized concept of QS has a single
particle origin, no notion of macroscopic equilibrium is
involved. Of course, from a practical point of view, any
steady field of interest will be in some form of force balance. With the
definition of QS in (1), different forms of
equilibria may be investigated to understand how they
interact with QS. One may generally refer to the macroscopic forces by
F, and we do not attempt to assess their
origin in this paper. Instead, we focus on the requirements at a fluid
level. A complete view of the problem would require an investigation of
the kinetic basis
of the forces to link the fluid forces to microscopic behavior. This is
particularly important as microscopic
forces could break the symmetry (and with it, the QS-related conserved momenta). An important example is
the electrostatic potential, whose symmetry is needed to
the appropriate order and imposes constraints even on
the forces that arise from the plasma, such as centrifugal
force or anisotropic pressure. With this in mind, we shall
focus on the macroscopic aspects.
Looking at the statement of QS in the form presented
in Eq. (1), the existence of a symmetry in the problem
is not apparent. In the usual context of j ×B = ∇p,
this absence of obvious symmetry can be amended by
using Boozer coordinates. In these coordinates, a field in
magnetostatic equilibrium with well-defined flux surfaces
is quasisymmetric (aside from quasipoloidal symmetry)
iff
$$B = B(\psi, \theta - \tilde\alpha\varphi), \tag{2}$$
where $\{\psi, \theta, \varphi\}$ are Boozer coordinates, $\psi$ the flux surface label, $\theta$ and $\varphi$ poloidal and toroidal angles respectively, and $\tilde\alpha = N/M$ with $M, N \in \mathbb{Z}$. In order to employ Boozer coordinates, the $j\cdot\nabla\psi = 0$ property of MHS equilibria is central. Boozer coordinates have the standard straight field-line coordinate Jacobian
$$J = \frac{B_\varphi + \iota B_\theta}{B^2}, \tag{3}$$
where the covariant $B_\theta$ and $B_\varphi$ are flux functions,
$$B = B_\theta(\psi)\nabla\theta + B_\varphi(\psi)\nabla\varphi + B_\psi\nabla\psi.$$
Boozer coordinates are widely used in stellarator theory
and applications. The coordinates are a natural set that
simplify many of the governing equations, including QS.
In particular, construction of solutions by near-axis expansion of
three-dimensional fields is most convenient in
these coordinates7,8,14–16.
In the context of our general quasisymmetric B, it is
not necessary to assume that j ·∇ψ = 0. However, we
demonstrate that any quasisymmetric field must satisfy $\oint (\mathrm{d}l/B)(j\cdot\nabla\psi) = 0$ (see Appendix A). The question
that naturally arises is, can we construct an appropriate straight field
line coordinate system that explicitly
expresses the symmetry in QS in the Boozer fashion?
The answer is yes. To see this, we write Eq. (1) in
straight-field line coordinates {ψ,θ,φ } using the contravariant
representation B = ∇ψ×(∇θ−ι∇φ),
$$(\nabla\psi\times\nabla\theta\,\partial_\theta B + \nabla\psi\times\nabla\varphi\,\partial_\varphi B)\cdot\nabla\!\left(J^{-1}(\partial_\varphi B + \iota\,\partial_\theta B)\right) = 0,$$
where $J$ is the Jacobian associated with the chosen coordinate system. Now assume we can construct a straight-field line coordinate system with a Jacobian of the form $J = J(\psi, B)$; then the Jacobian factor may be dropped from the equation above to obtain
$$(\nabla\psi\times\nabla\theta\,\partial_\theta B + \nabla\psi\times\nabla\varphi\,\partial_\varphi B)\cdot\nabla(\partial_\varphi B + \iota\,\partial_\theta B) = 0,$$
i.e.,
$$\left[\partial_\varphi - \left(\frac{\partial_\varphi B}{\partial_\theta B}\right)\partial_\theta\right](\partial_\varphi + \iota\,\partial_\theta)B = 0,$$
assuming that $\partial_\theta B \neq 0$. (This makes quasi-poloidally symmetric solutions a special case.) From near-axis expansion we know that these solutions have a very restricted QS region3.
Commuting operators, we obtain
$$(\partial_\varphi + \iota\,\partial_\theta)\!\left(\frac{\partial_\varphi B}{\partial_\theta B}\right)\partial_\theta B = (b\cdot\nabla)\!\left(\frac{\partial_\varphi B}{\partial_\theta B}\right)\partial_\theta B = 0,$$
which implies that $B = B(\psi, \theta - \tilde\alpha\varphi)$ or $B = B(\psi, \varphi)$, where $\tilde\alpha = -\partial_\varphi B/\partial_\theta B$ is a flux function. The additional requirement that $\tilde\alpha$ is rational, to avoid $B = B(\psi)$ at non-degenerate surfaces, requires $\tilde\alpha$ to be constant.3
In summary, if we are able to construct a straight field
line coordinate system that has a Jacobian that can be
written as a function of ψ and B only, then a field is QS
in the weak sense if and only if B depends on a single
linear combination of those coordinate angles. Note that
under this assumption, the reverse direction of the proof
is straightforward.
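To make the last remark concrete, here is a minimal sketch of that reverse direction (not part of the original text; it assumes only the definitions already introduced, namely $\chi = \theta - \tilde\alpha\varphi$ with $\tilde\alpha$ constant and $J = J(\psi, B)$). If $B = B(\psi, \chi)$, then $\partial_\theta B = \partial_\chi B$ and $\partial_\varphi B = -\tilde\alpha\,\partial_\chi B$, so
$$B\cdot\nabla B = J^{-1}(\partial_\varphi B + \iota\,\partial_\theta B) = \bar\iota\, J^{-1}\partial_\chi B \equiv f(\psi, B),$$
since $\partial_\chi B$ may be expressed locally as a function of $\psi$ and $B$. Consequently,
$$\nabla\psi\times\nabla B\cdot\nabla\big(B\cdot\nabla B\big) = \nabla\psi\times\nabla B\cdot\big(\partial_\psi f\,\nabla\psi + \partial_B f\,\nabla B\big) = 0,$$
which is precisely Eq. (1).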
B. Constructing generalized Boozer coordinates
We define generalized Boozer coordinates (GBC) as a set of straight field line coordinates whose Jacobian can be written in the form $J = B_\alpha(\psi)/B^2$, where $B_\alpha$ is some flux function, without requiring $j\cdot\nabla\psi$ to vanish identically. This choice is more general than what is permitted by Boozer coordinates, which separately require $B_\theta$ and $B_\varphi$ to be flux functions. We shall now constructively explore the
conditions under which such a coordinate system exists.
Let us start with a given magnetic field, assuming that
it lies on well-defined flux surfaces. It then follows 2,17,18
that there exists some initial straight field line coordinate
system {ψ,θ,φ }, so that B = ∇ψ×∇θ+ι∇φ×∇ψ. The
definition of the angular straight field line coordinates is
arbitrary up to a transformation of the form
$$\theta = \theta' + \iota\,\omega, \qquad \varphi = \varphi' + \omega.$$
The function $\omega = \omega(\psi,\theta,\varphi)$ is a well-behaved periodic function that defines a family of transformations18. The
periodicity of ω preserves the toroidal and poloidal character of the
two angular coordinates.
Starting with some given straight-field line coordinate
system, we want to understand how to construct a set
with a Jacobian of the GBC form. Thus, we need to
know how to transform the coordinate Jacobian induced
by ω. This reads
$$J'^{-1} = \nabla\psi\times\nabla(\theta - \iota\omega)\cdot\nabla(\varphi - \omega).$$
Here $\{\psi, \theta', \varphi'\}$ represents the newly defined straight field line coordinates whose associated Jacobian is $J'$. This equation may be recast into the form of a magnetic differential equation,
$$B\cdot\nabla\omega = \frac{1}{J} - \frac{1}{J'}. \tag{4}$$
Now, let us require19
$$J' = \frac{B_\alpha(\psi)}{B^2}. \tag{5}$$
In order for the magnetic differential equation Eq. (4)
to have a single-valued solution for the transformation
function ω, Eq. (4) must satisfy Newcomb’s criterion 20.
According to Newcomb, for a magnetic differential equation $B\cdot\nabla f = s$ to have a single-valued solution for $f$, the source term $s$ must satisfy the line-integral condition along a field line,
$$\oint \frac{s\,\mathrm{d}l}{B} = 0. \tag{6}$$
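The necessity of (6) can be spelled out in one line (a standard argument, restated here for completeness rather than taken from the text): along a closed field line parameterized by arc length $l$, $B\cdot\nabla f = B\,\mathrm{d}f/\mathrm{d}l$, so for a single-valued $f$
$$\oint \frac{s\,\mathrm{d}l}{B} = \oint \frac{(B\cdot\nabla f)\,\mathrm{d}l}{B} = \oint \frac{\mathrm{d}f}{\mathrm{d}l}\,\mathrm{d}l = 0,$$
since the integral of an exact derivative around a closed loop vanishes.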
For our problem $s = 1/J - 1/J'$. Start with
$$\oint \frac{1}{J}\frac{\mathrm{d}l}{B} = \oint (B\cdot\nabla\varphi)\frac{\mathrm{d}l}{B} = \oint \mathrm{d}\varphi = 2\pi n.$$
Here we have used the definition of the Jacobian in terms of the magnetic field, $B\cdot\nabla\varphi = \nabla\psi\times\nabla\theta\cdot\nabla\varphi = 1/J$, where $\varphi$ and $\theta$ are, respectively, toroidal and poloidal angular coordinates that increase by $2\pi$ in going around the torus the long or short way. We also considered the rotational transform $\iota = n/m$, where $n, m \in \mathbb{Z}$ are coprime.
FIG. 1. Ribbon surface. Ribbon defined by two closely
lying rational magnetic field lines labelled by C1 and C2.
Now, consider Newcomb's condition on the last term on the right-hand side of Eq. (4), and define the closed line integral $I$ so that
$$\oint \frac{1}{J'}\frac{\mathrm{d}l}{B} = \frac{I}{B_\alpha(\psi)}. \tag{7}$$
For Newcomb's condition on Eq. (4) to hold, the following must be true:
$$B_\alpha(\psi) = \frac{I}{2\pi n}. \tag{8}$$
With this choice,
$$\oint (B\cdot\nabla\omega)\frac{\mathrm{d}l}{B} = 0. \tag{9}$$
Then there must exist a single-valued solution $\omega$ that enacts the coordinate transformation into the set associated with the Jacobian (5).
For Eq. (8) to hold, it is necessary for $I$ to be a flux function. This condition may be written in the form
$$I(\psi) = \oint B\cdot\mathrm{d}r, \tag{10}$$
along a magnetic field line. Focus on the case of rational
field lines, for which the condition is most stringent. To
proceed further, consider a non-self-intersecting ribbon
over a flux surface bounded by two adjacent magnetic
field lines (see Fig. 1). Denote the line integrals along
each of these lines by $C_1$ and $C_2$. We may then write, using Stokes' theorem,
$$\oint_{C_1} B\cdot\mathrm{d}r - \oint_{C_2} B\cdot\mathrm{d}r = \int_{\mathrm{rib}} j\cdot\mathrm{d}S, \tag{11}$$
where the surface element is taken to be perpendicular to the flux surface. The integral over the surface may be written as18
$$\int_{\mathrm{rib}} j\cdot\mathrm{d}S = \int_{\alpha_0}^{\alpha_0+\delta\alpha}\mathrm{d}\alpha \oint \frac{\mathrm{d}l}{B}\, j\cdot\nabla\psi.$$
Here $\alpha$ labels the field lines on the surface. Now, if $I$ is to be truly a flux function, then following (11) the last surface integral must vanish. And it must do so for all field line labels $\alpha_0$ and adjacent field lines $\delta\alpha$. This gives the necessary and sufficient condition
$$\oint \frac{\mathrm{d}l}{B}\, j\cdot\nabla\psi = 0. \tag{12}$$
This subclass of magnetic fields grants the required form of $I$, and thus a single-valued solution to Eq. (4). The subclass includes fields with the stronger constraint $j\cdot\nabla\psi = 0$ (for which the coordinates reduce to Boozer coordinates) and, more to our concern, the QS scenario (see Appendix A). Note that all magnetic fields that possess nested surfaces have the property $\langle j\cdot\nabla\psi\rangle = 0$, following an application of $\nabla\cdot(B\times\nabla\psi)$; however, the Newcomb condition (12) is more stringent.20
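That remark can be unpacked in a short calculation (added here for clarity; it uses only $j = \nabla\times B$ and the standard flux-surface-average identity $\langle\nabla\cdot F\rangle = (1/V')\,\partial_\psi\big(V'\langle F\cdot\nabla\psi\rangle\big)$):
$$j\cdot\nabla\psi = (\nabla\times B)\cdot\nabla\psi = \nabla\cdot(B\times\nabla\psi),$$
and since $(B\times\nabla\psi)\cdot\nabla\psi = 0$,
$$\langle j\cdot\nabla\psi\rangle = \frac{1}{V'}\frac{\partial}{\partial\psi}\Big(V'\,\big\langle (B\times\nabla\psi)\cdot\nabla\psi\big\rangle\Big) = 0.$$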
Restricting ourselves to this subclass of fields, we can
always choose Bα as in (8). This choice might appear
artificial, especially given the presence of the discrete n.
However, we may relate this to the surface average over
irrational flux surfaces. To see this, we write,
$$B_\alpha = \frac{1}{2\pi n}\oint B^2\,\frac{\mathrm{d}l}{B}
 = \frac{1}{4\pi^2 n}\int_0^{2\pi}\mathrm{d}\alpha\,\underbrace{\oint B^2\,\frac{\mathrm{d}l}{B}}_{n\ \mathrm{turns}}
 = \frac{1}{4\pi^2}\int_0^{2\pi}\mathrm{d}\alpha\,\underbrace{\oint B^2\,\frac{\mathrm{d}l}{B}}_{1\ \mathrm{turn}}
 = \frac{V'}{4\pi^2}\langle B^2\rangle, \tag{13}$$
where $V' = \int_0^{2\pi}\mathrm{d}\alpha\oint \mathrm{d}l/B$ (so that $V'\langle 1\rangle = V'$) is the usual derivative of the volume with respect to $\psi$. $B_\alpha$ is, therefore, a well-behaved quantity for both rational and irrational surfaces.
As it stands, with an appropriate choice of Bα, and
restricting ourselves to the subclass of fields satisfying
Eq. (12), one may perform a coordinate transformation
that provides a Jacobian of the desired form (5). It remains to be shown
that this new coordinate system is
well-behaved. By this we mean that the Jacobian does
not vanish nor diverge. To see this, it is most useful to
rewrite $J'$ in terms of (13),
$$J' = \frac{B_\alpha}{B^2} = \frac{V'}{4\pi^2}\frac{\langle B^2\rangle}{B^2}. \tag{14}$$
It is clear that $J'$ has a definite sign, given by the sign of $V'$. Thus, the Jacobian will never vanish nor diverge, given that $B^2 > 0$.
C. Magnetic field in generalized Boozer coordinates
We have shown that, under the assumption that the magnetic field is quasisymmetric, we have a straight-field-line coordinate system, GBC, with Jacobian $J = B_\alpha(\psi)/B^2$. In this coordinate system, and following Sec. II A, a quasisymmetric field is one whose magnitude can be expressed in GBC as $B = B(\psi, \theta - \tilde\alpha\varphi)$. This form is analogous to the Boozer formulation of QS, but it requires only that the configuration be QS, not that $j\cdot\nabla\psi = 0$.
Before proceeding to analyze some of the properties of
QS and other governing equations in GBC, we explicitly
write the covariant and contravariant forms for B. The
covariant form is
$$B = B_\theta\nabla\theta + (B_\alpha - \iota B_\theta)\nabla\varphi + B_\psi\nabla\psi. \tag{15}$$
The usual covariant function $B_\varphi$ has been replaced by $B_\alpha$, the flux function that appears in the Jacobian (5). The simplicity of (15) shows why it was convenient to choose the Jacobian of GBC to have the particular form in (5). To obtain (15), it is sufficient to take the scalar product of the general covariant form with the contravariant representation,
$$B = \nabla\psi\times\nabla\theta + \iota\,\nabla\varphi\times\nabla\psi, \tag{16}$$
and capitalize on the definition of $J$.
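Spelled out (a short check, not in the original text, using only the general covariant form $B = B_\theta\nabla\theta + B_\varphi\nabla\varphi + B_\psi\nabla\psi$, Eq. (16), and the GBC Jacobian $J = B_\alpha/B^2$): since $\nabla\psi$ is orthogonal to both terms of (16), the scalar product gives
$$B^2 = B\cdot B = \frac{\iota B_\theta + B_\varphi}{J} = \frac{(\iota B_\theta + B_\varphi)\,B^2}{B_\alpha} \quad\Longrightarrow\quad B_\varphi = B_\alpha - \iota B_\theta,$$
which is precisely the $\nabla\varphi$ coefficient appearing in (15).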
Compared to Boozer coordinates, the covariant function $B_\theta$ in GBC is not necessarily a flux function; thus, GBC are an extension of Boozer coordinates. When $j\cdot\nabla\psi = 0$, however, and as previously pointed out, GBC reduce to Boozer coordinates. So far, the forms in (15) and (16) have required only the existence of GBC, and not QS per se, other than to guarantee (12). To enforce the latter, one needs to specify $|B|$ as a symmetric function.14,16,21
III. DESCRIBING WEAK QUASISYMMETRY IN GBC
Having developed GBC, let us see what this coordinate
system can teach us about weak QS. We first write down
the complete set of equations that describe a weakly quasisymmetric
magnetic field. The first relevant equation
equates the covariant (15) and contravariant (16) forms
of the magnetic field. The equation reads
$$B_\theta\nabla\theta + (B_\alpha - \iota B_\theta)\nabla\varphi + B_\psi\nabla\psi = \nabla\psi\times\nabla\theta + \iota\,\nabla\varphi\times\nabla\psi. \tag{17}$$
To specify that we are using GBC, and to introduce the quasisymmetric condition, we require
$$\nabla\psi\times\nabla\theta\cdot\nabla\varphi = \frac{B_\alpha(\psi)}{B^2(\psi, \theta - \tilde\alpha\varphi)}. \tag{18}$$
This set of equations, (17) and (18), is entirely self-contained. It describes a general magnetic field (a vector field satisfying $\nabla\cdot B = 0 = B\cdot\nabla\psi$, without a particular form of equilibrium) that is quasisymmetric —
no more, no less. Such equations have been recently
used in near-axis expansions with anisotropic pressure
equilibria.14,16 Equations (17) and (18), referred to as
the co(ntra)variant and Jacobian equations respectively,
were there expanded systematically. A more thorough
and complete exploration of the expansion of the magnetic equations and
its implications will be presented in
a separate publication.
Equations (17) and (18) apply beyond near-axis expansions.
Alternatively, one could study the behavior of this system perturbatively around a surface rather than the axis.21 This could shed some light on standard optimization approaches to QS9,10.
Beyond the set of equations (17) and (18), the formulation of weak QS in
terms of GBC can be used to compare
weak QS to other (quasi)symmetric forms. This comparison to strong QS and to axisymmetry will help frame the notion of weak QS in the larger space of configurations.
A. Comparison to strong quasisymmetry
Strong QS is a more restrictive form of QS than its weak form3,12. Weak QS is the necessary and sufficient condition for the leading guiding centre dynamics to have an approximately conserved momentum to leading gyrocentre order.3 This condition can be written as in (1), but is most naturally given as
$$u\cdot\nabla B = 0, \tag{19}$$
$$B\times u = \nabla\Phi(\psi), \tag{20}$$
$$\nabla\cdot u = 0. \tag{21}$$
The vector field $u$ is defined by these equations and points in the direction of symmetry. In strong QS the conserved momentum for the particle dynamics is exact for the first-order guiding centre Lagrangian.3,5,12 In the notation that explicitly introduces the symmetry vector $u$, strong QS is equivalent to weak QS, i.e., Eqs. (19)-(21), plus the constraint
j ×u + ∇C = 0, (22)
where C = B·u. Note that only the b component of this
additional constraint is contained in the weak formulation of the
problem. We now explore the significance of
(22) in the context of GBC.
To begin, we need to construct $u$. From Eqs. (19)-(21),3
$$u = \frac{\bar\iota\,\nabla\psi\times\nabla\chi}{B\cdot\nabla\chi}, \tag{23}$$
where $\chi = \theta - \tilde\alpha\varphi$, and it is convenient to use $\chi$ as part of the coordinate triplet $\{\psi, \chi, \varphi\}$. We have made the choice $\Phi' = \bar\iota$ so that $u\cdot\nabla\varphi = 1$, for simplicity; rescaling the flux label for $u$ leaves the weak QS conditions unchanged (see Appendix B for further discussion of the gauge and this particular choice). This form of $u$ is equivalent to Eqs. (19)-(21) only if we enforce $B = B(\psi,\chi)$ and a coordinate system with $B\cdot\nabla\chi = \bar\iota J^{-1}$, with the Jacobian in (5). Here the parameter $\bar\iota = \iota - \tilde\alpha$ has been defined.
From (15) and (23), we find
$$C = B_\alpha - \bar\iota B_\theta. \tag{24}$$
This relation, together with (23) and the definition of $C$, can be used to write the gauge-independent form
$$B\cdot\nabla\psi\times\nabla B = \frac{B_\alpha - \bar\iota B_\theta}{\bar\iota}\, B\cdot\nabla B. \tag{25}$$
The magnetohydrostatic form of this equation has been
used previously18,22.
The contravariant form of $u$ together with (16) gives simple expressions for directional derivatives in GBC; the differential operators can be written as partial derivatives,
$$B\cdot\nabla = J^{-1}(\bar\iota\,\partial_\chi + \partial_\varphi), \qquad u\cdot\nabla = \partial_\varphi. \tag{26}$$
Since the Jacobian is quasisymmetric by (5), the two operators $B\cdot\nabla$ and $u\cdot\nabla$ commute with each other11. This commutation property is made manifest in GBC.
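In these coordinates the commutation is a one-line computation (a sketch added for clarity; it assumes only (26), $J = J(\psi, B)$ and $B = B(\psi, \chi)$, so that $J$ and $\bar\iota$ are independent of $\varphi$): acting on any function $f$,
$$[\,B\cdot\nabla,\ u\cdot\nabla\,]f = J^{-1}(\bar\iota\,\partial_\chi + \partial_\varphi)\,\partial_\varphi f - \partial_\varphi\!\left[J^{-1}(\bar\iota\,\partial_\chi + \partial_\varphi)f\right] = -\big(\partial_\varphi J^{-1}\big)(\bar\iota\,\partial_\chi + \partial_\varphi)f = 0.$$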
The symmetry field u can also be written in the following covariant form:
$$u = u_\psi\nabla\psi + u_\chi\nabla\chi + u_\varphi\nabla\varphi. \tag{27}$$
Taking the scalar product with $u$ and with $B$ we obtain
$$u_\varphi = u^2, \qquad \bar\iota u_\chi + u_\varphi = \frac{C}{J^{-1}} = C J. \tag{28}$$
To complete the family of vectors required for the
strong quasisymmetric condition (22), we need a closed
form for j in GBC. From the curl of the covariant form
of B in Eq. (15), we obtain
$$j = \nabla B_\psi\times\nabla\psi + \nabla B_\theta\times\nabla\chi + \nabla(B_\alpha - \bar\iota B_\theta)\times\nabla\varphi. \tag{29}$$
Using (29) and (24), we can show that
$$j\times u + \nabla C = (u\cdot\nabla B_\psi)\nabla\psi + (u\cdot\nabla B_\theta)(\nabla\chi - \bar\iota\nabla\varphi). \tag{30}$$
The simplicity of (30) is due to the choice of GBC.
Recall that strong QS requires the expression j ×u +
∇C to be identically zero. This means that all the terms
on the right side of (30) need to vanish. That is to say, the
covariant functions (Bψ,Bθ) are required to be quasisymmetric. If Bθ is
quasisymmetric, then C is automatically
so from (24). In an explicit coordinate representation,
using (26), we may write Bθ(ψ,χ) and Bψ(ψ,χ).
Thus, the GBC representation provides an elegant way
to formulate strong QS, which can now be understood as
weak quasisymmetry plus the conditions that Bψ and Bθ
are QS. In other words, not only is B QS but so are Bθ
and Bψ.
a. Implications for near-axis expansion. We refer
to [14] and [16] for a detailed treatment using near-axis
expansions of the weakly quasisymmetric problem. The
procedure is based on expanding all governing equations
describing the weak quasisymmetric field in some form of
equilibrium in powers of the distance from the magnetic
axis. To do so efficiently, both the field and equations
–including Eqs. (17) and (18)– are expressed in GBC. It
was shown there that when equilibria with anisotropic
pressure are considered, the common overdetermination
problem7,8 that limits the expansion is overcome. The
number of governing equations becomes the same as that
of functions to be solved.
Extending the expansion to strong QS is straightforward. From the
discussion above, the only difference
is that the covariant functions Bθ and Bψ are both
quasisymmetric rather than general functions of space.
In practical terms, this simply leads to more restricted
Taylor-Fourier expansions of those functions; the coefficients that were
functions of φ become constants. This
restriction in freedom once again leads us back to the
Garren-Boozer overdetermination problem. In fact, it does so in the same way as it did in the case of MHS equilibrium. The
restriction on the covariant functions imposes
very severe constraints on the allowed geometry. The
only way to escape this impasse is to assume axisymmetry (φ
independent). Once again, consistent with what
we observed in [14], the asymmetry of the covariant representation of B
appears to be vital to the construction
of QS solutions.
B. Comparison to axisymmetry
We have seen that the strong formulation of QS is more
constraining than its weak form. We would also like to
compare QS to the limiting case of axisymmetry. We
shall think of this case as a symmetry generated by rotation in space:
the system is invariant under rotations
about an axis. In Euclidean space, a rotation is an isometry, and it is
generated by a vector field known as a
Killing vector. Using the notion of a Killing vector, we
want to explore how ‘far’ the weak concept of QS is from
this ‘true’ symmetry.
A measure of the departure of a symmetry generator from a Killing vector
is the so-called deformation
metric.12 Taking u to represent the symmetry vector for
QS, the idea is to see how far it is from being a Killing
vector. A vector field $v$ is Killing if and only if the deformation tensor $\mathcal{L}_v g = 0$, where $\mathcal{L}_v$ denotes the Lie derivative along $v$ and $g$ is the Euclidean metric. In 3D, for $v = u$ this may be written as
$$\mathcal{L}_u g = \nabla u + (\nabla u)^T. \tag{31}$$
Evaluating this tensor for a quasisymmetric configuration
should then provide information regarding the closeness
to an isometry. It is convenient12 to evaluate this rank-2
tensor in a basis defined by {B,u,∇ψ}, a triplet that we
shall take to be independent. Then,
$$[\nabla u + (\nabla u)^T]\cdot B = j\times u + \nabla C, \tag{32}$$
$$[\nabla u + (\nabla u)^T]\cdot u = \nabla u^2 - u\times\nabla\times u, \tag{33}$$
$$[\nabla u + (\nabla u)^T]\cdot\nabla\psi = \nabla\psi\times\nabla\times u + 2\,\nabla\psi\cdot\nabla u, \tag{34}$$
where the weak quasisymmetric properties have been
used where necessary. The right-hand side of Eq. (33) is what Burby et al. call $w$.11 We shall explore this vector $w$ in more detail after obtaining an explicit form for $\mathcal{L}_u g$. Using (32)-(34), and projecting once again onto the non-orthogonal basis triplet, we obtain
$$\mathcal{L}_u g = \begin{pmatrix} 0 & u\cdot\nabla C & \nabla\psi\cdot(j\times u + \nabla C) \\ \cdots & u\cdot w & w\cdot\nabla\psi \\ \cdots & \cdots & -\,u\cdot\nabla|\nabla\psi|^2 \end{pmatrix}. \tag{35}$$
The matrix is symmetric by construction. The content
of its elements can be made clearer using GBC explicitly.
The top row, corresponding to (32), has already been
dealt with, as it is precisely the piece corresponding to
strong QS. We made the observation that for this condition to be
satisfied, (30) required u·∇Bθ = 0 = u·∇Bψ.
For the other components, the vector field $w = \omega_u\times u + \nabla u^2$, where $\omega_u = \nabla\times u$, is key. Using the covariant form of $u$ we obtain the curl of $u$ in the form
$$\omega_u = \nabla u_\psi\times\nabla\psi + \nabla u_\chi\times\nabla\chi + \nabla u_\varphi\times\nabla\varphi. \tag{36}$$
Taking the cross product with u, using the orthogonality
of u with ∇ψ and ∇χ, and (24) and (28) we get
$$w = (u\cdot\nabla u_\psi)\nabla\psi + (u\cdot\nabla u^2)\left(\nabla\varphi - \frac{1}{\bar\iota}\nabla\chi\right) - J\,(u\cdot\nabla B_\theta)\,\nabla\chi, \tag{37}$$
which implies that
$$B\cdot w = -\,\bar\iota\, u\cdot\nabla B_\theta, \tag{38}$$
$$u\cdot w = u\cdot\nabla u^2. \tag{39}$$
Most importantly, a vanishing w implies that the covariant components of
the symmetry vector as well as Bθ are
quasisymmetric.
To complete the simplification of the metric tensor, we
invoke $B^2 = (C^2 + |\nabla\psi|^2)/u^2$, which follows from the definition of $u$. This means
$$u\cdot\nabla|\nabla\psi|^2 = B^2\, u\cdot\nabla u^2 - u\cdot\nabla C^2. \tag{40}$$
With this coordinate representation, the dependence of
the various metric pieces is made explicit. We may then
schematically present the dependence of $\mathcal{L}_u g$ as follows:
$$\mathcal{L}_u g \sim \begin{pmatrix} 0 & \partial_\varphi B_\theta & \partial_\varphi B_\theta,\ \partial_\varphi B_\psi \\ \cdots & \partial_\varphi u^2 & \partial_\varphi u^2,\ \partial_\varphi u_\psi,\ \partial_\varphi B_\theta \\ \cdots & \cdots & \partial_\varphi B_\theta,\ \partial_\varphi u^2 \end{pmatrix}. \tag{41}$$
The entries listed are meant to indicate that the corresponding tensor component vanishes if those derivatives do. If the tensor (41) is to vanish, then the
symmetry vector would correspond to a rotation. This is
not surprising if one looks at what it means for the components in (41)
to vanish. Axisymmetry is reached when
the covariant components of the magnetic field and the
symmetry vector are themselves symmetric. The latter is
intimately related to the geometry, as we may see when
writing u ∝∂φx|χ,ψ.
From (41), it follows that in some sense, weak QS is
far from being an isometry. This is so because only one
of the components of the tensor exactly vanishes. The
φ dependence of the functions Bψ, Bθ, u2, and uψ takes
the configuration away from axisymmetry. These apparent four degrees of freedom (especially those involving $u$) may not be independent and may involve highly non-linear combinations; they should ultimately be related through the quasisymmetric magnetic equations. In any case, the field-line dependence is key in distinguishing the weakly quasisymmetric form from, say, an axisymmetric tokamak.
To make a comparative measurement of the departure
from axisymmetry, consider now the case of strong QS.
In this case, following (22), the whole first row (and thus also the first column) of (41) drops. The remaining dependence
also simplifies, and the system is precluded from being
an isometry, a priori, through the φ dependence of u2
and uψ only, which is consistent with the work by Burby
et al.12
Imposing additional properties on the field may also affect the form of the deformation tensor. An example
would be a particular form of force balance. We now
explore how the magnetics and equilibria are linked.
IV. QUASISYMMETRY AND EQUILIBRIA
Let us consider the force balance part of the problem.
Generally, a magnetic equilibrium with some arbitrary
force F reads,
j ×B = F. (42)
As we argued in Sec. II, we are concerned in this work
with a general fluid force F. Its connection to the microphysics is not
considered. Let us express the left-hand
side of (42) in GBC. Using the contravariant form for
the current (29) together with (24), we obtain
$$j\times B = \left[B\cdot\nabla B_\psi - J^{-1}\left(B_\alpha' - \bar\iota' B_\theta\right)\right]\nabla\psi + (B\cdot\nabla B_\theta)(\nabla\chi - \bar\iota\nabla\varphi), \tag{43}$$
which is an explicit coordinate representation of j ×B.
The form of (43) mirrors that of (30); here, magnetic differential operators take the place of the directional derivatives along $u$. We note that (43) does not
have any component along B, as can be checked by taking the dot product
with B.
The form of (43) puts constraints on the allowable
forms for F. As already noted, F ·B = 0 must hold
true. Otherwise the system would be imbalanced, as the magnetic field cannot exert forces along $B$. Because of this reduction in the dimensionality of $F$ and in view of (43), it is convenient to write
$$F = F_\psi\nabla\psi + F_\alpha(\nabla\chi - \bar\iota\nabla\varphi). \tag{44}$$
An alternative form would be to use the contravariant
form of B (16) in (42) to get
$$F = (j\cdot\nabla\chi - \bar\iota\, j\cdot\nabla\varphi)\,\nabla\psi - (j\cdot\nabla\psi)(\nabla\chi - \bar\iota\nabla\varphi). \tag{45}$$
Substituting (43) and (44) into (42) we get two magnetic differential equations,
$$B\cdot\nabla B_\psi = F_\psi + J^{-1}\left(B_\alpha' - \bar\iota' B_\theta\right), \tag{46a}$$
$$B\cdot\nabla B_\theta = F_\alpha = -\,j\cdot\nabla\psi. \tag{46b}$$
Therefore, the generalized force-balance condition is
equivalent to two magnetic differential equations (MDEs)
and B ·F = 0. If solutions to these equations can be
found together with the magnetic equations (17) and
(18), we will have obtained a quasisymmetric configuration in
equilibrium.
Let us describe in more detail the implications of these
equations. First look at the simpler one, (46b). This equation has two pieces to it. First, and regardless of the assumed form for $F_\alpha$, it follows from weak QS that $B\cdot\nabla B_\theta = -j\cdot\nabla\psi$ (see also Appendix A). This imposes the condition (12) of Sec. II B on the field. Secondly, the component $F_\alpha$ of the forcing $F$ directly sets the off-surface current. This means that20
$$\oint \frac{\mathrm{d}\ell}{B}\, F_\alpha = 0. \tag{47}$$
We may not choose $F_\alpha$ arbitrarily: it must satisfy (47), a condition analogous to (12), if the force is to be consistent with QS. Then the magnetic differential equation can be satisfied, and one can directly relate $B_\theta$ and $F_\alpha$ up to a flux function. In Fourier representation, it is clear that the $\varphi$-content of $B_\theta$ will be non-zero only if that of $F_\alpha$ is (and vice versa). In the light of (41), choosing $\partial_\varphi(j\cdot\nabla\psi) = 0$ brings the quasisymmetric configuration closer to an isometry. This freedom in the form of $B_\theta$ does not exist in the strong formulation of QS, for which $j\cdot\nabla\psi$ is independent of the field line label.
A similar analysis applies to (46a). The appropriate Newcomb condition in this case is
$$\oint \frac{\mathrm{d}\ell}{B}\left[F_\psi + J^{-1}\left(B_\alpha' - \bar\iota' B_\theta\right)\right] = 0. \tag{48}$$
This condition may be understood as an averaged radial
equilibrium equation. A similar solvability condition can
be found, for the special case of MHS, in [4], where the
notion of a QS field is presented as one that satisfies the
Newcomb conditions. Given Eq. (48), Eq. (46a) relates
Bψ, Bθ (or Fα) and Fψ. Once again, we see the close
relationship between the forcing, the magnetic covariant
representation, and the deviation from axisymmetry. A $\varphi$ dependence of $B_\psi$ will bring a finite deviation of $u$ from being a Killing vector. However, it will also force $F_\alpha$ and $F_\psi$ to have a $\varphi$ dependence, which may require very particular shaping of the forces. This observation is consistent with the Constantin-Drivas-Ginsberg theorem23,
in which the forcing is seen to be intimately related to the
deviation from an isometry. Here the asymmetric geometry, quasisymmetry,
and the forcing are all intimately
connected.
When the magnetic differential equations imposing
force balance are brought together with the magnetic
equations from the previous section, it is not obvious
how the system of equations is to be interpreted: what
is to be taken as an input and what should be solved for.
Just as an analogy, in the Grad-Shafranov equation, it is
clear that p and F are inputs, and ψ is the output. In
the present problem, we have the construction of GBC
in addition to the various magnetic covariant forms and
the components of F. Motivated by the treatment in [3]
(which deals with a special case of the above), we propose as a
possibility for a well-posed problem (46b) to be
solved for Bθ given Fα, while Fψ is the output of (46a),
with the function Bψ specified from the magnetic equations. It is not a
trivial matter to determine a convenient
way in which to formulate the problem. A more elaborate discussion on
this procedure and its implications on
constructing solutions is left to future work.
A. Ideal MHD: j ×B = ∇p(ψ)
Let us now revisit the limit of ideal MHD without
flows, j ×B = ∇p. More general forms will be discussed
elsewhere, together with a more systematic treatment of
the quasisymmetric system of equations.
In ideal MHD, from j ·∇p(ψ) = 0 it follows that
B ·∇Bθ = 0. (49)
Thus taking Fα = 0 forces Bθ into a flux function. Of
course, this also means that u·∇Bθ = 0. Furthermore, as
Fψ = p′, in static ideal MHD, (43) leads to the magnetic
differential equation for Bψ,
$$B\cdot\nabla B_\psi = p'(\psi) + J^{-1}\left(B_\alpha' - \bar\iota' B_\theta\right). \tag{50}$$
Since $B_\psi$ must be a single-valued function, the flux-surface average of (50) gives
$$p'(\psi) + \frac{\langle B^2\rangle}{B_\alpha}\left(B_\alpha' - \bar\iota' B_\theta\right) = 0. \tag{51}$$
If we choose the forms of $B$, $p$, and $\iota$, this pins down the form of $B_\alpha$. Now, looking back at (50), every term on the right-hand side is quasisymmetric. Therefore, $B_\psi$ must also be quasisymmetric if it is to satisfy the force-balance equation. Note that we have already recognized this constraining requirement on the form of $B_\psi$ as the origin of the Garren-Boozer overdetermination problem14. The Newcomb condition on this equation can be recognized as the condition to avoid Pfirsch-Schlüter current singularities.
The simplifications due to ideal static MHD lead to the vanishing of (30). Therefore, in this limit weak QS
is identical to strong QS.
One can further show, using (29), (15), (23) and (50), that
$$j = -\,\frac{1}{\bar\iota}\,\partial_\psi(B_\alpha - \bar\iota B_\theta)\,B - \frac{1}{\bar\iota}\,p'(\psi)\,u, \tag{52}$$
where the gauge choice $\Phi' = \bar\iota$ has been made for $u$. The expression for $j$ ought to be $u$-gauge independent, as it is a physical quantity. The $B$ piece as written is gauge independent, but the $u$ term is not; the $\bar\iota$ factor in the latter
is to be interpreted as $\Phi'$. This equation has been obtained previously11,12 using coordinate-free differential forms (see Appendix C). Two special cases of ideal MHD, a) vacuum ($j = 0$) and b) force-free ($j = \lambda(\psi,\alpha)B$), are worth pointing to for their importance in plasma physics. For both of these cases $p'(\psi) = 0$. From (52) we see that for the magnetic field to be curl-free (vacuum) and quasisymmetric, $C'(\psi) = 0$; i.e., $C(\psi)$ must be a constant. For quasisymmetric force-free fields, we must have $\lambda = -C'(\psi)$.
Note that in strong QS these conclusions follow directly
from the equation j ×u+∇C = 0 with j = 0 and j = λB.
V. CONCLUSIONS
In this paper, we have presented, defined, and discussed a straight field
line coordinate system which is
natural for the representation of general-equilibria quasisymmetric
magnetic fields: generalized Boozer coordinates. We proved the existence
of the said coordinate system for the subset of fields for which
$\oint (j\cdot\nabla\psi)\,\mathrm{d}l/B = 0$, to
which quasisymmetric fields belong. These coordinates
reduce to Boozer coordinates when j ·∇ψ= 0.
The explicit form of the symmetry in this coordinate
representation enables a simple formulation of the quasisymmetric
problem. We explicitly construct the governing equations, clearly setting the foundation for future investigations, including expansion14,16 and global approaches. Exploiting GBC, we explicitly show the essential differences between the weak and strong formulations of QS, and between quasisymmetry and axisymmetry. Weak QS generally lies far from axisymmetry, which requires many of the functions describing the field and the symmetry to be themselves symmetric.
We also included a set of simple magnetic differential
equations that fully describe equilibrium with an arbitrary macroscopic
force to complete the treatment. The
property of QS, together with the force-balance structure, imposes
requirements on the forcing terms in the
form of Newcomb conditions. In addition, the equations
establish clear connections between QS, forcing, and departures from
axisymmetry.
ACKNOWLEDGEMENTS
We are grateful to J. Burby, N. Duigan, J. Meiss,
and D. Ginsburg for stimulating discussions. This research is supported
by grants from the Simons Foundation/SFARI (560651, AB) and DoE Contract No. DE-AC02-09CH11466.
DATA AVAILABILITY
Data sharing is not applicable to this article as no new
data were created or analyzed in this study.
Appendix A: Off-surface current and QS
In this appendix we directly show how the triple vector
formulation of QS in Eq. (1) imposes a constraint on the
off-normal component of the current at every magnetic
surface.
Let us start by noting that B ·∇B = f(ψ,B) is a
consequence of (1). In that sense, we can reshape the
triple vector condition in the convenient form
$$\nabla\psi\times\nabla B\cdot\nabla\!\left(\frac{B^2}{B\cdot\nabla B}\right) = 0. \tag{A1}$$
(For now we shall not worry about $B\cdot\nabla B = 0$.) Since $\nabla\cdot(\nabla\psi\times\nabla B) = 0$, this may be expressed as a total divergence. Separating the argument of the divergence into components parallel and perpendicular to the magnetic field, we may rewrite (A1) as
$$\nabla\cdot\left[\frac{\nabla\psi\times\nabla B\cdot B}{B\cdot\nabla B}\,B + \nabla\psi\times B\right] = 0. \tag{A2}$$
It is customary to define $C = \nabla\psi\times\nabla B\cdot B/(B\cdot\nabla B)$. Equation (A2) can therefore be rewritten, using $\nabla\cdot B = 0$, as
$$B\cdot\nabla C = j\cdot\nabla\psi. \tag{A3}$$
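The step leading to (A3) is short and is spelled out here for clarity (it only uses $\nabla\cdot B = 0$ and $j = \nabla\times B$):
$$\nabla\cdot\big[C\,B + \nabla\psi\times B\big] = B\cdot\nabla C + \nabla\cdot(\nabla\psi\times B) = B\cdot\nabla C - \nabla\psi\cdot(\nabla\times B) = B\cdot\nabla C - j\cdot\nabla\psi = 0.$$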
Since $C$ is single-valued, it is then clear that the Newcomb condition20
$$\oint (j\cdot\nabla\psi)\,\frac{\mathrm{d}l}{B} = 0 \tag{A4}$$
must hold, with the line integral taken along closed magnetic field lines. For irrational field lines, this amounts to $\langle j\cdot\nabla\psi\rangle = 0$, which is true of all magnetic fields with flux surfaces $B\cdot\nabla\psi = 0$. The condition (A4) on closed field lines is more constraining than the flux-surface-average condition20.
The division by $B\cdot\nabla B$ seems to be ill-defined at the extrema of the magnetic field along field lines. However, the above holds arbitrarily close to these extrema, and thus we expect it to hold there by continuity; that the condition holds can also be seen from the fundamental formulation of QS.3 As for the behavior of $C$, this is a physical quantity related to the width of the banana orbits of bouncing particles. If deeply trapped and barely trapped particles are to be confined, then physically $C$ must remain finite at the extrema. Once we express everything in terms of GBC, the apparent divergence of $C$ disappears, as both the numerator and the denominator are proportional to $\partial_\chi B$, which vanishes when $B\cdot\nabla B = 0$.
Appendix B: Gauge freedom in weak u
As noted, the definition of u by the weak quasisymmetry equations
(19)-(21) is invariant to the relabelling of
Φ. A rescaling of the flux surface label together with u
leaves the equations invariant, and should therefore have
no physical implication for the description of quasisymmetry. Keeping the general gauge label $\Phi(\psi)$ of (20), Eq. (23) reads
$$u = \frac{\nabla\Phi\times\nabla\chi}{B\cdot\nabla\chi}. \tag{B1}$$
It then follows that
$$C = \frac{\Phi'}{\bar\iota}\,(B_\alpha - \bar\iota B_\theta) \tag{B2}$$
and
$$u\cdot\nabla = \frac{\Phi'}{\bar\iota}\,\partial_\varphi. \tag{B3}$$
In Sec. III A we then went on to evaluate how the symmetry vector u as
defined by weak QS compares to strong
QS. To do so, we needed to evaluate (30), the third of
the strong QS conditions. The first thing to note is
that this equation is not gauge-invariant. This is, of
course, an added complication to the problem. In a sense,
this gauge-symmetry breaking leaves no unique form for
j ×u + ∇C, but rather a whole family.
This can be seen explicitly if, instead of restricting ourselves to the special case $\Phi' = \bar\iota$, we keep $\Phi$ general.
Then, (30) can be written in the form
$$j\times u + \nabla C = \left[u\cdot\nabla B_\psi - \frac{C\Phi'}{\bar\iota}\,\partial_\psi\!\left(\frac{\bar\iota}{\Phi'}\right)\right]\nabla\psi + (u\cdot\nabla B_\theta)(\nabla\chi - \bar\iota\nabla\varphi). \tag{B4}$$
So the question may be reformulated: what are the conditions for this
expression to vanish? As before one piece
is $u\cdot\nabla B_\theta = 0$. Now, the other component has the form of a differential equation along the direction of symmetry. We may write it explicitly as
$$\partial_\varphi B_\psi - (B_\alpha - \bar\iota B_\theta)\,\partial_\psi\ln(\bar\iota/\Phi') = 0. \tag{B5}$$
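The step from (B4) to (B5) can be made explicit (a short derivation added here; it uses only (B2) and (B3)): setting the $\nabla\psi$ component of (B4) to zero and using $u\cdot\nabla = (\Phi'/\bar\iota)\,\partial_\varphi$,
$$\frac{\Phi'}{\bar\iota}\,\partial_\varphi B_\psi = \frac{C\Phi'}{\bar\iota}\,\partial_\psi\!\left(\frac{\bar\iota}{\Phi'}\right)
\quad\Longrightarrow\quad
\partial_\varphi B_\psi = \frac{\Phi'}{\bar\iota}\,(B_\alpha - \bar\iota B_\theta)\,\partial_\psi\!\left(\frac{\bar\iota}{\Phi'}\right) = (B_\alpha - \bar\iota B_\theta)\,\partial_\psi\ln\!\left(\frac{\bar\iota}{\Phi'}\right),$$
which is (B5).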
For the equation to be consistent it must then hold that either $\bar\iota = a\Phi'$, where $a$ is a constant, or $\int (B_\alpha - \bar\iota B_\theta)\,\mathrm{d}\varphi = 0$. The latter is a condition on the physically relevant part of the problem. Thus, the minimal choice that satisfies this condition is $\Phi' = \bar\iota$. Then we are simply left with $u\cdot\nabla B_\psi = 0$, which is the choice made in the main text.
Appendix C: Coordinate independent form of
quasisymmetric current
In this appendix we present a brief derivation of (52)
without using coordinates. Start from
$$B\times u = \nabla\Phi, \tag{C1}$$
with $\Phi$ some single-valued function. Magnetic flux surfaces are labelled by $\Phi$, and both $B$ and the vector field $u$ are tangent to them. From this condition, one may also write the symmetry vector field in the following form:
$$u = \frac{1}{|B|^2}\left(C\,B - B\times\nabla\Phi\right). \tag{C2}$$
Here C = B ·u is some arbitrary function of space, as
the constraint equation only describes the perpendicular
part of u.
In magnetostatic equilibrium we have
$$j\times B = \nabla p, \tag{C3}$$
where $j = \nabla\times B$ is the plasma current density. Because $B\cdot\nabla p = 0$, it follows that $p = p(\psi)$, again assuming irrational field lines. By continuity, we argue that this must be true for the rational surfaces as well (provided the rotational transform is not constant, $\iota' \neq 0$).
Because the plasma pressure is a flux function, the current density $j$ also lies on magnetic flux surfaces. Furthermore, requiring $\nabla p$ to be nowhere vanishing (except on the axis), the magnetic field and the current are guaranteed to be non-collinear. Therefore, at every point on a flux surface we may define a tangent plane spanned by $j$ and $B$.
Because $u$ also lies in this subspace, we can express it in the chosen basis as
$$u = a\, j + \tilde\beta\, B.$$
Taking the cross product of this form of $u$ with $B$, and making use of (C1) and (C3), it is clear that $a = a(\psi) = -\Phi'/p'$, where the prime denotes a derivative with respect to the flux $\psi$.
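The identification of $a$ can be made explicit (a one-line check added here, using only (C1) and (C3)): crossing $u = a\,j + \tilde\beta\,B$ with $B$,
$$u\times B = a\,(j\times B) = a\,\nabla p = a\,p'\,\nabla\psi, \qquad u\times B = -\,B\times u = -\nabla\Phi = -\,\Phi'\,\nabla\psi,$$
so $a\,p' = -\Phi'$, i.e. $a = a(\psi) = -\Phi'/p'$.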
Let us now apply (21), namely $\nabla\cdot u = 0$, to our form of $u$:
$$\nabla\cdot u = B\cdot\nabla\tilde\beta = 0 \;\;\Longrightarrow\;\; \tilde\beta = \tilde\beta(\psi),$$
so that
$$j = -\frac{p'(\psi)}{\Phi'}\,u + \beta(\psi)\, B, \tag{C4}$$
where $\beta \equiv \tilde\beta\, p'/\Phi'$ is again a flux function.
Using the form (C2), we can then rewrite the expression
for j in the form,
$$j = \left(\beta(\psi) - \frac{p'(\psi)\, C}{B^2\,\Phi'}\right) B + \frac{p'(\psi)}{B^2}\, B\times\nabla\psi. \tag{C5}$$
The gauge-independent combination $C/\Phi' = (B\cdot\nabla\psi\times\nabla B)/(B\cdot\nabla B)$ here is the same as that appearing in (25), and it is a flux function. There we found a closed form for this combination in terms of the covariant representation of $B$ in GBC. Equation (C5), together with $\nabla\cdot B = 0$ and $B\cdot\nabla\psi = 0$, constitutes the whole set of governing equations for the QS, MHS field. The problem in this
case can be formulated by providing β, C/Φ′ and p as
input flux functions and solving for B.
Assuming that we have strong QS, $\beta = -C'/\Phi'$, and we obtain an expression equivalent to (52).
The assumption of strong QS was, however, not necessary when working
with the coordinate representation.
The adoption of some form of covariant representation
of B seems to be necessary to obtain such a relation.
To that end, GBC is convenient, though not necessary.
In fact, [11] employs the B♭ form (the generalisation of
the covariant representation of the magnetic field), without specifying
coordinates, to prove (52). Note, however,
that the difference in the form of β is merely a matter of
how the flux function is defined. The problem has β or
$\Phi'$ as free flux-function inputs.
1J. Nührenberg and R. Zille, Physics Letters A 129, 113 (1988).
2A. H. Boozer, The Physics of Fluids 24, 1999 (1981).
3E. Rodríguez, P. Helander, and A. Bhattacharjee, Physics of
Plasmas 27, 062501 (2020).
4M. Tessarotto, J. L. Johnson, and L. J. Zheng, Physics of Plasmas 2,
4499 (1995).
5M. Tessarotto, J. L. Johnson, R. B. White, and L. Zheng, Physics
of Plasmas 3, 2653 (1996).
6A. H. Boozer, Physics of Fluids 26, 496 (1983).
7D. A. Garren and A. H. Boozer, Physics of Fluids B: Plasma
Physics 3, 2822 (1991).
8M. Landreman and W. Sengupta, Journal of Plasma Physics 84,
905840616 (2018).
9A. Bader, M. Drevlak, D. T. Anderson, B. J. Faber, C. C. Hegna,
K. M. Likin, J. C. Schmitt, and J. N. Talmadge, Journal of
Plasma Physics 85, 905850508 (2019).
10S. Henneberg, M. Drevlak, C. Nührenberg, C. Beidler,
Y. Turkin, J. Loizu, and P. Helander, Nuclear Fusion 59, 026014
(2019).
11J. W. Burby, N. Kallinikos, and R. S. MacKay, Journal of Physics
A: Mathematical and Theoretical 54, 125202 (2021).
12J. W. Burby, N. Kallinikos, and R. S. MacKay, Journal of Mathematical
Physics 61, 093503 (2020).
13J. W. Burby and H. Qin, Physics of Plasmas 20, 012511 (2013).
14E. Rodríguez and A. Bhattacharjee, Physics of Plasmas 28,
012508 (2021).
15D. A. Garren and A. H. Boozer, Physics of Fluids B: Plasma
Physics 3, 2805 (1991).
16E. Rodríguez and A. Bhattacharjee, Physics of Plasmas 28,
012509 (2021).
17M. D. Kruskal and R. M. Kulsrud, The Physics of Fluids 1, 265
(1958).
18P. Helander, Reports on Progress in Physics 77, 087001 (2014).
19A more general form J′ = J′(ψ,B) could have been demanded.
However, one may show in that case that the system may always
be cast into the form here employed.
20W. A. Newcomb, The Physics of Fluids 2, 362 (1959).
21W. Sengupta, E. J. Paul, H. Weitzner, and A. Bhattacharjee,
Journal of Plasma Physics 87, 905870205 (2021).
22E. J. Paul, T. Antonsen, M. Landreman, and W. A. Cooper,
Journal of Plasma Physics 86, 905860103 (2020).
23P. Constantin, T. D. Drivas, and D. Ginsberg, Journal of Plasma
Physics 87, 905870111 (2021).
- >-
[ABSTRACT]
Real-world time-series datasets are often multivariate with complex
dynamics. To capture
this complexity, high capacity architectures like recurrent- or
attention-based sequential
deep learning models have become popular. However, recent work
demonstrates that simple
univariate linear models can outperform such deep learning models on
several commonly
used academic benchmarks. Extending them, in this paper, we investigate
the capabilities of
linear models for time-series forecasting and present Time-Series Mixer
(TSMixer), a novel
architecture designed by stacking multi-layer perceptrons (MLPs).
TSMixer is based on
mixing operations along both the time and feature dimensions to extract information efficiently.
On popular academic benchmarks, the simple-to-implement TSMixer is
comparable to
specialized state-of-the-art models that leverage the inductive biases
of specific benchmarks.
On the challenging and large scale M5 benchmark, a real-world retail
dataset, TSMixer
demonstrates superior performance compared to the state-of-the-art
alternatives. Our results
underline the importance of efficiently utilizing cross-variate and
auxiliary information for
improving the performance of time series forecasting. We present various
analyses to shed light on the capabilities of TSMixer. The design paradigms utilized in
TSMixer are expected
to open new horizons for deep learning-based time series forecasting.
The implementation
is available at: https://github.com/google-research/google-research/tree/master/tsmixer.
[1 Introduction]
Time series forecasting is a prevalent problem in numerous real-world
use cases, such as for forecasting of
demand of products (Böse et al., 2017; Courty & Li, 1999), pandemic
spread (Zhang & Nawata, 2018), and
inflation rates (Capistrán et al., 2010). The forecastability of time
series data often originates from three
major aspects:
Figure 1: TSMixer for multivariate time series forecasting. The columns of the input are different features/variates and the rows are time steps. The fully-connected operations are row-wise. TSMixer contains interleaving time-mixing and feature-mixing MLPs to aggregate information. The number of mixer layers is denoted as N. The time-mixing MLPs are shared across all features and the feature-mixing MLPs are shared across all of the time steps. This design allows TSMixer to automatically adapt its use of both temporal and cross-variate information with a limited number of parameters for superior generalization. The extension with auxiliary information is also explored in this paper.
• Persistent temporal patterns: encompassing trends and seasonal
patterns, e.g., long-term inflation,
day-of-week effects;
• Cross-variate information: correlations between different variables,
e.g., an increase in blood pressure
associated with a rise in body weight;
• Auxiliary features: comprising static features and future information,
e.g., product categories and
promotional events.
Traditional models, such as ARIMA (Box et al., 1970), are designed for
univariate time series, where only
temporal information is available. Therefore, they face limitations when
dealing with challenging real-world
data, which often contains complex cross-variate information and
auxiliary features. In contrast, numerous
deep learning models, particularly Transformer-based models, have been
proposed due to their capacity to
capture both complex temporal patterns and cross-variate dependencies
(Gamboa, 2017; Li et al., 2019; Wen
et al., 2017; Zhou et al., 2021; Wu et al., 2021; Lim & Zohren, 2021;
Liu et al., 2022a; Zhou et al., 2022b; Liu
et al., 2022b; Zhou et al., 2022a).
The natural intuition is that multivariate models, such as those based
on Transformer architectures, should be
more effective than univariate models due to their ability to leverage
cross-variate information. However, Zeng
et al. (2023) revealed that this is not always the case –
Transformer-based models can indeed be significantly
worse than simple univariate temporal linear models on many commonly
used forecasting benchmarks. The
multivariate models seem to suffer from overfitting especially when the
target time series is not correlated
with other covariates. This surprising finding has raised two essential
questions:
1. Does cross-variate information truly provide a benefit for time
series forecasting?
2. When cross-variate information is not beneficial, can multivariate
models still perform as well as
univariate models?
To address these questions, we begin by analyzing the effectiveness of
temporal linear models. Our findings
indicate that their time-step-dependent characteristics render temporal
linear models great candidates for
learning temporal patterns under common assumptions. Consequently, we
gradually increase the capacity of
linear models by
1. stacking temporal linear models with non-linearities (TMix-Only),
2. introducing cross-variate feed-forward layers (TSMixer).
The resulting TSMixer alternately applies MLPs across the time and feature dimensions, conceptually corresponding to time-mixing and feature-mixing operations, efficiently capturing both temporal patterns and
cross-variate information, as illustrated in Fig. 1. The residual
designs ensure that TSMixer retains the
capacity of temporal linear models while still being able to exploit
cross-variate information.
We evaluate TSMixer on commonly used long-term forecasting datasets (Wu
et al., 2021) where univariate
models have outperformed multivariate models. Our ablation study
demonstrates the effectiveness of stacking
temporal linear models and validates that cross-variate information is
less beneficial on these popular datasets,
explaining the superior performance of univariate models. Even so,
TSMixer is on par with state-of-the-art
univariate models and significantly outperforms other multivariate
models.
To demonstrate the benefit of multivariate models, we further evaluate
TSMixer on the challenging M5
benchmark, a large-scale retail dataset used in the M-competition
(Makridakis et al., 2022). M5 contains crucial
cross-variate interactions such as sell prices (Makridakis et al.,
2022). The results show that cross-variate
information indeed brings significant improvement, and TSMixer can
effectively leverage this information.
Furthermore, we propose a principled design to extend TSMixer to handle auxiliary information such as static features and future time-varying features. It aligns the different types of features into the same shape and then applies mixer layers on the concatenated features to leverage the interactions between them. In this more
practical and challenging setting, TSMixer outperforms models that are
popular in industrial applications,
including DeepAR (Salinas et al. 2020, Amazon SageMaker) and TFT (Lim et
al. 2021, Google Cloud Vertex),
demonstrating its strong potential for real world impact.
We summarize our contributions as follows:
• We analyze the effectiveness of state-of-the-art linear models and indicate that their time-step-dependent characteristics make them great
candidates for learning temporal patterns under common
assumptions.
Table 1: Recent works in time series forecasting. Category I is
univariate time series forecasting; Category II
is multivariate time series forecasting, and Category III is time series
forecasting with auxiliary information.
In this work, we propose TSMixer for Category II. We also extend TSMixer
to leverage auxiliary information
including static and future time-varying features for Category III.
Category I (extrapolating temporal patterns): ARIMA (Box et al., 1970); N-BEATS (Oreshkin et al., 2020); LTSF-Linear (Zeng et al., 2023); PatchTST (Nie et al., 2023).
Category II (temporal patterns + cross-variate information, i.e. multivariateness): Informer (Zhou et al., 2021); Autoformer (Wu et al., 2021); Pyraformer (Liu et al., 2022a); FEDformer (Zhou et al., 2022b); NS-Transformer (Liu et al., 2022b); FiLM (Zhou et al., 2022a); TSMixer (this work).
Category III (temporal patterns + cross-variate information + auxiliary features): MQRNN (Wen et al., 2017); DSSM (Rangapuram et al., 2018); DeepAR (Salinas et al., 2020); TFT (Lim et al., 2021); TSMixer-Ext (this work).
• We propose TSMixer, an innovative architecture which retains the
capacity of linear models to
capture temporal patterns while still being able to exploit
cross-variate information.
• We point out the potential risk of evaluating multivariate models on
common long-term forecasting
benchmarks.
• Our empirical studies demonstrate that TSMixer is the first
multivariate model which is on par with
univariate models on common benchmarks and achieves state-of-the-art on
a large-scale industrial
application where cross-variate information is crucial.
[4 Tsmixer Architecture]
Expanding upon our finding that linear models can serve as strong
candidates for capturing time dependencies,
we initially propose a natural enhancement by stacking linear models
with non-linearities to form multi-layer
perceptrons (MLPs). Common deep learning techniques, such as
normalization and residual connections, are
applied to facilitate efficient learning. However, this architecture
does not take cross-variate information into
account.
To better leverage cross-variate information, we propose the application
of MLPs in the time-domain and
the feature-domain in an alternating manner. The time-domain MLPs are
shared across all of the features,
while the feature-domain MLPs are shared across all of the time steps.
This resulting model is akin to the
MLP-Mixer architecture from computer vision (Tolstikhin et al., 2021),
with time-domain and feature-domain
Figure 2: Illustrations of time-step-dependent models, $x_{t+1} = \sum_{i=1}^{t} w_i x_i$, and data-dependent models, $x_{t+1} = \sum_{i=1}^{t} f_i(x)\, x_i$, within a single forecasting time step.
Figure 3: The architecture of TMix-Only. It is similar to TSMixer but
only applies time-mixing.
operations representing time-mixing and feature-mixing operations,
respectively. Consequently, we name our
proposed architecture Time-Series Mixer (TSMixer).
The interleaving design between these two operations efficiently
utilizes both temporal dependencies and
cross-variate information while limiting computational complexity and
model size. It allows TSMixer to use a
long lookback window (see Sec. 3), while maintaining the parameter growth in only O(L + C) instead of O(LC) if fully-connected MLPs were used. To better understand the
utility of cross-variate information and
feature-mixing, we also consider a simplified variant of TSMixer that
only employs time-mixing, referred to
as TMix-Only, which consists of a residual MLP shared across each
variate, as illustrated in Fig. 3. We also
present the extension of TSMixer to scenarios where auxiliary
information about the time series is available.
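As an illustration of the interleaving design described above, the following is a minimal PyTorch-style sketch of a single mixing block; the module name MixerBlock, the layer sizes, and the hyperparameters are placeholder assumptions for this example, not the authors' released implementation:

```python
# Minimal sketch of alternating time-mixing / feature-mixing MLPs with
# normalization and residual connections. Names and sizes are illustrative.
import torch
import torch.nn as nn


class MixerBlock(nn.Module):
    def __init__(self, lookback: int, n_features: int, hidden: int = 64):
        super().__init__()
        # Time-mixing MLP: shared across all features, mixes along the time axis.
        self.time_mlp = nn.Sequential(nn.Linear(lookback, lookback), nn.ReLU())
        # Feature-mixing MLP: shared across all time steps, mixes along the feature axis.
        self.feat_mlp = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, n_features)
        )
        self.norm_time = nn.LayerNorm(n_features)
        self.norm_feat = nn.LayerNorm(n_features)

    def forward(self, x):  # x: (batch, lookback L, features C)
        # Time mixing with a residual connection.
        y = self.time_mlp(self.norm_time(x).transpose(1, 2)).transpose(1, 2)
        x = x + y
        # Feature mixing with a residual connection.
        y = self.feat_mlp(self.norm_feat(x))
        return x + y


x = torch.randn(8, 96, 7)           # batch of 8 series, lookback 96, 7 variates
out = MixerBlock(96, 7)(x)          # output has the same shape as the input
print(out.shape)                    # torch.Size([8, 96, 7])
```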
[6 Conclusions]
We propose TSMixer, a novel architecture for time series forecasting
that is designed using MLPs instead of
commonly used RNNs and attention mechanisms to obtain superior
generalization with a simple architecture.
Our results on a wide range of real-world time series forecasting tasks demonstrate that TSMixer is highly effective both on long-term forecasting benchmarks for multivariate time series and on a real-world large-scale retail demand forecasting task. Notably, TSMixer is the only multivariate model that achieves performance comparable to univariate models on long-term time series forecasting benchmarks. The TSMixer
architecture has significant potential for further improvement and we
believe it will be useful in a wide
range of time series forecasting tasks. Some of the potential future
works include further exploring the
interpretability of TSMixer, as well as its scalability to even larger
datasets. We hope this work will pave the
way for more innovative architectures for time series forecasting.
- >-
[ABSTRACT]
Low-dimensional embeddings of nodes in large graphs have proved
extremely
useful in a variety of prediction tasks, from content recommendation to
identifying
protein functions. However, most existing approaches require that all
nodes in the
graph are present during training of the embeddings; these previous
approaches are
inherently transductive and do not naturally generalize to unseen nodes.
Here we
present GraphSAGE, a general inductive framework that leverages node
feature
information (e.g., text attributes) to efficiently generate node
embeddings for
previously unseen data. Instead of training individual embeddings for
each node,
we learn a function that generates embeddings by sampling and
aggregating features
from a node’s local neighborhood. Our algorithm outperforms strong
baselines
on three inductive node-classification benchmarks: we classify the
category of
unseen nodes in evolving information graphs based on citation and Reddit
post
data, and we show that our algorithm generalizes to completely unseen
graphs
using a multi-graph dataset of protein-protein interactions.
[1 Introduction]
Low-dimensional vector embeddings of nodes in large graphs 1 have proved
extremely useful as
feature inputs for a wide variety of prediction and graph analysis tasks
[5, 11, 28, 35, 36]. The basic
idea behind node embedding approaches is to use dimensionality reduction
techniques to distill the
high-dimensional information about a node’s graph neighborhood into a
dense vector embedding.
These node embeddings can then be fed to downstream machine learning
systems and aid in tasks
such as node classification, clustering, and link prediction [11, 28,
35].
However, previous works have focused on embedding nodes from a single
fixed graph, and many
real-world applications require embeddings to be quickly generated for
unseen nodes, or entirely new
(sub)graphs. This inductive capability is essential for high-throughput,
production machine learning
systems, which operate on evolving graphs and constantly encounter
unseen nodes (e.g., posts on
Reddit, users and videos on Youtube). An inductive approach to
generating node embeddings also
facilitates generalization across graphs with the same form of features:
for example, one could train
an embedding generator on protein-protein interaction graphs derived
from a model organism, and
then easily produce node embeddings for data collected on new organisms
using the trained model.
The inductive node embedding problem is especially difficult, compared to
the transductive setting,
because generalizing to unseen nodes requires “aligning” newly observed
subgraphs to the node
embeddings that the algorithm has already optimized on. An inductive
framework must learn to recognize structural properties of a node’s neighborhood that reveal both the node’s local role in the graph, as well as its global position.
∗The two first authors made equal contributions.
1While it is common to refer to these data structures as social or biological networks, we use the term graph to avoid ambiguity with neural network terminology.
Figure 1: Visual illustration of the GraphSAGE sample and aggregate approach.
Most existing approaches to generating node embeddings are inherently
transductive. The majority
of these approaches directly optimize the embeddings for each node using
matrix-factorization-based
objectives, and do not naturally generalize to unseen data, since they
make predictions on nodes in a
single, fixed graph [5, 11, 23, 28, 35, 36, 37, 39]. These approaches can
be modified to operate in an
inductive setting (e.g., [28]), but these modifications tend to be
computationally expensive, requiring
additional rounds of gradient descent before new predictions can be
made. There are also recent
approaches to learning over graph structures using convolution operators
that offer promise as an
embedding methodology [17]. So far, graph convolutional networks (GCNs)
have only been applied
in the transductive setting with fixed graphs [17, 18]. In this work we
both extend GCNs to the task
of inductive unsupervised learning and propose a framework that
generalizes the GCN approach to
use trainable aggregation functions (beyond simple convolutions).
Present work. We propose a general framework, called GraphSAGE (SAmple
and aggreGatE), for
inductive node embedding. Unlike embedding approaches that are based on
matrix factorization,
we leverage node features (e.g., text attributes, node profile
information, node degrees) in order to
learn an embedding function that generalizes to unseen nodes. By
incorporating node features in the
learning algorithm, we simultaneously learn the topological structure of
each node’s neighborhood
as well as the distribution of node features in the neighborhood. While
we focus on feature-rich
graphs (e.g., citation data with text attributes, biological data with
functional/molecular markers), our
approach can also make use of structural features that are present in
all graphs (e.g., node degrees).
Thus, our algorithm can also be applied to graphs without node features.
Instead of training a distinct embedding vector for each node, we train
a set of aggregator functions
that learn to aggregate feature information from a node’s local
neighborhood (Figure 1). Each
aggregator function aggregates information from a different number of
hops, or search depth, away
from a given node. At test, or inference time, we use our trained system
to generate embeddings for
entirely unseen nodes by applying the learned aggregation functions.
Following previous work on
generating node embeddings, we design an unsupervised loss function that
allows GraphSAGE to be
trained without task-specific supervision. We also show that GraphSAGE
can be trained in a fully
supervised manner.
We evaluate our algorithm on three node-classification benchmarks, which
test GraphSAGE’s ability
to generate useful embeddings on unseen data. We use two evolving
document graphs based on
citation data and Reddit post data (predicting paper and post
categories, respectively), and a multigraph generalization experiment
based on a dataset of protein-protein interactions (predicting protein
functions). Using these benchmarks, we show that our approach is able to
effectively generate
representations for unseen nodes and outperform relevant baselines by a
significant margin: across
domains, our supervised approach improves classification F1-scores by an
average of 51% compared
to using node features alone and GraphSAGE consistently outperforms a
strong, transductive baseline
[28], despite this baseline taking ∼100×longer to run on unseen nodes.
We also show that the new
aggregator architectures we propose provide significant gains (7.4% on
average) compared to an
aggregator inspired by graph convolutional networks [17]. Lastly, we
probe the expressive capability
of our approach and show, through theoretical analysis, that GraphSAGE
is capable of learning
structural information about a node’s role in a graph, despite the fact
that it is inherently based on
features (Section 5).
[3.3 Aggregator Architectures]
Unlike machine learning over N-D lattices (e.g., sentences, images, or
3-D volumes), a node’s
neighbors have no natural ordering; thus, the aggregator functions in
Algorithm 1 must operate over
an unordered set of vectors. Ideally, an aggregator function would be
symmetric (i.e., invariant to
permutations of its inputs) while still being trainable and maintaining
high representational capacity.
The symmetry property of the aggregation function ensures that our
neural network model can
be trained and applied to arbitrarily ordered node neighborhood feature
sets. We examined three
candidate aggregator functions:
Mean aggregator. Our first candidate aggregator function is the mean
operator, where we simply
take the elementwise mean of the vectors in $\{h_u^{k-1}, \forall u \in \mathcal{N}(v)\}$. The mean aggregator is nearly equivalent to the convolutional propagation rule used in the transductive GCN framework [17]. In particular, we can derive an inductive variant of the GCN approach by replacing lines 4 and 5 in Algorithm 1 with the following:4
$$h_v^k \leftarrow \sigma\left(W \cdot \mathrm{MEAN}\left(\{h_v^{k-1}\} \cup \{h_u^{k-1}, \forall u \in \mathcal{N}(v)\}\right)\right). \quad (2)$$
We call this modified mean-based aggregator convolutional since it is a rough, linear approximation of a localized spectral convolution [17]. An important distinction between this convolutional aggregator and our other proposed aggregators is that it does not perform the concatenation operation in line 5 of Algorithm 1—i.e., the convolutional aggregator does not concatenate the node’s previous layer representation $h_v^{k-1}$ with the aggregated neighborhood vector $h_{\mathcal{N}(v)}^k$. This concatenation can be viewed as a simple form of a “skip connection” [13] between the different “search depths”, or “layers”, of the GraphSAGE algorithm, and it leads to significant gains in performance (Section 4).
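To make Eq. (2) concrete, here is a minimal NumPy sketch of one inductive mean-based propagation step; the function name mean_aggregate and the toy graph are illustrative assumptions, not the paper's reference implementation:

```python
# Sketch of the convolutional (mean-based) aggregator from Eq. (2):
# h_v^k = sigma(W . MEAN({h_v^{k-1}} U {h_u^{k-1}, u in N(v)})).
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def mean_aggregate(h_prev: np.ndarray, neighbors: dict, W: np.ndarray) -> np.ndarray:
    """h_prev: (num_nodes, d_in) previous-layer representations; returns (num_nodes, d_out)."""
    h_next = np.zeros((h_prev.shape[0], W.shape[0]))
    for v, nbrs in neighbors.items():
        # Average the node's own previous representation with those of its neighbors.
        stacked = np.vstack([h_prev[v]] + [h_prev[u] for u in nbrs])
        h_next[v] = relu(W @ stacked.mean(axis=0))
    return h_next


# Toy example: 3 nodes with 4-dimensional features, projected to 2 dimensions.
h0 = np.random.randn(3, 4)
adj = {0: [1, 2], 1: [0], 2: [0]}
W = np.random.randn(2, 4)
print(mean_aggregate(h0, adj, W).shape)  # (3, 2)
```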
LSTM aggregator. We also examined a more complex aggregator based on an
LSTM architecture
[14]. Compared to the mean aggregator, LSTMs have the advantage of
larger expressive capability.
However, it is important to note that LSTMs are not inherently symmetric
(i.e., they are not permutation invariant), since they process their
inputs in a sequential manner. We adapt LSTMs to operate on
an unordered set by simply applying the LSTMs to a random permutation of
the node’s neighbors.
3Exploring non-uniform samplers is an important direction for future
work.
4Note that this differs from Kipf et al’s exact equation by a minor
normalization constant [17].
Pooling aggregator. The final aggregator we examine is both symmetric and
trainable. In this
pooling approach, each neighbor’s vector is independently fed through a
fully-connected neural
network; following this transformation, an elementwise max-pooling
operation is applied to aggregate
information across the neighbor set:
$$\mathrm{AGGREGATE}_k^{\mathrm{pool}} = \max\left(\{\sigma\left(W_{\mathrm{pool}} h_{u_i}^k + b\right), \forall u_i \in \mathcal{N}(v)\}\right), \quad (3)$$
where max denotes the element-wise max operator and σ is a nonlinear
activation function. In
principle, the function applied before the max pooling can be an
arbitrarily deep multi-layer perceptron, but we focus on simple
single-layer architectures in this work. This approach is inspired by
recent advancements in applying neural network architectures to learn
over general point sets [29].
Intuitively, the multi-layer perceptron can be thought of as a set of
functions that compute features for
each of the node representations in the neighbor set. By applying the
max-pooling operator to each of
the computed features, the model effectively captures different aspects
of the neighborhood set. Note
also that, in principle, any symmetric vector function could be used in
place of the max operator
(e.g., an element-wise mean). We found no significant difference between
max- and mean-pooling in development tests and thus focused on max-pooling for the rest of our experiments.
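For illustration, a minimal sketch of the max-pooling aggregator in Eq. (3), assuming a single-layer perceptron per neighbor and a ReLU non-linearity; names such as pool_aggregate and W_pool are placeholders rather than the paper's code:

```python
# Sketch of the pooling aggregator from Eq. (3): each neighbor vector is passed
# through a small fully-connected layer, then element-wise max-pooled.
import numpy as np


def pool_aggregate(h_neighbors: np.ndarray, W_pool: np.ndarray, b: np.ndarray) -> np.ndarray:
    """h_neighbors: (num_neighbors, d_in) -> pooled vector of shape (d_hidden,)."""
    transformed = np.maximum(h_neighbors @ W_pool.T + b, 0.0)  # sigma = ReLU here
    return transformed.max(axis=0)                             # element-wise max over neighbors


neigh = np.random.randn(5, 8)                  # 5 neighbors, 8-dim representations
W_pool = np.random.randn(16, 8)
b = np.zeros(16)
print(pool_aggregate(neigh, W_pool, b).shape)  # (16,)
```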
[6 Conclusion]
We introduced a novel approach that allows embeddings to be efficiently
generated for unseen nodes.
GraphSAGE consistently outperforms state-of-the-art baselines,
effectively trades off performance
and runtime by sampling node neighborhoods, and our theoretical analysis
provides insight into
how our approach can learn about local graph structures. A number of
extensions and potential
improvements are possible, such as extending GraphSAGE to incorporate
directed or multi-modal
graphs. A particularly interesting direction for future work is
exploring non-uniform neighborhood
sampling functions, and perhaps even learning these functions as part of
the GraphSAGE optimization.
Acknowledgments
The authors thank Austin Benson, Aditya Grover, Bryan He, Dan Jurafsky,
Alex Ratner, Marinka
Zitnik, and Daniel Selsam for their helpful discussions and comments on
early drafts. The authors
would also like to thank Ben Johnson for his many useful questions and
comments on our code and
Nikhil Mehta and Yuhui Ding for catching some minor errors in a previous
version of the appendix.
This research has been supported in part by NSF IIS-1149837, DARPA
SIMPLEX, Stanford Data
Science Initiative, Huawei, and Chan Zuckerberg Biohub. WLH was also
supported by the SAP
Stanford Graduate Fellowship and an NSERC PGS-D grant. The views and
conclusions expressed
in this material are those of the authors and should not be interpreted
as necessarily representing
the official policies or endorsements, either expressed or implied, of
the above funding agencies,
corporations, or the U.S. and Canadian governments.
- source_sentence: How can I use transformer models for detailed image analysis?
sentences:
- >-
[ABSTRACT]
We consider matrix completion for recommender systems from the point of
view of
link prediction on graphs. Interaction data
such as movie ratings can be represented by a
bipartite user-item graph with labeled edges
denoting observed ratings. Building on recent
progress in deep learning on graph-structured
data, we propose a graph auto-encoder framework based on differentiable
message passing
on the bipartite interaction graph. Our model
shows competitive performance on standard
collaborative filtering benchmarks. In settings
where complementary feature information or
structured data such as a social network is
available, our framework outperforms recent
state-of-the-art methods.
[1 Introduction]
With the explosive growth of e-commerce and social
media platforms, recommendation algorithms have become indispensable
tools for many businesses. Two
main branches of recommender algorithms are often
distinguished: content-based recommender systems [24]
and collaborative filtering models [9]. Content-based
recommender systems use content information of users
and items, such as their respective occupation and
genre, to predict the next purchase of a user or rating of an item.
Collaborative filtering models solve
the matrix completion task by taking into account the
collective interaction data to predict future ratings or
purchases.
In this work, we view matrix completion as a link prediction problem on
graphs: the interaction data in
collaborative filtering can be represented by a bipartite graph between
user and item nodes, with observed
ratings/purchases represented by links. Content information can
naturally be included in this framework
1Canadian Institute for Advanced Research
in the form of node features. Predicting ratings then
reduces to predicting labeled links in the bipartite user-item graph.
We propose graph convolutional matrix completion
(GC-MC): a graph-based auto-encoder framework for
matrix completion, which builds on recent progress
in deep learning on graphs [2, 6, 19, 5, 15, 30, 14].
The auto-encoder produces latent features of user and
item nodes through a form of message passing on the
bipartite interaction graph. These latent user and item
representations are used to reconstruct the rating links
through a bilinear decoder.
The benefit of formulating matrix completion as a link
prediction task on a bipartite graph becomes especially
apparent when recommender graphs are accompanied
with structured external information such as social
networks. Combining such external information with
interaction data can alleviate performance bottlenecks
related to the cold start problem. We demonstrate that
our graph auto-encoder model efficiently combines interaction data with
side information, without resorting
to recurrent frameworks as in [22].
The paper is structured as follows: in Section 2 we
introduce our graph auto-encoder model for matrix
completion. Section 3 discusses related work. Experimental results are
shown in Section 4, and conclusion
and future research directions are discussed in Section
5.
[2.4 Model Training]
Loss function. During model training, we minimize the following negative log likelihood of the predicted ratings $\check{M}_{ij}$:
$$\mathcal{L} = -\sum_{i,j;\,\Omega_{ij}=1} \sum_{r=1}^{R} I[r = M_{ij}] \log p(\check{M}_{ij} = r), \quad (6)$$
with $I[k = l] = 1$ when $k = l$ and zero otherwise. The matrix $\Omega \in \{0,1\}^{N_u \times N_i}$ serves as a mask for unobserved ratings, such that ones occur for elements corresponding to observed ratings in $M$, and zeros for unobserved ratings. Hence, we only optimize over observed ratings.
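For illustration, a short sketch of Eq. (6) that sums the negative log likelihood only over observed entries; the array names (probs, M, mask) and the 0-indexed rating levels are assumptions made for this example, not part of the authors' implementation:

```python
# Minimal sketch of the masked negative log likelihood in Eq. (6).
# probs[i, j, r] is the predicted probability of rating level r for entry (i, j);
# mask plays the role of Omega (1 for observed ratings, 0 otherwise).
import numpy as np


def masked_nll(probs: np.ndarray, M: np.ndarray, mask: np.ndarray) -> float:
    obs_i, obs_j = np.nonzero(mask)               # only observed entries contribute
    true_r = M[obs_i, obs_j].astype(int)          # observed rating levels (0-indexed here)
    log_p = np.log(probs[obs_i, obs_j, true_r])
    return float(-log_p.sum())


probs = np.full((2, 3, 5), 0.2)                   # uniform predictions over 5 rating levels
M = np.array([[0, 4, 0], [2, 0, 1]])
mask = np.array([[1, 1, 0], [1, 0, 1]])
print(masked_nll(probs, M, mask))                 # 4 observed entries, each -log(0.2)
```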
Node dropout. In order for the model to generalize well to unobserved ratings, it is trained in a denoising setup by randomly dropping out all outgoing messages of a particular node with a probability $p_{\text{dropout}}$, which we will refer to as node dropout. Messages are rescaled
after dropout as in [28]. In initial experiments we found
that node dropout was more efficient in regularizing
than message dropout. In the latter case individual
outgoing messages are dropped out independently, making embeddings more
robust against the presence or
absence of single edges. In contrast, node dropout also
causes embeddings to be more independent of particular user or item
influences. We furthermore also apply
regular dropout [28] to the hidden layer units (3).
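A minimal sketch of node dropout as described above, assuming outgoing messages are stored row-wise per sending node; the function name node_dropout is a placeholder for illustration only:

```python
# Sketch of node dropout: with probability p_dropout, all outgoing messages of a
# node are dropped, and surviving messages are rescaled by 1 / (1 - p_dropout).
import numpy as np


def node_dropout(messages: np.ndarray, p_dropout: float, rng=np.random.default_rng(0)):
    """messages: (num_nodes, dim) outgoing messages, one row per sending node."""
    keep = rng.random(messages.shape[0]) >= p_dropout      # drop whole nodes, not single entries
    return messages * keep[:, None] / (1.0 - p_dropout)    # rescale the kept messages


msgs = np.ones((5, 3))
print(node_dropout(msgs, p_dropout=0.4))
```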
Mini-batching. We introduce mini-batching by sampling contributions to the loss function in Eq. (6) from different observed ratings. That is, we sample only a fixed number of contributions from the sum over user and item pairs. By only considering a fixed number of contributions to the loss function, we can remove respective rows of users and items in $M_1, \ldots, M_R$ in Eq. (7) that do not appear in the current batch. This serves both as an effective means of regularization, and reduces the memory requirement to train the model, which is necessary to fit the full model for MovieLens-10M into GPU memory. We experimentally verified that training with mini-batches and full batches leads to comparable results for the MovieLens-1M dataset while adjusting for regularization parameters. For all datasets except MovieLens-10M, we opt for full-batch training since it leads to faster convergence than training with mini-batches in this particular setting.
[5 Conclusions]
In this work, we have introduced graph convolutional
matrix completion (GC-MC): a graph auto-encoder
framework for the matrix completion task in recommender systems. The
encoder contains a graph convolution layer that constructs user and item
embeddings
through message passing on the bipartite user-item
interaction graph. Combined with a bilinear decoder,
new ratings are predicted in the form of labeled edges.
The graph auto-encoder framework naturally generalizes to include side
information for both users and
items. In this setting, our proposed model outperforms recent related
methods by a large margin, as
demonstrated on a number of benchmark datasets with
feature- and graph-based side information. We further
show that our model can be trained on larger scale
datasets through stochastic mini-batching. In this setting, our model
achieves results that are competitive
with recent state-of-the-art collaborative filtering.
In future work, we wish to extend this model to large-scale multi-modal data (comprised of text, images, and other graph-based information), such as present in many realistic recommendation platforms. In such
a setting, the GC-MC model can be combined with
recurrent (for text) or convolutional neural networks
(for images). To address scalability, it is necessary to
develop efficient approximate schemes, such as subsampling local
neighborhoods [10]. Finally, attention
mechanisms [1] provide a promising future avenue for
extending this class of models.
Acknowledgments
We would like to thank Jakub Tomczak, Christos
Louizos, Karen Ullrich and Peter Bloem for helpful
discussions and comments. This project is supported
by the SAP Innovation Center Network.
- >-
[ABSTRACT]
Fully Convolutional Neural Networks (FCNNs) with contracting and
expanding paths have shown prominence for the
majority of medical image segmentation applications since
the past decade. In FCNNs, the encoder plays an integral
role by learning both global and local features and contextual
representations which can be utilized for semantic output
prediction by the decoder. Despite their success, the locality
of convolutional layers in FCNNs, limits the capability of
learning long-range spatial dependencies. Inspired by the
recent success of transformers for Natural Language Processing (NLP) in
long-range sequence learning, we reformulate
the task of volumetric (3D) medical image segmentation as
a sequence-to-sequence prediction problem. We introduce a
novel architecture, dubbed as UNEt TRansformers (UNETR),
that utilizes a transformer as the encoder to learn sequence
representations of the input volume and effectively capture
the global multi-scale information, while also following the
successful “U-shaped” network design for the encoder and
decoder. The transformer encoder is directly connected to
a decoder via skip connections at different resolutions to
compute the final semantic segmentation output. We have
validated the performance of our method on the Multi Atlas
Labeling Beyond The Cranial Vault (BTCV) dataset for multiorgan
segmentation and the Medical Segmentation Decathlon
(MSD) dataset for brain tumor and spleen segmentation tasks.
Our benchmarks demonstrate new state-of-the-art performance on the BTCV
leaderboard.
Code: https://monai.io/research/unetr
[1. Introduction]
Image segmentation plays an integral role in quantitative
medical image analysis as it is often the first step for analysis
of anatomical structures [33].
Figure 1. Overview of UNETR. Our proposed model consists of a transformer encoder that directly utilizes 3D patches and is connected to a CNN-based decoder via skip connection.
Since the advent of deep learning, FCNNs and in particular “U-shaped” encoder-decoder architectures [22, 23, 21] have achieved state-of-the-art results
in various medical semantic segmentation tasks [2, 38, 19]. In
a typical U-Net [36] architecture, the encoder is responsible
for learning global contextual representations by gradually
downsampling the extracted features, while the decoder upsamples the
extracted representations to the input resolution
for pixel/voxel-wise semantic prediction. In addition, skip
connections merge the output of the encoder with decoder
at different resolutions, hence allowing for recovering spatial
information that is lost during downsampling.
Although such FCNN-based approaches have powerful
representation learning capabilities, their performance in
learning long-range dependencies is limited to their localized
receptive fields [ 20, 35]. As a result, such a deficiency
in capturing multi-scale information leads to sub-optimal
segmentation of structures with variable shapes and scales
(e.g. brain lesions with different sizes). Several efforts have
used atrous convolutional layers [9, 27, 18] to enlarge the
receptive fields. However, locality of the receptive fields in
convolutional layers still limits their learning capabilities to
relatively small regions. Combining self-attention modules
with convolutional layers [45, 50, 16] has been proposed to
improve the non-local modeling capability.
In Natural Language Processing (NLP), transformer-based
models [ 42, 13] achieve state-of-the-art benchmarks in
various tasks. The self-attention mechanism of transformers
allows to dynamically highlight the important features of
word sequences. Additionally, in computer vision, using
transformers as a backbone encoder is beneficial due to their
great capability of modeling long-range dependencies and
capturing global context [14, 4]. Specifically, unlike the local
formulation of convolutions, transformers encode images as
a sequence of 1D patch embeddings and utilize self-attention
modules to learn a weighted sum of values that are calculated
from hidden layers. As a result, this flexible formulation
allows to effectively learn the long-range information.
Furthermore, Vision Transformer (ViT) [14] and its variants
have shown excellent capabilities in learning pre-text tasks
that can be transferred to down-stream applications [40, 6, 3].
In this work, we propose to leverage the power of
transformers for volumetric medical image segmentation and
introduce a novel architecture dubbed as UNEt TRansformers
(UNETR). In particular, we reformulate the task of 3D segmentation as a
1D sequence-to-sequence prediction problem
and use a transformer as the encoder to learn contextual
information from the embedded input patches. The extracted
representations from the transformer encoder are merged
with the CNN-based decoder via skip connections at multiple
resolutions to predict the segmentation outputs. Instead of
using transformers in the decoder, our proposed framework
uses a CNN-based decoder. This is due to the fact that transformers are
unable to properly capture localized information,
despite their great capability of learning global information.
We validate the effectiveness of our method on 3D CT
and MRI segmentation tasks using Beyond the Cranial
Vault (BTCV) [ 26] and Medical Segmentation Decathlon
(MSD) [ 38] datasets. In BTCV dataset, UNETR achieves
new state-of-the-art performance on both Standard and
Free Competition sections on its leaderboard. UNETR
outperforms the state-of-the-art methodologies on both brain
tumor and spleen segmentation tasks in MSD dataset.
The main contributions of this work are as follows:
• We propose a novel transformer-based model for
volumetric medical image segmentation.
• To this end, we propose a novel architecture in which (1)
a transformer encoder directly utilizes the embedded 3D
volumes to effectively capture long-range dependencies;
(2) a skip-connected decoder combines the extracted
representations at different resolutions and predicts the
segmentation output.
• We validate the effectiveness of our proposed model for
different volumetric segmentation tasks on two public
datasets: BTCV [26] and MSD [38]. UNETR achieves
new state-of-the-art performance on leaderboard of
BTCV dataset and outperforms competing approaches
on the MSD dataset.
[3.1. Architecture]
We have presented an overview of the proposed model
in Fig. 2. UNETR utilizes a contracting-expanding pattern consisting of
a stack of transformers as the encoder
which is connected to a decoder via skip connections. As
commonly used in NLP, the transformers operate on a 1D sequence of input embeddings. Similarly, we create a 1D sequence from a 3D input volume $x \in \mathbb{R}^{H \times W \times D \times C}$ with resolution $(H, W, D)$ and $C$ input channels by dividing it into flattened uniform non-overlapping patches $x_v \in \mathbb{R}^{N \times (P^3 \cdot C)}$, where $(P, P, P)$ denotes the resolution of each patch and $N = (H \times W \times D)/P^3$ is the length of the sequence.
Subsequently, we use a linear layer to project the patches
into a K dimensional embedding space, which remains
constant throughout the transformer layers. In order to
preserve the spatial information of the extracted patches, we
add a 1D learnable positional embedding $E_{pos} \in \mathbb{R}^{N \times K}$ to the projected patch embedding $E \in \mathbb{R}^{(P^3 \cdot C) \times K}$ according to
$$z_0 = [x_v^1 E; x_v^2 E; \ldots; x_v^N E] + E_{pos}, \quad (1)$$
Note that the learnable [class] token is not added to
the sequence of embeddings since our transformer backbone
is designed for semantic segmentation. After the embedding
layer, we utilize a stack of transformer blocks [42, 14] comprising multi-head self-attention (MSA) and multilayer perceptron (MLP) sublayers according to
$$z'_i = \mathrm{MSA}(\mathrm{Norm}(z_{i-1})) + z_{i-1}, \quad i = 1 \ldots L, \quad (2)$$
$$z_i = \mathrm{MLP}(\mathrm{Norm}(z'_i)) + z'_i, \quad i = 1 \ldots L, \quad (3)$$
where Norm() denotes layer normalization [1], MLP comprises two linear layers with GELU activation functions, $i$ is the intermediate block identifier, and $L$ is the number of transformer layers.
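A minimal PyTorch sketch of one pre-norm transformer block implementing Eqs. (2) and (3); the class name TransformerBlock and the default hyperparameters are illustrative assumptions, not the UNETR reference code:

```python
# Sketch of one pre-norm transformer block following Eqs. (2)-(3):
# z' = MSA(Norm(z)) + z, then z = MLP(Norm(z')) + z'.
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, dim: int = 768, n_heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # MLP with two linear layers and a GELU activation, as described above.
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, z):                    # z: (batch, N patches, dim)
        y = self.norm1(z)
        z = z + self.attn(y, y, y)[0]        # Eq. (2): residual around MSA
        z = z + self.mlp(self.norm2(z))      # Eq. (3): residual around MLP
        return z


tokens = torch.randn(2, 512, 768)            # e.g. N = (H*W*D)/P^3 = 512 patches
print(TransformerBlock()(tokens).shape)      # torch.Size([2, 512, 768])
```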
An MSA sublayer comprises $n$ parallel self-attention (SA) heads. Specifically, the SA block is a parameterized function that learns the mapping between a query ($q$) and the corresponding key ($k$) and value ($v$) representations in a sequence $z \in \mathbb{R}^{N \times K}$. The attention weights ($A$) are computed by measuring the similarity between two elements in $z$ and their key-value pairs according to
$$A = \mathrm{Softmax}\left(\frac{qk^\top}{\sqrt{K_h}}\right), \quad (4)$$
where $K_h = K/n$ is a scaling factor for keeping the number of parameters constant across different values of the key $k$. Using the computed attention weights, the output of SA for the values $v$ in the sequence $z$ is computed as
$$\mathrm{SA}(z) = Av, \quad (5)$$
Here, $v$ denotes the values in the input sequence and $K_h = K/n$ is a scaling factor. Furthermore, the output of MSA is defined as
$$\mathrm{MSA}(z) = [\mathrm{SA}_1(z); \mathrm{SA}_2(z); \ldots; \mathrm{SA}_n(z)]\, W_{msa}, \quad (6)$$
where $W_{msa} \in \mathbb{R}^{n \cdot K_h \times K}$ represents the multi-headed trainable parameter weights.
Inspired by architectures that are similar to U-Net [36], where features from multiple resolutions of the encoder are merged with the decoder, we extract a sequence representation $z_i$ ($i \in \{3, 6, 9, 12\}$), with size $\frac{H \times W \times D}{P^3} \times K$, from the transformer and reshape them into a $\frac{H}{P} \times \frac{W}{P} \times \frac{D}{P} \times K$ tensor. A representation in our definition is in the embedding space after it has been reshaped as an output of the transformer with feature size $K$ (i.e. the transformer’s embedding size). Furthermore, as shown in Fig. 2, at each resolution we project the reshaped tensors from the embedding space into the input space by utilizing consecutive $3 \times 3 \times 3$ convolutional layers that are followed by normalization layers.
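As a rough illustration of this reshaping step (the sizes H = W = D = 128, P = 16, and K = 768 below are arbitrary examples, not values prescribed by the released implementation):

```python
# Sketch of reshaping a transformer sequence representation z_i of shape
# (batch, N, K), with N = (H*W*D)/P^3, into a volumetric feature map of shape
# (batch, K, H/P, W/P, D/P).
import torch

H = W = D = 128          # input volume resolution (example values)
P = 16                   # patch resolution
K = 768                  # embedding size
N = (H * W * D) // P**3  # sequence length: 512 patches

z_i = torch.randn(1, N, K)
feature_map = z_i.transpose(1, 2).reshape(1, K, H // P, W // P, D // P)
print(feature_map.shape)  # torch.Size([1, 768, 8, 8, 8])
```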
At the bottleneck of our encoder (i.e. output of transformer’s last
layer), we apply a deconvolutional layer to the
transformed feature map to increase its resolution by a factor
of 2. We then concatenate the resized feature map with the
feature map of the previous transformer output (e.g. $z_9$), and feed them into consecutive $3 \times 3 \times 3$ convolutional layers and upsample the output using a deconvolutional layer. This process is repeated for all the other subsequent layers up to the original input resolution, where the final output is fed into a $1 \times 1 \times 1$ convolutional layer with a softmax activation function to generate voxel-wise semantic predictions.
Figure 2. Overview of UNETR architecture. A 3D input volume (e.g. $C = 4$ channels for MRI images) is divided into a sequence of uniform non-overlapping patches and projected into an embedding space using a linear layer. The sequence is added with a position embedding and used as an input to a transformer model. The encoded representations of different layers in the transformer are extracted and merged with a decoder via skip connections to predict the final segmentation. Output sizes are given for patch resolution $P = 16$ and embedding size $K = 768$. (The decoder uses Conv 3×3×3 + BN + ReLU blocks, Deconv 2×2×2 upsampling, and a final Conv 1×1×1 layer.)
[7. Conclusion]
This paper introduces a novel transformer-based architecture, dubbed as
UNETR, for semantic segmentation of
volumetric medical images by reformulating this task as a 1D
sequence-to-sequence prediction problem. We proposed to
use a transformer encoder to increase the model’s capability
for learning long-range dependencies and effectively
capturing global contextual representation at multiple scales.
We validated the effectiveness of UNETR on different
volumetric segmentation tasks in CT and MRI modalities.
UNETR achieves new state-of-the-art performance in both
Standard and Free Competitions on the BTCV leaderboard
for the multi-organ segmentation and outperforms competing
approaches for brain tumor and spleen segmentation on the
MSD dataset. In conclusion, UNETR has shown the potential
to effectively learn the critical anatomical relationships
represented in medical images. The proposed method could
be the foundation for a new class of transformer-based
segmentation models in medical image analysis.
- >-
[ABSTRACT]
While the Transformer architecture has become the de-facto standard for
natural
language processing tasks, its applications to computer vision remain
limited. In
vision, attention is either applied in conjunction with convolutional
networks, or
used to replace certain components of convolutional networks while
keeping their
overall structure in place. We show that this reliance on CNNs is not
necessary
and a pure transformer applied directly to sequences of image patches
can perform
very well on image classification tasks. When pre-trained on large
amounts of
data and transferred to multiple mid-sized or small image recognition
benchmarks
(ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains
excellent
results compared to state-of-the-art convolutional networks while
requiring substantially fewer computational resources to train.1
- source_sentence: >-
Are there any frameworks that adapt to different types of image
segmentation tasks?
sentences:
- "[ABSTRACT]\nDeeper neural networks are more difficult to train. We\npresent a residual learning framework to ease the training\nof networks that are substantially deeper than those used\npreviously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual\nnetworks are easier to optimize, and can gain accuracy from\nconsiderably increased depth. On the ImageNet dataset we\nevaluate residual nets with a depth of up to 152 layers—8 ×\ndeeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error\non the ImageNettest set. This result won the 1st place on the\nILSVRC 2015 classification task. We also present analysis\non CIFAR-10 with 100 and 1000 layers.\nThe depth of representations is of central importance\nfor many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep\nresidual nets are foundations of our submissions to ILSVRC\n& COCO 2015 competitions 1, where we also won the 1st\nplaces on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.\n\n[1. Introduction]\nDeep convolutional neural networks [22, 21] have led\nto a series of breakthroughs for image classification [21,\n50, 40]. Deep networks naturally integrate low/mid/highlevel features [50] and classifiers in an end-to-end multilayer fashion, and the “levels” of features can be enriched\nby the number of stacked layers (depth). Recent evidence\n[41, 44] reveals that network depth is of crucial importance,\nand the leading results [41, 44, 13, 16] on the challenging\nImageNet dataset [36] all exploit “very deep” [41] models,\nwith a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks [8, 12, 7, 32, 27] have also\n1http://image-net.org/challenges/LSVRC/2015/ and\nhttp://mscoco.org/dataset/#detections-challenge2015.\n0 1 2 3 4 5 60 \n\niter. (1e4)\ntraining error (%)\n \n \n0 1 2 3 4 5 60\n\niter. (1e4)\ntest error (%)\n \n \n56-layer\n20-layer\n56-layer\n20-layer\nFigure 1. Training error (left) and test error (right) on CIFAR-10\nwith 20-layer and 56-layer “plain” networks. The deeper network\nhas higher training error, and thus test error. Similar phenomena\non ImageNet is presented in Fig. 4.\ngreatly benefited from very deep models.\nDriven by the significance of depth, a question arises: Is\nlearning better networks as easy as stacking more layers?\nAn obstacle to answering this question was the notorious\nproblem of vanishing/exploding gradients [1, 9], which\nhamper convergence from the beginning. This problem,\nhowever, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers\n[16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].\nWhen deeper networks are able to start converging, a\ndegradation problem has been exposed: with the network\ndepth increasing, accuracy gets saturated (which might be\nunsurprising) and then degrades rapidly. Unexpectedly,\nsuch degradation is not caused by overfitting , and adding\nmore layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by\nour experiments. Fig. 
1 shows a typical example.\nThe degradation (of training accuracy) indicates that not\nall systems are similarly easy to optimize. Let us consider a\nshallower architecture and its deeper counterpart that adds\nmore layers onto it. There exists a solution by construction\nto the deeper model: the added layers are identity mapping,\nand the other layers are copied from the learned shallower\nmodel. The existence of this constructed solution indicates\nthat a deeper model should produce no higher training error\nthan its shallower counterpart. But experiments show that\nour current solvers on hand are unable to find solutions that\n\narXiv:1512.03385v1 [cs.CV] 10 Dec 2015\nidentity\nweight layer\nweight layer\nrelu\nreluF(x)\x01+\x01x\nx\nF(x) x\nFigure 2. Residual learning: a building block.\nare comparably good or better than the constructed solution\n(or unable to do so in feasible time).\nIn this paper, we address the degradation problem by\nintroducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a\ndesired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired\nunderlying mapping as H(x), we let the stacked nonlinear\nlayers fit another mapping of F(x) :=H(x) −x. The original mapping is recast intoF(x)+ x. We hypothesize that it\nis easier to optimize the residual mapping than to optimize\nthe original, unreferenced mapping. To the extreme, if an\nidentity mapping were optimal, it would be easier to push\nthe residual to zero than to fit an identity mapping by a stack\nof nonlinear layers.\nThe formulation of F(x) +x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2).\nShortcut connections [2, 34, 49] are those skipping one or\nmore layers. In our case, the shortcut connections simply\nperform identity mapping, and their outputs are added to\nthe outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained\nend-to-end by SGD with backpropagation, and can be easily implemented using common libraries ( e.g., Caffe [19])\nwithout modifying the solvers.\nWe present comprehensive experiments on ImageNet\n[36] to show the degradation problem and evaluate our\nmethod. We show that: 1) Our extremely deep residual nets\nare easy to optimize, but the counterpart “plain” nets (that\nsimply stack layers) exhibit higher training error when the\ndepth increases; 2) Our deep residual nets can easily enjoy\naccuracy gains from greatly increased depth, producing results substantially better than previous networks.\nSimilar phenomena are also shown on the CIFAR-10 set\n[20], suggesting that the optimization difficulties and the\neffects of our method are not just akin to a particular dataset.\nWe present successfully trained models on this dataset with\nover 100 layers, and explore models with over 1000 layers.\nOn the ImageNet classification dataset [36], we obtain\nexcellent results by extremely deep residual nets. Our 152layer residual net is the deepest network ever presented on\nImageNet, while still having lower complexity than VGG\nnets [41]. Our ensemble has 3.57% top-5 error on the\nImageNet test set, and won the 1st place in the ILSVRC\n2015 classification competition . 
The extremely deep representations also have excellent generalization performance\non other recognition tasks, and lead us to further win the\n1st places on: ImageNet detection, ImageNet localization,\nCOCO detection, and COCO segmentation in ILSVRC &\nCOCO 2015 competitions. This strong evidence shows that\nthe residual learning principle is generic, and we expect that\nit is applicable in other vision and non-vision problems.\n\n[3.3. Network Architectures]\nWe have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.\nPlain Network. Our plain baselines (Fig. 3, middle) are\nmainly inspired by the philosophy of VGG nets [41] (Fig. 3,\nleft). The convolutional layers mostly have 3 ×3 filters and\nfollow two simple design rules: (i) for the same output\nfeature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by\nconvolutional layers that have a stride of 2. The network\nends with a global average pooling layer and a 1000-way\nfully-connected layer with softmax. The total number of\nweighted layers is 34 in Fig. 3 (middle).\nIt is worth noticing that our model has fewer filters and\nlower complexity than VGG nets [41] (Fig. 3, left). Our 34layer baseline has 3.6 billion FLOPs (multiply-adds), which\nis only 18% of VGG-19 (19.6 billion FLOPs).\n\n7x7 conv, 64, /2\npool, /2\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 128, /2\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 256, /2\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 512, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\navg pool\nfc 1000\nimage\n3x3 conv, 512\n3x3 conv, 64\n3x3 conv, 64\npool, /2\n3x3 conv, 128\n3x3 conv, 128\npool, /2\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\npool, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\npool, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\npool, /2\nfc 4096\nfc 4096\nfc 1000\nimage\noutput \nsize: 112\noutput \nsize: 224\noutput \nsize: 56\noutput \nsize: 28\noutput \nsize: 14\noutput \nsize: 7\noutput \nsize: 1\nVGG-19 34-layer plain\n7x7 conv, 64, /2\npool, /2\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 128, /2\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 256, /2\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 512, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\navg pool\nfc 1000\nimage\n34-layer residual\nFigure 3. Example network architectures for ImageNet. Left: the\nVGG-19 model [41] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs).\nRight: a residual network with 34 parameter layers (3.6 billion\nFLOPs). The dotted shortcuts increase dimensions. Table 1shows\nmore details and other variants.\nResidual Network. 
Based on the above plain network, we\ninsert shortcut connections (Fig. 3, right) which turn the\nnetwork into its counterpart residual version. The identity\nshortcuts (Eqn.(1)) can be directly used when the input and\noutput are of the same dimensions (solid line shortcuts in\nFig. 3). When the dimensions increase (dotted line shortcuts\nin Fig. 3), we consider two options: (A) The shortcut still\nperforms identity mapping, with extra zero entries padded\nfor increasing dimensions. This option introduces no extra\nparameter; (B) The projection shortcut in Eqn.(2) is used to\nmatch dimensions (done by 1 ×1 convolutions). For both\noptions, when the shortcuts go across feature maps of two\nsizes, they are performed with a stride of 2."
- "[ABSTRACT]\nDeeper neural networks are more difficult to train. We\npresent a residual learning framework to ease the training\nof networks that are substantially deeper than those used\npreviously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual\nnetworks are easier to optimize, and can gain accuracy from\nconsiderably increased depth. On the ImageNet dataset we\nevaluate residual nets with a depth of up to 152 layers—8 ×\ndeeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error\non the ImageNettest set. This result won the 1st place on the\nILSVRC 2015 classification task. We also present analysis\non CIFAR-10 with 100 and 1000 layers.\nThe depth of representations is of central importance\nfor many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep\nresidual nets are foundations of our submissions to ILSVRC\n& COCO 2015 competitions 1, where we also won the 1st\nplaces on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.\n\n[1. Introduction]\nDeep convolutional neural networks [22, 21] have led\nto a series of breakthroughs for image classification [21,\n50, 40]. Deep networks naturally integrate low/mid/highlevel features [50] and classifiers in an end-to-end multilayer fashion, and the “levels” of features can be enriched\nby the number of stacked layers (depth). Recent evidence\n[41, 44] reveals that network depth is of crucial importance,\nand the leading results [41, 44, 13, 16] on the challenging\nImageNet dataset [36] all exploit “very deep” [41] models,\nwith a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks [8, 12, 7, 32, 27] have also\n1http://image-net.org/challenges/LSVRC/2015/ and\nhttp://mscoco.org/dataset/#detections-challenge2015.\n0 1 2 3 4 5 60 \n\niter. (1e4)\ntraining error (%)\n \n \n0 1 2 3 4 5 60\n\niter. (1e4)\ntest error (%)\n \n \n56-layer\n20-layer\n56-layer\n20-layer\nFigure 1. Training error (left) and test error (right) on CIFAR-10\nwith 20-layer and 56-layer “plain” networks. The deeper network\nhas higher training error, and thus test error. Similar phenomena\non ImageNet is presented in Fig. 4.\ngreatly benefited from very deep models.\nDriven by the significance of depth, a question arises: Is\nlearning better networks as easy as stacking more layers?\nAn obstacle to answering this question was the notorious\nproblem of vanishing/exploding gradients [1, 9], which\nhamper convergence from the beginning. This problem,\nhowever, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers\n[16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].\nWhen deeper networks are able to start converging, a\ndegradation problem has been exposed: with the network\ndepth increasing, accuracy gets saturated (which might be\nunsurprising) and then degrades rapidly. Unexpectedly,\nsuch degradation is not caused by overfitting , and adding\nmore layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by\nour experiments. Fig. 
1 shows a typical example.\nThe degradation (of training accuracy) indicates that not\nall systems are similarly easy to optimize. Let us consider a\nshallower architecture and its deeper counterpart that adds\nmore layers onto it. There exists a solution by construction\nto the deeper model: the added layers are identity mapping,\nand the other layers are copied from the learned shallower\nmodel. The existence of this constructed solution indicates\nthat a deeper model should produce no higher training error\nthan its shallower counterpart. But experiments show that\nour current solvers on hand are unable to find solutions that\n\narXiv:1512.03385v1 [cs.CV] 10 Dec 2015\nidentity\nweight layer\nweight layer\nrelu\nreluF(x)\x01+\x01x\nx\nF(x) x\nFigure 2. Residual learning: a building block.\nare comparably good or better than the constructed solution\n(or unable to do so in feasible time).\nIn this paper, we address the degradation problem by\nintroducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a\ndesired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired\nunderlying mapping as H(x), we let the stacked nonlinear\nlayers fit another mapping of F(x) :=H(x) −x. The original mapping is recast intoF(x)+ x. We hypothesize that it\nis easier to optimize the residual mapping than to optimize\nthe original, unreferenced mapping. To the extreme, if an\nidentity mapping were optimal, it would be easier to push\nthe residual to zero than to fit an identity mapping by a stack\nof nonlinear layers.\nThe formulation of F(x) +x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2).\nShortcut connections [2, 34, 49] are those skipping one or\nmore layers. In our case, the shortcut connections simply\nperform identity mapping, and their outputs are added to\nthe outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained\nend-to-end by SGD with backpropagation, and can be easily implemented using common libraries ( e.g., Caffe [19])\nwithout modifying the solvers.\nWe present comprehensive experiments on ImageNet\n[36] to show the degradation problem and evaluate our\nmethod. We show that: 1) Our extremely deep residual nets\nare easy to optimize, but the counterpart “plain” nets (that\nsimply stack layers) exhibit higher training error when the\ndepth increases; 2) Our deep residual nets can easily enjoy\naccuracy gains from greatly increased depth, producing results substantially better than previous networks.\nSimilar phenomena are also shown on the CIFAR-10 set\n[20], suggesting that the optimization difficulties and the\neffects of our method are not just akin to a particular dataset.\nWe present successfully trained models on this dataset with\nover 100 layers, and explore models with over 1000 layers.\nOn the ImageNet classification dataset [36], we obtain\nexcellent results by extremely deep residual nets. Our 152layer residual net is the deepest network ever presented on\nImageNet, while still having lower complexity than VGG\nnets [41]. Our ensemble has 3.57% top-5 error on the\nImageNet test set, and won the 1st place in the ILSVRC\n2015 classification competition . 
The extremely deep representations also have excellent generalization performance\non other recognition tasks, and lead us to further win the\n1st places on: ImageNet detection, ImageNet localization,\nCOCO detection, and COCO segmentation in ILSVRC &\nCOCO 2015 competitions. This strong evidence shows that\nthe residual learning principle is generic, and we expect that\nit is applicable in other vision and non-vision problems.\n\n[3.3. Network Architectures]\nWe have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.\nPlain Network. Our plain baselines (Fig. 3, middle) are\nmainly inspired by the philosophy of VGG nets [41] (Fig. 3,\nleft). The convolutional layers mostly have 3 ×3 filters and\nfollow two simple design rules: (i) for the same output\nfeature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by\nconvolutional layers that have a stride of 2. The network\nends with a global average pooling layer and a 1000-way\nfully-connected layer with softmax. The total number of\nweighted layers is 34 in Fig. 3 (middle).\nIt is worth noticing that our model has fewer filters and\nlower complexity than VGG nets [41] (Fig. 3, left). Our 34layer baseline has 3.6 billion FLOPs (multiply-adds), which\nis only 18% of VGG-19 (19.6 billion FLOPs).\n\n7x7 conv, 64, /2\npool, /2\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 128, /2\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 256, /2\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 512, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\navg pool\nfc 1000\nimage\n3x3 conv, 512\n3x3 conv, 64\n3x3 conv, 64\npool, /2\n3x3 conv, 128\n3x3 conv, 128\npool, /2\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\npool, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\npool, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\npool, /2\nfc 4096\nfc 4096\nfc 1000\nimage\noutput \nsize: 112\noutput \nsize: 224\noutput \nsize: 56\noutput \nsize: 28\noutput \nsize: 14\noutput \nsize: 7\noutput \nsize: 1\nVGG-19 34-layer plain\n7x7 conv, 64, /2\npool, /2\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 64\n3x3 conv, 128, /2\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 128\n3x3 conv, 256, /2\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 256\n3x3 conv, 512, /2\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\n3x3 conv, 512\navg pool\nfc 1000\nimage\n34-layer residual\nFigure 3. Example network architectures for ImageNet. Left: the\nVGG-19 model [41] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs).\nRight: a residual network with 34 parameter layers (3.6 billion\nFLOPs). The dotted shortcuts increase dimensions. Table 1shows\nmore details and other variants.\nResidual Network. 
Based on the above plain network, we\ninsert shortcut connections (Fig. 3, right) which turn the\nnetwork into its counterpart residual version. The identity\nshortcuts (Eqn.(1)) can be directly used when the input and\noutput are of the same dimensions (solid line shortcuts in\nFig. 3). When the dimensions increase (dotted line shortcuts\nin Fig. 3), we consider two options: (A) The shortcut still\nperforms identity mapping, with extra zero entries padded\nfor increasing dimensions. This option introduces no extra\nparameter; (B) The projection shortcut in Eqn.(2) is used to\nmatch dimensions (done by 1 ×1 convolutions). For both\noptions, when the shortcuts go across feature maps of two\nsizes, they are performed with a stride of 2."
- >-
[Preamble]
nnU-Net: Self-adapting Framework
for U-Net-Based Medical Image Segmentation
Fabian Isensee, Jens Petersen, Andre Klein, David Zimmerer, Paul F.
Jaeger,
Simon Kohl, Jakob Wasserthal, Gregor K¨ ohler, Tobias Norajitra,
Sebastian
Wirkert, and Klaus H. Maier-Hein
Division of Medical Image Computing, German Cancer Research Center
(DKFZ),
Heidelberg, Germany
Abstract. The U-Net was presented in 2015. With its straight-forward
and successful architecture it quickly evolved to a commonly used
benchmark in medical image segmentation. The adaptation of the U-Net to
novel problems, however, comprises several degrees of freedom regarding
the exact architecture, pre-processing, training and inference. These
choices are not independent of each other and substantially impact the
overall performance. The present paper introduces the nnU-Net
(”no-new-Net”), which refers to a robust and self-adapting framework on
the
basis of 2D and 3D vanilla U-Nets. We argue the strong case for taking
away superfluous bells and whistles of many proposed network designs
and instead focus on the remaining aspects that make out the performance
and generalizability of a method. We evaluate the nnU-Net in the
context of the Medical Segmentation Decathlon challenge, which measures
segmentation performance in ten disciplines comprising distinct
entities, image modalities, image geometries and dataset sizes, with no
manual adjustments between datasets allowed. At the time of manuscript
submission, nnU-Net achieves the highest mean dice scores across all
classes and seven phase 1 tasks (except class 1 in BrainTumour) in the
online leaderboard of the challenge.
Keywords: Semantic Segmentation, Medical Imaging, U-Net
[1 Introduction]
Medical Image Segmentation is currently dominated by deep convolutional
neural networks (CNNs). However, each segmentation benchmark seems to
require
specialized architectures and training scheme modifications to achieve
competitive performance [1,2,3,4,5]. This results in huge amounts of
publications in the
field that, alongside often limited validation on only few or even just a
single
dataset, make it increasingly difficult for researchers to identify
methods that live
up to their promised superiority beyond the limited scenarios they are
demonstrated on. The Medical Segmentation Decathlon is intended to
specifically address this issue: participants in this challenge are asked
to create a segmentation
algorithm that generalizes across 10 datasets corresponding to different
entities
of the human body. These algorithms may dynamically adapt to the
specifics
of a particular dataset, but are only allowed to do so in a fully
automatic manner. The challenge is split into two successive phases: 1)
a development phase in
which participants are given access to 7 datasets to optimize their
approach on
and, using their final and thus frozen method, must submit segmentations
for
the corresponding 7 held-out test sets. 2) a second phase to evaluate
the same
exact method on 3 previously undisclosed datasets.
We hypothesize that some of the architectural modifications presented
recently are in part overfitted to specific problems or could suffer from
imperfect
validation that results from sub-optimal reimplementations of the
state-of-the-art. Using the U-Net as a benchmark on an in-house dataset,
for example, requires the adaptation of the method to the novel problem.
This spans several
degrees of freedom. Even though the architecture itself is quite
straight-forward,
and even though the method is quite commonly used as a benchmark, we
believe
that the remaining interdependent choices regarding the exact
architecture, preprocessing, training, inference and post-processing
quite often cause the U-Net
to underperform when used as a benchmark. Additionally, architectural
tweaks
that are intended to improve the performance of a network can rather
easily
be demonstrated to work if the network is not yet fully optimized for
the task
at hand, allowing for plenty of headroom for the tweak to improve
results. In
our own preliminary experiments, these tweaks however were unable to
improve
segmentation results in fully optimized networks and thus most likely
unable
to advance the state of the art. This leads us to believe that the
influence of
non-architectural aspects in segmentation methods is much more
impactful, but
at the same time also severely underestimated.
In this paper, we present the nnU-Net (”no-new-Net”) framework. It
resides
on a set of three comparatively simple U-Net models that contain only
minor
modifications to the original U-Net [6]. We omit recently proposed
extensions
such as for example the use of residual connections [7,8], dense
connections [5] or
attention mechanisms [4]. The nnU-Net automatically adapts its
architectures
to the given image geometry. More importantly though, the nnU-Net
framework
thoroughly defines all the other steps around them. These are steps where
much
of the nets’ performance can be gained or respectively lost:
preprocessing (e.g.
resampling and normalization), training (e.g. loss, optimizer setting
and data
augmentation), inference (e.g. patch-based strategy and ensembling
across test-time augmentations and models) and a potential
post-processing (e.g. enforcing
single connected components if applicable).
[2.1 Network Architectures]
Medical images commonly encompass a third dimension, which is why we
consider a pool of basic U-Net architectures consisting of a 2D U-Net, a
3D U-Net
and a U-Net Cascade. While the 2D and 3D U-Nets generate segmentations
at full resolution, the cascade first generates low resolution
segmentations and
subsequently refines them. Our architectural modifications as compared to
the
U-Net’s original formulation are close to negligible and instead we
focus our
efforts on designing an automatic training pipeline for these models.
The U-Net [6] is a successful encoder-decoder network that has received
a lot
of attention in the recent years. Its encoder part works similarly to a
traditional
classification CNN in that it successively aggregates semantic
information at the
expense of reduced spatial information. Since in segmentation, both
semantic as
well as spatial information are crucial for the success of a network,
the missing
spatial information must somehow be recovered. The U-Net does this
through
the decoder, which receives semantic information from the bottom of the
’U’
and recombines it with higher resolution feature maps obtained directly
from
the encoder through skip connections. Unlike other segmentation
networks, such
as FCN [9] and previous iterations of DeepLab [10] this allows the U-Net
to
segment fine structures particularly well.
Just like the original U-Net, we use two plain convolutional layers
between
poolings in the encoder and transposed convolution operations in the
decoder.
We deviate from the original architecture in that we replace ReLU
activation
functions with leaky ReLUs (neg. slope 1e−2) and use instance
normalization [11]
instead of the more popular batch normalization [12].
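A minimal sketch, assuming PyTorch, of the computational block this description implies: two plain convolutions, each followed by instance normalization and a leaky ReLU with negative slope 1e-2. The function name and channel counts are illustrative, not taken from the nnU-Net code.

```python
import torch
import torch.nn as nn

def conv_block_3d(in_channels, out_channels):
    """Two plain 3x3x3 convolutions, each followed by instance normalization
    and a leaky ReLU with negative slope 1e-2, as described above."""
    return nn.Sequential(
        nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.InstanceNorm3d(out_channels),
        nn.LeakyReLU(negative_slope=1e-2, inplace=True),
        nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1),
        nn.InstanceNorm3d(out_channels),
        nn.LeakyReLU(negative_slope=1e-2, inplace=True),
    )

# Example: an encoder block with 30 feature maps operating on a 3D patch.
block = conv_block_3d(1, 30)
out = block(torch.randn(2, 1, 64, 64, 64))  # -> (2, 30, 64, 64, 64)
```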
2D U-Net Intuitively, using a 2D U-Net in the context of 3D medical
image segmentation appears to be suboptimal because valuable information
along
the z-axis cannot be aggregated and taken into consideration. However,
there
is evidence [13] that conventional 3D segmentation methods deteriorate
in performance if the dataset is anisotropic (cf. Prostate dataset of
the Decathlon
challenge).
3D U-Net A 3D U-Net seems like the appropriate method of choice for 3D
image data. In an ideal world, we would train such an architecture on
the entire
patient’s image. In reality however, we are limited by the amount of
available
GPU memory which allows us to train this architecture only on image
patches.
While this is not a problem for datasets comprised of smaller images (in
terms
of number of voxels per patient) such as the Brain Tumour, Hippocampus
and
Prostate datasets of this challenge, patch-based training, as dictated
by datasets
with large images such as Liver, may impede training. This is due to the
limited
field of view of the architecture which thus cannot collect sufficient
contextual
information to e.g. correctly distinguish parts of a liver from parts of
other
organs.
U-Net Cascade To address this practical shortcoming of a 3D U-Net on
datasets with large image sizes, we additionally propose a cascaded
model. Therefore, a 3D U-Net is first trained on downsampled images
(stage 1). The segmentation results of this U-Net are then upsampled to
the original voxel spacing and
passed as additional (one hot encoded) input channels to a second 3D
U-Net,
which is trained on patches at full resolution (stage 2). See Figure 1.
[Figure 1 diagram: full res. image → stage 1 (low res. seg.) → up/downsampling, cropping, skip conn. → stage 2 (full res. seg.)]
Fig. 1. U-Net Cascade (on applicable datasets only). Stage 1 (left): a 3D
U-Net processes downsampled data, the resulting segmentation maps are
upsampled to the original resolution. Stage 2 (right): these
segmentations are concatenated as one-hot encodings to the full
resolution data and refined by a second 3D U-Net.
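The stage-2 input construction described above (upsampling the stage-1 segmentation and concatenating it as one-hot channels with the full-resolution data) could look roughly as follows in PyTorch; the function name, shapes, and class count are hypothetical, not taken from the nnU-Net code.

```python
import torch
import torch.nn.functional as F

def build_stage2_input(full_res_image, low_res_seg, num_classes):
    """Upsample a stage-1 segmentation to the full-resolution grid, one-hot
    encode it, and concatenate it with the image as extra input channels for
    the stage-2 3D U-Net (a sketch of the cascade in Fig. 1)."""
    # low_res_seg: (B, D', H', W') integer class labels from stage 1
    # full_res_image: (B, C, D, H, W)
    target_size = full_res_image.shape[2:]
    upsampled = F.interpolate(
        low_res_seg.unsqueeze(1).float(), size=target_size, mode="nearest"
    ).squeeze(1).long()
    one_hot = F.one_hot(upsampled, num_classes)          # (B, D, H, W, K)
    one_hot = one_hot.permute(0, 4, 1, 2, 3).float()      # (B, K, D, H, W)
    return torch.cat([full_res_image, one_hot], dim=1)    # (B, C+K, D, H, W)

# Example with hypothetical shapes:
image = torch.randn(1, 1, 128, 128, 128)
seg_lowres = torch.randint(0, 3, (1, 64, 64, 64))
stage2_in = build_stage2_input(image, seg_lowres, num_classes=3)  # (1, 4, 128, 128, 128)
```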
Dynamic adaptation of network topologies. Due to the large differences
in image size (median shape 482 ×512 ×512 for Liver vs. 36 ×50 ×35 for
Hippocampus) the input patch size and number of pooling operations per
axis
(and thus implicitly the number of convolutional layers) must be
automatically
adapted for each dataset to allow for adequate aggregation of spatial
information.
Apart from adapting to the image geometries, there are technical
constraints like
the available memory to account for. Our guiding principle in this
respect is to
dynamically trade off the batch-size versus the network capacity,
presented in
detail below:
We start out with network configurations that we know to be working with
our hardware setup. For the 2D U-Net this configuration is an input patch
size of
256×256, a batch size of 42 and 30 feature maps in the highest layers
(number of
feature maps doubles with each downsampling). We automatically adapt
these
parameters to the median plane size of each dataset (where we use the
plane
with the lowest in-plane spacing, corresponding to the highest
resolution), so
that the network effectively trains on entire slices. We configure the
networks to
pool along each axis until the feature map size for that axis is smaller
than 8 (but
not more than a maximum of 6 pooling operations). Just like the 2D
U-Net, our
3D U-Net uses 30 feature maps at the highest resolution layers. Here we
start
with a base configuration of input patch size 128 ×128 ×128, and a batch
size
of 2. Due to memory constraints, we do not increase the input patch
volume
beyond 128³ voxels, but instead match the aspect ratio of the input
patch size
to that of the median size of the dataset in voxels. If the median shape
of the
dataset is smaller than 128³, then we use the median shape as input
patch size
and increase the batch size (so that the total number of voxels
processed is the
same as with 128 ×128 ×128 and a batch size of 2). Just like for the 2D
U-Net
we pool (for a maximum of 5 times) along each axis until the feature
maps have
size 8.
For any network we limit the total number of voxels processed per optimizer
step (defined as the input patch volume times the batch size) to a maximum of
5% of the dataset. For cases in excess, we reduce the batch size (with a
lower bound of 2). All network topologies generated for the phase 1 datasets
are presented in Table 1.

                         2D U-Net      3D U-Net       3D U-Net lowres
BrainTumour
  median patient shape   169x138       138x169x138    -
  input patch size       192x160       128x128x128    -
  batch size             89            2              -
  num pool per axis      5, 5          5, 5, 5        -
Heart
  median patient shape   320x232       115x320x232    58x160x116
  input patch size       320x256       80x192x128     64x160x128
  batch size             33            2              2
  num pool per axis      6, 6          4, 5, 5        4, 5, 5
Liver
  median patient shape   512x512       482x512x512    121x128x128
  input patch size       512x512       128x128x128    128x128x128
  batch size             10            2              2
  num pool per axis      6, 6          5, 5, 5        5, 5, 5
Hippocampus
  median patient shape   50x35         36x50x35       -
  input patch size       56x40         40x56x40       -
  batch size             366           9              -
  num pool per axis      3, 3          3, 3, 3        -
Prostate
  median patient shape   320x319       20x320x319     -
  input patch size       320x320       20x192x192     -
  batch size             26            4              -
  num pool per axis      6, 6          2, 5, 5        -
Lung
  median patient shape   512x512       252x512x512    126x256x256
  input patch size       512x512       112x128x128    112x128x128
  batch size             10            2              2
  num pool per axis      6, 6          4, 5, 5        4, 5, 5
Pancreas
  median patient shape   512x512       96x512x512     96x256x256
  input patch size       512x512       96x160x128     96x160x128
  batch size             10            2              2
  num pool per axis      6, 6          4, 5, 5        4, 5, 5
Table 1. Network topologies as automatically generated for the seven phase 1
tasks of the Medical Segmentation Decathlon challenge. 3D U-Net lowres refers
to the first stage of the U-Net Cascade. The configuration of the second stage
of the U-Net Cascade is identical to the 3D U-Net.
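The per-axis pooling rule described above (pool along an axis until its feature-map size drops below 8, up to a fixed maximum number of poolings) can be sketched as a small Python function; the name and defaults are chosen here for illustration, but the examples reproduce rows of Table 1.

```python
def num_poolings_per_axis(patch_size, min_size=8, max_poolings=5):
    """Count poolings per axis: halve the axis until its feature-map size is
    smaller than `min_size`, but at most `max_poolings` times (5 for the 3D
    U-Net, 6 for the 2D U-Net), as described above."""
    pools = []
    for size in patch_size:
        n = 0
        while size >= min_size and n < max_poolings:
            size //= 2
            n += 1
        pools.append(n)
    return pools

# Examples reproducing rows of Table 1:
print(num_poolings_per_axis((128, 128, 128)))             # [5, 5, 5]  BrainTumour, 3D U-Net
print(num_poolings_per_axis((20, 192, 192)))              # [2, 5, 5]  Prostate, 3D U-Net
print(num_poolings_per_axis((320, 256), max_poolings=6))  # [6, 6]     Heart, 2D U-Net
```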
- source_sentence: What is the role of online and target networks in the BYOL architecture?
sentences:
- >-
[ABSTRACT]
Relying entirely on an attention mechanism,
the Transformer introduced by Vaswani et
al. (2017) achieves state-of-the-art results for
machine translation. In contrast to recurrent
and convolutional neural networks, it does
not explicitly model relative or absolute position information in its
structure. Instead,
it requires adding representations of absolute positions to its inputs.
In this work
we present an alternative approach, extending the self-attention
mechanism to efficiently
consider representations of the relative positions, or distances between
sequence elements.
On the WMT 2014 English-to-German and
English-to-French translation tasks, this approach yields improvements
of 1.3 BLEU and
0.3 BLEU over absolute position representations, respectively. Notably,
we observe that
combining relative and absolute position representations yields no
further improvement in
translation quality. We describe an efficient
implementation of our method and cast it as an
instance of relation-aware self-attention mechanisms that can generalize
to arbitrary graphlabeled inputs.
[1 Introduction]
Recent approaches to sequence to sequence learning typically leverage
recurrence (Sutskever et al.,
2014), convolution (Gehring et al., 2017; Kalchbrenner et al., 2016),
attention (Vaswani et al.,
2017), or a combination of recurrence and attention (Bahdanau et al.,
2014; Cho et al., 2014; Luong et al., 2015; Wu et al., 2016) as basic
building
blocks. These approaches incorporate information
about the sequential position of elements differently.
Recurrent neural networks (RNNs) typically
compute a hidden state ht, as a function of their
input at time t and a previous hidden state ht−1,
capturing relative and absolute positions along the
time dimension directly through their sequential
structure. Non-recurrent models do not necessarily consider input
elements sequentially and may
hence require explicitly encoding position information to be able to use
sequence order.
One common approach is to use position encodings which are combined with
input elements to
expose position information to the model. These
position encodings can be a deterministic function of position
(Sukhbaatar et al., 2015; Vaswani
et al., 2017) or learned representations. Convolutional neural networks
inherently capture relative
positions within the kernel size of each convolution. They have been
shown to still benefit from
position encodings (Gehring et al., 2017), however.
For the Transformer, which employs neither
convolution nor recurrence, incorporating explicit
representations of position information is an especially important
consideration since the model is
otherwise entirely invariant to sequence ordering.
Attention-based models have therefore used position encodings or biased
attention weights based
on distance (Parikh et al., 2016).
In this work we present an efficient way of
incorporating relative position representations in
the self-attention mechanism of the Transformer.
Even when entirely replacing its absolute position
encodings, we demonstrate significant improvements in translation quality
on two machine translation tasks.
Our approach can be cast as a special case of extending the
self-attention mechanism of the Transformer to considering arbitrary
relations between
any two elements of the input, a direction we plan
to explore in future work on modeling labeled, directed graphs.
arXiv:1803.02155v2 [cs.CL] 12 Apr 2018
[5 Conclusions]
In this paper we presented an extension to selfattention that can be
used to incorporate relative position information for sequences, which
improves performance for machine translation.
For future work, we plan to extend this mechanism to consider arbitrary
directed, labeled graph
inputs to the Transformer. We are also interested in nonlinear
compatibility functions to combine input representations and edge
representations. For both of these extensions, a key consideration will
be determining efficient implementations.
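As a rough illustration of the mechanism described in this excerpt, the following NumPy sketch adds learned embeddings of the clipped relative distance to both keys and values in single-head self-attention. The weights are random and the function name is chosen here, so this is not the authors' implementation.

```python
import numpy as np

def relative_self_attention(x, Wq, Wk, Wv, rel_k, rel_v, max_dist):
    """Single-head self-attention with relative position representations,
    a sketch of the mechanism described above.

    x: (n, d) inputs; Wq, Wk, Wv: (d, d_z) projections;
    rel_k, rel_v: (2*max_dist + 1, d_z) embeddings of the clipped distance j - i."""
    n, _ = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_z = q.shape[-1]

    # Clipped relative distances mapped to embedding indices 0..2*max_dist.
    idx = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -max_dist, max_dist) + max_dist
    a_k = rel_k[idx]   # (n, n, d_z), added to the keys
    a_v = rel_v[idx]   # (n, n, d_z), added to the values

    # e_ij = q_i . (k_j + a^K_ij) / sqrt(d_z)
    logits = (q @ k.T + np.einsum("id,ijd->ij", q, a_k)) / np.sqrt(d_z)
    alpha = np.exp(logits - logits.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)

    # z_i = sum_j alpha_ij (v_j + a^V_ij)
    return alpha @ v + np.einsum("ij,ijd->id", alpha, a_v)

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
n, d, d_z, k_max = 5, 8, 8, 2
out = relative_self_attention(
    rng.normal(size=(n, d)),
    rng.normal(size=(d, d_z)), rng.normal(size=(d, d_z)), rng.normal(size=(d, d_z)),
    rng.normal(size=(2 * k_max + 1, d_z)), rng.normal(size=(2 * k_max + 1, d_z)),
    max_dist=k_max,
)
print(out.shape)  # (5, 8)
```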
- >-
[ABSTRACT]
We introduce Bootstrap Your Own Latent (BYOL), a new approach to
self-supervised image
representation learning. BYOL relies on two neural networks, referred to
as online and target
networks, that interact and learn from each other. From an augmented
view of an image, we train
the online network to predict the target network representation of the
same image under a different
augmented view. At the same time, we update the target network with a
slow-moving average
of the online network. While state-of-the art methods rely on negative
pairs, BYOL achieves a
new state of the art without them. BYOL reaches 74.3% top-1
classification accuracy on ImageNet
using a linear evaluation with a ResNet-50 architecture and 79.6% with a
larger ResNet. We
show that BYOL performs on par or better than the current state of the
art on both transfer and
semi-supervised benchmarks. Our implementation and pretrained models are
given on GitHub.3
[1 Introduction]
[Figure 1 plot: ImageNet top-1 accuracy (%) versus number of parameters for BYOL, SimCLR, MoCo, MoCov2, CMC, AMDIM, InfoMin, CPCv2-L and supervised baselines]
Figure 1: Performance of BYOL on ImageNet (linear evaluation) using
ResNet-50 and our best architecture ResNet-200 (2×), compared to other
unsupervised and supervised (Sup.) baselines [8].
Learning good image representations is a key challenge
in computer vision [ 1, 2, 3] as it allows for efficient
training on downstream tasks [ 4, 5, 6, 7]. Many different training
approaches have been proposed to learn
such representations, usually relying on visual pretext
tasks. Among them, state-of-the-art contrastive methods [8, 9, 10, 11,
12] are trained by reducing the distance between representations of
different augmented
views of the same image (‘positive pairs’), and increasing the distance
between representations of augmented
views from different images (‘negative pairs’). These
methods need careful treatment of negative pairs [13]
by either relying on large batch sizes [8, 12], memory
banks [9] or customized mining strategies [14, 15] to retrieve the
negative pairs. In addition, their performance
critically depends on the choice of image augmentations [8, 12].
In this paper, we introduce Bootstrap Your Own
Latent ( BYOL), a new algorithm for self-supervised
learning of image representations. BYOL achieves higher
performance than state-of-the-art contrastive methods
∗Equal contribution; the order of first authors was randomly selected.
3https://github.com/deepmind/deepmind-research/tree/master/byol
arXiv:2006.07733v3 [cs.LG] 10 Sep 2020
without using negative pairs. It iteratively bootstraps the outputs of
a network to serve as targets for an enhanced
representation. Moreover, BYOL is more robust to the choice of image
augmentations than contrastive methods; we
suspect that not relying on negative pairs is one of the leading reasons
for its improved robustness. While previous
methods based on bootstrapping have used pseudo-labels [16], cluster
indices [17] or a handful of labels [18, 19, 20],
we propose to directly bootstrap the representations. In particular,
BYOL uses two neural networks, referred to
as online and target networks, that interact and learn from each other.
Starting from an augmented view of an
image, BYOL trains its online network to predict the target network’s
representation of another augmented view of
the same image. While this objective admits collapsed solutions, e.g.,
outputting the same vector for all images,
we empirically show that BYOL does not converge to such solutions. We
hypothesize (see Section 3.2) that the
combination of (i) the addition of a predictor to the online network and
(ii) the use of a slow-moving average of
the online parameters as the target network encourages encoding more and
more information within the online
projection and avoids collapsed solutions.
We evaluate the representation learned by BYOL on ImageNet [ 21] and
other vision benchmarks using ResNet
architectures [22]. Under the linear evaluation protocol on ImageNet,
consisting in training a linear classifier on
top of the frozen representation, BYOL reaches 74.3% top-1 accuracy with
a standard ResNet-50 and 79.6% top-1
accuracy with a larger ResNet (Figure 1). In the semi-supervised and
transfer settings on ImageNet, we obtain results
on par or superior to the current state of the art. Our contributions
are: ( i) We introduce BYOL, a self-supervised
representation learning method (Section 3) which achieves
state-of-the-art results under the linear evaluation
protocol on ImageNet without using negative pairs. (ii) We show that our
learned representation outperforms the
state of the art on semi-supervised and transfer benchmarks (Section 4).
(iii) We show that BYOL is more resilient to
changes in the batch size and in the set of image augmentations compared
to its contrastive counterparts (Section 5).
In particular, BYOL suffers a much smaller performance drop than SimCLR,
a strong contrastive baseline, when only
using random crops as image augmentations.
[3 Method]
We start by motivating our method before explaining its details in
Section 3.1. Many successful self-supervised
learning approaches build upon the cross-view prediction framework
introduced in [63]. Typically, these approaches
learn representations by predicting different views (e.g., different
random crops) of the same image from one
another. Many such approaches cast the prediction problem directly in
representation space: the representation of
an augmented view of an image should be predictive of the representation
of another augmented view of the same
image. However, predicting directly in representation space can lead to
collapsed representations: for instance, a
representation that is constant across views is always fully predictive
of itself. Contrastive methods circumvent
this problem by reformulating the prediction problem into one of
discrimination: from the representation of an
augmented view, they learn to discriminate between the representation of
another augmented view of the same
image, and the representations of augmented views of different images.
In the vast majority of cases, this prevents
the training from finding collapsed representations. Yet, this
discriminative approach typically requires comparing
each representation of an augmented view with many negative examples, to
find ones sufficiently close to make the
discrimination task challenging. In this work, we thus tasked ourselves
to find out whether these negative examples
are indispensable to prevent collapsing while preserving high
performance.
To prevent collapse, a straightforward solution is to use a fixed
randomly initialized network to produce the targets
for our predictions. While avoiding collapse, it empirically does not
result in very good representations. Nonetheless,
it is interesting to note that the representation obtained using this
procedure can already be much better than the
initial fixed representation. In our ablation study (Section 5), we apply
this procedure by predicting a fixed randomly
initialized network and achieve 18.8% top-1 accuracy (Table 5a) on the
linear evaluation protocol on ImageNet,
whereas the randomly initialized network only achieves 1.4% by itself.
This experimental finding is the core
motivation for BYOL: from a given representation, referred to as target,
we can train a new, potentially enhanced
representation, referred to as online, by predicting the target
representation. From there, we can expect to build a
sequence of representations of increasing quality by iterating this
procedure, using subsequent online networks as
new target networks for further training. In practice, BYOL generalizes
this bootstrapping procedure by iteratively
refining its representation, but using a slowly moving exponential
average of the online network as the target network
instead of fixed checkpoints.
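A compact PyTorch sketch of the training step implied by this description: the online network plus a predictor regresses the target network's projection of another view, and the target parameters follow a slow-moving exponential average of the online ones. The toy modules below merely stand in for the ResNet encoder and MLP projector/predictor used in the paper; this is an illustration, not the released implementation.

```python
import copy
import torch
import torch.nn.functional as F

def byol_loss(p_online, z_target):
    """Negative cosine similarity between the online prediction and the
    target projection (equivalent to the normalized MSE used by BYOL)."""
    p = F.normalize(p_online, dim=-1)
    z = F.normalize(z_target, dim=-1)
    return 2 - 2 * (p * z).sum(dim=-1).mean()

@torch.no_grad()
def update_target(online, target, tau=0.996):
    """Target parameters are a slow-moving (exponential) average of the
    online parameters, as described above."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(tau).add_((1 - tau) * p_o)

# Hypothetical stand-ins for encoder + projector (online and target) and the
# extra predictor (online only).
online = torch.nn.Sequential(torch.nn.Linear(32, 16), torch.nn.ReLU(), torch.nn.Linear(16, 8))
predictor = torch.nn.Linear(8, 8)
target = copy.deepcopy(online)
for p in target.parameters():
    p.requires_grad_(False)

view1, view2 = torch.randn(4, 32), torch.randn(4, 32)      # two augmented views
loss = byol_loss(predictor(online(view1)), target(view2)) \
     + byol_loss(predictor(online(view2)), target(view1))   # symmetrized loss
loss.backward()              # gradients flow only through the online network
update_target(online, target)
```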
[6 Conclusion]
We introduced BYOL, a new algorithm for self-supervised learning of
image representations. BYOL learns its
representation by predicting previous versions of its outputs, without
using negative pairs. We show that BYOL
achieves state-of-the-art results on various benchmarks. In particular,
under the linear evaluation protocol on
ImageNet with a ResNet- 50 (1×), BYOL achieves a new state of the art
and bridges most of the remaining gap
between self-supervised methods and the supervised learning baseline of
[ 8]. Using a ResNet- 200 (2×), BYOL
reaches a top-1 accuracy of 79.6% which improves over the previous state
of the art (76.8%) while using 30% fewer
parameters.
Nevertheless, BYOL remains dependent on existing sets of augmentations
that are specific to vision applications.
To generalize BYOL to other modalities (e.g., audio, video, text, . . .
) it is necessary to obtain similarly suitable
augmentations for each of them. Designing such augmentations may require
significant effort and expertise.
Therefore, automating the search for these augmentations would be an
important next step to generalize BYOL to
other modalities.
Broader impact
The presented research should be categorized as research in the field of
unsupervised learning. This work may
inspire new algorithms, theoretical, and experimental investigation. The
algorithm presented here can be used for
many different vision applications and a particular use may have both
positive or negative impacts, which is known
as the dual use problem. Besides, as vision datasets could be biased,
the representation learned by BYOL could be
susceptible to replicate these biases.
Acknowledgements
The authors would like to thank the following people for their help
throughout the process of writing this paper, in
alphabetical order: Aaron van den Oord, Andrew Brock, Jason Ramapuram,
Jeffrey De Fauw, Karen Simonyan,
Katrina McKinney, Nathalie Beauguerlange, Olivier Henaff, Oriol Vinyals,
Pauline Luc, Razvan Pascanu, Sander
Dieleman, and the DeepMind team. We especially thank Jason Ramapuram and
Jeffrey De Fauw, who provided the
JAX SimCLR reproduction used throughout the paper.
- >-
[ABSTRACT]
The dominant sequence transduction models are based on complex recurrent
or
convolutional neural networks that include an encoder and a decoder. The
best
performing models also connect the encoder and decoder through an
attention
mechanism. We propose a new simple network architecture, the
Transformer,
based solely on attention mechanisms, dispensing with recurrence and
convolutions
entirely. Experiments on two machine translation tasks show these models
to
be superior in quality while being more parallelizable and requiring
significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014
English-to-German translation task, improving over the existing best
results, including
ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation
task,
our model establishes a new single-model state-of-the-art BLEU score of
41.8 after
training for 3.5 days on eight GPUs, a small fraction of the training
costs of the
best models from the literature. We show that the Transformer
generalizes well to
other tasks by applying it successfully to English constituency parsing
both with
large and limited training data.
∗Equal contribution. Listing order is random. Jakob proposed replacing
RNNs with self-attention and started
the effort to evaluate this idea. Ashish, with Illia, designed and
implemented the first Transformer models and
has been crucially involved in every aspect of this work. Noam proposed
scaled dot-product attention, multi-head
attention and the parameter-free position representation and became the
other person involved in nearly every
detail. Niki designed, implemented, tuned and evaluated countless model
variants in our original codebase and
tensor2tensor. Llion also experimented with novel model variants, was
responsible for our initial codebase, and
efficient inference and visualizations. Lukasz and Aidan spent countless
long days designing various parts of and
implementing tensor2tensor, replacing our earlier codebase, greatly
improving results and massively accelerating
our research.
†Work performed while at Google Brain.
‡Work performed while at Google Research.
31st Conference on Neural Information Processing Systems (NIPS 2017),
Long Beach, CA, USA.
arXiv:1706.03762v7 [cs.CL] 2 Aug 2023
[1 Introduction]
Recurrent neural networks, long short-term memory [13] and gated
recurrent [7] neural networks
in particular, have been firmly established as state of the art
approaches in sequence modeling and
transduction problems such as language modeling and machine translation
[ 35, 2, 5]. Numerous
efforts have since continued to push the boundaries of recurrent
language models and encoder-decoder
architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions
of the input and output
sequences. Aligning the positions to steps in computation time, they
generate a sequence of hidden
states ht, as a function of the previous hidden state ht−1 and the input
for position t. This inherently
sequential nature precludes parallelization within training examples,
which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples.
Recent work has achieved
significant improvements in computational efficiency through
factorization tricks [21] and conditional
computation [32], while also improving model performance in case of the
latter. The fundamental
constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence
modeling and transduction models in various tasks, allowing modeling of
dependencies without regard to their distance in
the input or output sequences [2, 19]. In all but a few cases [27],
however, such attention mechanisms
are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing
recurrence and instead
relying entirely on an attention mechanism to draw global dependencies
between input and output.
The Transformer allows for significantly more parallelization and can
reach a new state of the art in
translation quality after being trained for as little as twelve hours on
eight P100 GPUs.
[3 Model Architecture]
Most competitive neural sequence transduction models have an
encoder-decoder structure [5, 2, 35].
Here, the encoder maps an input sequence of symbol representations (x1,
..., xn) to a sequence
of continuous representations z = (z1, ..., zn). Given z, the decoder
then generates an output
sequence (y1, ..., ym) of symbols one element at a time. At each step
the model is auto-regressive
[10], consuming the previously generated symbols as additional input
when generating the next.
Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked
self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and
right halves of Figure 1,
respectively.
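As a small illustration of the auto-regressive generation described above, a greedy loop that feeds previously generated symbols back into the decoder might look as follows; `encoder` and `decoder` here are stand-in callables, not the actual Transformer modules.

```python
import torch

def greedy_decode(encoder, decoder, src_tokens, bos_id, eos_id, max_len=50):
    """Auto-regressive generation: the decoder consumes the symbols generated
    so far as additional input when predicting the next one. `encoder` and
    `decoder` are hypothetical callables returning a memory and next-token
    logits respectively."""
    memory = encoder(src_tokens)                       # z = (z1, ..., zn)
    out = [bos_id]
    for _ in range(max_len):
        logits = decoder(torch.tensor([out]), memory)  # (1, len(out), vocab)
        next_token = int(logits[0, -1].argmax())
        out.append(next_token)
        if next_token == eos_id:
            break
    return out

# Toy usage with stand-in modules (not the real Transformer):
vocab = 10
encoder = lambda src: src.float()
decoder = lambda tgt, memory: torch.randn(1, tgt.size(1), vocab)
print(greedy_decode(encoder, decoder, torch.tensor([[1, 2, 3]]), bos_id=1, eos_id=2, max_len=5))
```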
[3.2.3 Applications Of Attention In Our Model]
The Transformer uses multi-head attention in three different ways:
• In "encoder-decoder attention" layers, the queries come from the
previous decoder layer,
and the memory keys and values come from the output of the encoder. This
allows every
position in the decoder to attend over all positions in the input
sequence. This mimics the
typical encoder-decoder attention mechanisms in sequence-to-sequence
models such as
[38, 2, 9].
• The encoder contains self-attention layers. In a self-attention layer
all of the keys, values
and queries come from the same place, in this case, the output of the
previous layer in the
encoder. Each position in the encoder can attend to all positions in the
previous layer of the
encoder.
• Similarly, self-attention layers in the decoder allow each position in
the decoder to attend to
all positions in the decoder up to and including that position. We need
to prevent leftward
information flow in the decoder to preserve the auto-regressive
property. We implement this
inside of scaled dot-product attention by masking out (setting to −∞)
all values in the input
of the softmax which correspond to illegal connections. See Figure 2.
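The masking described in the last point can be sketched directly: build a lower-triangular mask of legal connections and set the remaining attention scores to −∞ before the softmax. This is an illustrative PyTorch snippet, not the tensor2tensor implementation.

```python
import math
import torch
import torch.nn.functional as F

def masked_scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention with a causal mask: positions that would
    attend to the future are set to -inf before the softmax, so each decoder
    position only sees itself and earlier positions (a sketch)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (..., n, n)
    n = scores.size(-1)
    causal = torch.tril(torch.ones(n, n)).bool()          # legal connections
    scores = scores.masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Example: the first position can only attend to itself.
q = k = v = torch.randn(1, 5, 16)
out = masked_scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 5, 16])
```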
Model                              BLEU             Training Cost (FLOPs)
                                   EN-DE    EN-FR   EN-DE         EN-FR
ByteNet [18]                       23.75
Deep-Att + PosUnk [39]                      39.2                  1.0 · 10^20
GNMT + RL [38]                     24.6     39.92   2.3 · 10^19   1.4 · 10^20
ConvS2S [9]                        25.16    40.46   9.6 · 10^18   1.5 · 10^20
MoE [32]                           26.03    40.56   2.0 · 10^19   1.2 · 10^20
Deep-Att + PosUnk Ensemble [39]             40.4                  8.0 · 10^20
GNMT + RL Ensemble [38]            26.30    41.16   1.8 · 10^20   1.1 · 10^21
ConvS2S Ensemble [9]               26.36    41.29   7.7 · 10^19   1.2 · 10^21
Transformer (base model)           27.3     38.1    3.3 · 10^18
Transformer (big)                  28.4     41.8    2.3 · 10^19
Residual Dropout We apply dropout [33] to the output of each sub-layer,
before it is added to the
sub-layer input and normalized. In addition, we apply dropout to the
sums of the embeddings and the
positional encodings in both the encoder and decoder stacks. For the
base model, we use a rate of
P_drop = 0.1.
Label Smoothing During training, we employed label smoothing of value
ϵ_ls = 0.1 [36]. This
hurts perplexity, as the model learns to be more unsure, but improves
accuracy and BLEU score.
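A small sketch of label smoothing as commonly implemented, under one plausible reading of ϵ_ls = 0.1: the correct class receives probability 1 − ϵ and the remaining mass is spread uniformly over the other classes. The exact smoothing scheme in the paper may distribute mass slightly differently; the function below is illustrative only.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, eps=0.1):
    """Cross-entropy against smoothed targets: the true class gets
    probability 1 - eps, and eps is spread uniformly over the other classes."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, eps / (num_classes - 1))
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - eps)
    return -(smooth * log_probs).sum(dim=-1).mean()

# Example: 3 tokens over a 6-word vocabulary.
loss = label_smoothing_loss(torch.randn(3, 6), torch.tensor([0, 2, 5]))
```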
[7 Conclusion]
In this work, we presented the Transformer, the first sequence
transduction model based entirely on
attention, replacing the recurrent layers most commonly used in
encoder-decoder architectures with
multi-headed self-attention.
For translation tasks, the Transformer can be trained significantly
faster than architectures based
on recurrent or convolutional layers. On both WMT 2014 English-to-German
and WMT 2014
English-to-French translation tasks, we achieve a new state of the art.
In the former task our best
model outperforms even all previously reported ensembles.
We are excited about the future of attention-based models and plan to
apply them to other tasks. We
plan to extend the Transformer to problems involving input and output
modalities other than text and
to investigate local, restricted attention mechanisms to efficiently
handle large inputs and outputs
such as images, audio and video. Making generation less sequential is
another research goal of ours.
The code we used to train and evaluate our models is available at
https://github.com/
tensorflow/tensor2tensor.
Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws
for their fruitful
comments, corrections and inspiration.
- source_sentence: >-
Are there any frameworks that adapt to different types of image
segmentation tasks?
sentences:
- >-
[ABSTRACT]
Academic research in recommender systems has been greatly focusing on
the accuracy-related measures of recommendations. Even
when non-accuracy measures such as popularity bias, diversity,
and novelty are studied, it is often solely from the users’ perspective.
However, many real-world recommenders are often multistakeholder
environments in which the needs and interests of several stakeholders
should be addressed in the recommendation process. In this paper, we
focus on the popularity bias problem which
is a well-known property of many recommendation algorithms
where few popular items are over-recommended while the majority
of other items do not get proportional attention and address its
impact on different stakeholders. Using several recommendation
algorithms and two publicly available datasets in music and movie
domains, we empirically show the inherent popularity bias of the
algorithms and how this bias impacts different stakeholders such
as users and suppliers of the items. We also propose metrics to
measure the exposure bias of recommendation algorithms from the
perspective of different stakeholders.
KEYWORDS
Multi-sided platforms, Recommender systems, Popularity bias,
Multistakeholder recommendation
[1 Introduction]
Popularity bias is a well-known phenomenon in recommender systems:
popular items are recommended even more frequently than
their popularity would warrant, amplifying the long-tail effect already
present in many recommendation domains. Prior research
has examined the impact of this bias on some properties of the
recommenders such as aggregate diversity (aka catalog coverage)
[4, 22]. One of the consequences of the popularity bias is disfavoring
less popular items where the recommendations are not fair in
terms of the amount of exposure they give to different items with
varying degree of popularity: an exposure bias. However, as we
discuss in [1], many recommender systems are multi-stakeholder
environments in which the needs and interests of multiple stakeholders
should be taken into account in the implementation and
evaluation of such systems.
In many multi-stakeholder recommenders as described in [1]
two main stakeholders (or what often is being referred to as sides
in multi-sided platforms [11] ) can be identified: consumers (aka
users) and suppliers. For instance, in a music platform such as
Spotify, on one side there are users who get recommendations for
songs in which they are interested and, on the other side, there are
artists whose songs are being recommended to different users. The
popularity bias can be investigated from both sides’ perspective.
Regarding the users, not everyone has the same level of interest
in popular items. In the music domain as an example, some users
might be interested in internationally popular artists such as Drake,
Beyoncé, or Ed Sheeran and some might be more interested in
artists from their own culture that might not necessarily have the
same popularity as the aforementioned artists (such as the Iranian
musician Kayhan Kalhor) or generally they prefer certain type
of music that might not be popular among the majority of other
users (such as country music). With that being said, we expect the
personalization to handle this difference in taste but as we will see
in section 4.1 that is certainly not the case.
The suppliers also do not have the same level of popularity.
In many recommendation domains including movies, music, or
even house sharing, few suppliers have a large audience while
the majority of others may not be as popular though they still
might have their fair share of audience. Now the question is, do
recommender systems let different suppliers with varying degree
of popularity reach their desired audience? Again, the short answer
is No as we will see more details in section 4.2.
Investigating the impact of recommendation algorithms on the
exposure bias on both users and suppliers is the focus of this paper.
We study several recommendation models in terms of their inherent
popularity bias and propose metrics that can measure such impact.
[6 Conclusion]
Recommender systems are multi-stakeholder environments; in addition to
the users, some other stakeholders such as the supplier
of the items also benefit from the recommendation of their items
and gaining a larger audience. The algorithmic popularity bias can
negatively impact both users and suppliers on a recommender system
platform. In this paper, we demonstrated the severity of the
popularity bias impact on different sides of a recommender system
using several recommendation algorithms on two datasets. We also
proposed metrics to quantify the exposure bias from the perspective of
both the users and suppliers. Our experiments showed that
when the recommendations are calibrated for the users in terms of
popularity (lower UPD), it will also benefit the suppliers of the
recommendations by giving them proportional exposure (lower SPD).
We believe, it is extremely crucial for the recommender systems
researchers to see the implications of real-world recommenders
where a single-stakeholder focus might not address all the complexities.
- "[Preamble]\nImproving language models by retrieving\nfrom trillions of tokens\nSebastian Borgeaudy, Arthur Menschy, Jordan Hoffmanny, Trevor Cai, Eliza Rutherford, Katie Millican,\nGeorge van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas,\nAurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones,\nAlbin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero,\nKaren Simonyan, Jack W. Raez, Erich Elsenzand Laurent Sifrey,z\nAll authors from DeepMind,yEqual contributions,zEqual senior authorship\nWe enhance auto-regressive language models by conditioning on document chunks retrieved from a\nlarge corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our\nRetrieval-Enhanced Transformer (R/e.sc/t.sc/r.sc/o.sc) obtains comparable performance to GPT-3 and Jurassic-1\non the Pile, despite using 25\x02fewer parameters. After fine-tuning,R/e.sc/t.sc/r.sc/o.scperformance translates to\ndownstream knowledge-intensive tasks such as question answering.R/e.sc/t.sc/r.sc/o.sccombines a frozenB/e.sc/r.sc/t.sc\nretriever, adifferentiableencoderandachunkedcross-attentionmechanismtopredicttokensbasedon\nan order of magnitude more data than what is typically consumed during training. We typically train\nR/e.sc/t.sc/r.sc/o.scfrom scratch, yet can also rapidlyR/e.sc/t.sc/r.sc/o.scfit pre-trained transformers with retrieval and still\nachieve good performance. Our work opens up new avenues for improving language models through\nexplicit memory at unprecedented scale.\n\n[1. Introduction]\nLanguage modelling (LM) is an unsupervised task that consists of modelling the probability of text,\nusually by factorising it into conditional next-token predictions𝑝¹𝑥1\x94\x93\x93\x93\x94𝑥 𝑛º= Î\n𝑖 𝑝¹𝑥𝑖j𝑥\x9D𝑖º. Neural\nnetworks have proven to be powerful language models, first in the form of recurrent architectures\n(Graves, 2013; Jozefowicz et al., 2016; Mikolov et al., 2010) and more recently in the form of\nTransformers (Vaswani et al., 2017), that use attention to contextualise the past. Large performance\nimprovementshavecomefromincreasingtheamountofdata, trainingcompute, ormodelparameters.\nTransformers have been scaled from100 million parameter models in seminal work to over hundred\nbillion parameters (Brown et al., 2020; Radford et al., 2019) in the last two years which has led to\nmodels that do very well on a wide array of tasks in a zero or few-shot formulation. Increasing model\nsize predictably improves performance on a wide range of downstream tasks (Kaplan et al., 2020).\nThe benefits of increasing the number of parameters come from two factors: additional computations\nat training and inference time, and increased memorization of the training data.\nIn this work, we endeavor to decouple these, by exploring efficient means of augmenting language\nmodels with a massive-scale memory without significantly increasing computations. Specifically, we\nsuggest retrieval from a large text database as a complementary path to scaling language models.\nInstead of increasing the size of the model and training on more data, we equip models with the\nability to directly access a large database to perform predictions—a semi-parametric approach. At\na high level, our Retrieval Transformer (R/e.sc/t.sc/r.sc/o.sc) model splits the input sequence into chunks and\nretrieves text similar to the previous chunk to improve the predictions in the current chunk. 
Existing\nretrieval for language modelling work only considers small transformers (100 millions parameters)\nand databases of limited size (up to billions of tokens) (Guu et al., 2020; Khandelwal et al., 2020;\nLewisetal.,2020;Yogatamaetal.,2021). Toourknowledge, ourworkisthefirsttoshowthebenefits\nof scaling the retrieval database to trillions of tokens for large parametric language models. Our main\nCorresponding authors: {sborgeaud|amensch|jordanhoffmann|sifre}@deepmind.com\narXiv:2112.04426v3 [cs.CL] 7 Feb 2022\nImproving language models by retrieving from trillions of tokens\n200 400 800 1600 7500\nNumber of Non-Embedding Params (M)\n0.7\n0.8\n0.9\n1.0C4 Eval bits-per-byte\n172M 425M 1.5B 7.5B Baseline RETRO [OFF] RETRO [ON]\n0 1 10 100 1000 10000\nRetrieval dataset (B Tokens)\n0.7\n0.8\n0.9\n1.0\n0 1 3 5 10 30 50 100\nNumber of neighbors\n0.7\n0.8\n0.9\n1.0\nFigure 1jScaling ofR/e.sc/t.sc/r.sc/o.sc. The performance gain of our retrieval models remains constant with\nmodel scale (left), and is comparable to multiplying the parameteric model size by\x1810\x02. The gain\nincreases with the size of the retrieval database (middle) and the number of retrieved neighbours\n(right) on the C4 validation set, when using up to 40 neighbours. Past this, performance begins to\ndegrade, perhaps due to the reduced quality. At evaluationR/e.sc/t.sc/r.sc/o.sccan be used without retrieval\ndata (R/e.sc/t.sc/r.sc/o.sc[OFF]), bringing limited performance degradation compared to baseline transformers.\ncontributions are the following.\n• We introduceR/e.sc/t.sc/r.sc/o.sc, a retrieval-enhanced autoregressive language model (§2.2). We use a\nchunked cross-attention module to incorporate the retrieved text (§2.4), with time complexity\nlinear in the amount of retrieved data. We show that retrieving based on a pre-trained frozen\nB/e.sc/r.sc/t.scmodel (§2.3) works at scale, removing the need for training and updating a retriever\nnetwork.\n• We show that our method scales well with model size and database size (Fig. 1):R/e.sc/t.sc/r.sc/o.sc\nprovides a constant gain for models ranging from 150M to 7B parameters, andR/e.sc/t.sc/r.sc/o.sccan be\nimproved at evaluation time by increasing the database size and the number of retrieved neighbours. Our largest model obtains state-of-the-art results on a range of downstream evaluation\ndatasets including Wikitext103 (Merity et al., 2017) and the Pile (Gao et al., 2020) (§4). We\nshow thatR/e.sc/t.sc/r.sc/o.sccan be fine-tuned to achieve competitive performance on downstream tasks\nsuch as question answering (§4.3).\n• We propose an evaluation aware of proximity of test documents with the training set (§2.6),\naddressing the problem of test set leakage (Lee et al., 2021). This is relevant for all language\nmodels,andespeciallyforretrieval-enhancedmodelssincetheyhavedirectaccesstothetraining\ndataset during evaluation. Using this methodology, we show that the performance ofR/e.sc/t.sc/r.sc/o.sc\ncomes from both explicit neighbour copying and general knowledge extraction (§4.4).\n\n[2. Method]\nWedesignourretrieval-enhancedarchitecturetobecapableofretrievingfromadatabasewithtrillions\nof tokens. For this purpose, we retrieve at the level of contiguous tokenchunks instead of individual\ntokens which reduces storage and computation requirements by a large linear factor. Our method first\nconstructs a key-value database, where values store raw chunks of text tokens and keys are frozen\nB/e.sc/r.sc/t.scembedddings (Devlin et al., 2019). 
We use a frozen model to avoid having to periodically\nre-compute embeddings over the entire database during training. Each training sequence is then split\ninto chunks, which are augmented with their𝑘-nearest neighbour retrieved from the database. An\nencoder-decoder architecture integrates retrieval chunks into the model’s predictions. We summarize\nthe R/e.sc/t.sc/r.sc/o.scarchitecture in Fig. 2, and detail it in this section. We end the section by introducing\n\nImproving language models by retrieving from trillions of tokens\nCCA FFW\nTransformer \nEncoderRetrieval\ndataset\nFrozen kNN Retriever\nK V\nRETRO block (x L) \nNeighbours\nInput \ntokens\nChunked cross-attention (CCA)\nBERT\nBERT\nCondition\nAttending chunks\nEncoded neighbours\nCA\nCA\nATTN QEMB READ\nAttend\nEncoded neighbours\nC1\nC2\nC3\nH1\nH2\nH3\nH\nH1\n+\nH2\n+\nE1\n E2\nE1\nE2\nCA(H1\n+, E1)\nCA(H2\n+, E2)\nCCA(H, E)\nX\nFigure 2jR/e.sc/t.sc/r.sc/o.scarchitecture. Left: simplified version where a sequence of length𝑛= 12 is split\ninto𝑙 = 3 chunksofsize 𝑚 = 4. Foreachchunk, weretrieve 𝑘 = 2 neighboursof 𝑟 = 5 tokenseach. The\nretrieval pathway is shown on top.Right: Details of the interactions in theC/c.sc/a.scoperator. Causality is\nmaintained as neighbours of the first chunk only affect the last token of the first chunk and tokens\nfrom the second chunk.\na new methodology to evaluate language models when an evaluation set is partially present in the\ntraining set.\n\n[2. The Model Receives The Corresponding]\nvalues R/e.sc/t.sc¹𝐶º, ¹»𝑁1\x94𝐹1¼\x94\x93\x93\x93\x94 »𝑁𝑘\x94𝐹𝑘¼º. Both neighbour chunks and their continuations provide\nmeaningful improvements, as illustrated in our ablation study (Appendix D). We use a length64 for\nboth 𝑁𝑗 and 𝐹𝑗, thusR/e.sc/t.sc¹𝐶ºhas a shape of𝑘\x02𝑟 with 𝑟 = 128. To avoid retrieving the chunk𝐶𝑢¸1\nin the retrieval setR/e.sc/t.sc¹𝐶𝑢º, which would break causality during training, we filter out neighbours\noriginating from the same document as the training sequence𝑋.\nFor a database of𝑇 elements, we can query the approximate nearest neighbours inO¹log𝑇ºtime.\nWe use the SCaNN library (Guo et al., 2020) to achieve this. This means that we can query our\n2 trillion token database in10 ms whilst evaluating or sampling from the model; this expense is\namortized over a chunk length. Performing retrieval on-the-fly is too slow to keep up with the training\ncalculations—we leverage the frozen aspect of the embedding operatorB/e.sc/r.sc/t.scto precompute all\napproximate nearest neighbours and save the results as part of the data. In Fig. 9 in the Appendix, we\nshow results where we only retrieve neighbours within Wikipedia. We find that neighbours tend to\ncome from 2-3 links away from a given article whereas random articles are more than 5 links apart.\nTable1 jMassiveText. Thelastcolumnindicatesthesamplingweightduringtraining. Themultilingual\nsubsets include documents in 10 languages. The full breakdown is given in §A.1.\nSource Token count (M) Documents (M) Multilingual Sampling frequency\nWeb 977,563 1,208 Yes 55%\nBooks 3,423,740 20 No 25%\nNews 236,918 398 No 10%\nWikipedia 13,288 23 Yes 5%\nGitHub 374,952 143 No 5%\n\nImproving language models by retrieving from trillions of tokens\n2.4. R/e.sc/t.sc/r.sc/o.scmodel architecture\nOur model relies on an encoder-decoder transformer architecture, integrating the retrieved data\nthrough a cross-attention mechanism as introduced in Vaswani et al. (2017). 
First, the retrieved\ntokens R/e.sc/t.sc¹𝐶ºare fed into an encoder Transformer, which computes the encoded neighbours set𝐸.\nDenoting the intermediate activations by𝐻, our transformer decoder then interleavesR/e.sc/t.sc/r.sc/o.sc-blocks\nR/e.sc/t.sc/r.sc/o.sc¹𝐻\x94𝐸 ºand standard Transformer blocksLM ¹𝐻º(the hyperparameter𝑃 \x12»1\x94𝐿¼determines at\nwhich layers we use aR/e.sc/t.sc/r.sc/o.sc-block). These blocks are built from three different residual operators\nwith signatureℝ𝑛\x02𝑑 !ℝ𝑛\x02𝑑: a fully-connected layerF/f.sc/w.sc, the standard sequence-level self-attention\nlayer A/t.sc/t.sc/n.sc, and a chunked cross-attention layerC/c.sc/a.sc¹\x01\x94𝐸ºthat incorporates information from the\nretrieval encoder:\nR/e.sc/t.sc/r.sc/o.sc¹𝐻\x94𝐸 º, F/f.sc/w.sc¹C/c.sc/a.sc¹A/t.sc/t.sc/n.sc¹𝐻º\x94𝐸ºº\x94 and L/m.sc¹𝐻º, F/f.sc/w.sc¹A/t.sc/t.sc/n.sc¹𝐻ºº (2)\nSince F/f.sc/w.sc, A/t.sc/t.sc/n.scand C/c.sc/a.scare all autoregressive operators whose output at position𝑖 only\ndepends on ¹ℎ𝑗º𝑗6𝑖, any succession ofR/e.sc/t.sc/r.sc/o.scand /l.sc/m.sclayers, followed by a token classification\nhead defines an autoregressive log-likelihood(1). An overview of the model architecture is given in\nAlgorithm 1 and in Fig. 2. We next describe the retrieval encoder and the chunked cross-attention\nlayer in more detail, and explain how to sample fromR/e.sc/t.sc/r.sc/o.sc.\nEncodingretrievalneighbours. Foreachchunk 𝐶𝑢,the 𝑘retrievalneighbours R/e.sc/t.sc¹𝐶𝑢ºarefedinto\na bi-directional transformerE/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc, yielding the outputs𝐸𝑗\n𝑢 , E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º𝑗\x94𝐻𝑢º2 ℝ𝑟\x02𝑑0\n,\nwhere 𝑗 2 »1\x94𝑘¼indexes each neighbour. The retrieval encoder is a non-causal transformer. It\nis conditioned on𝐻𝑢, the activations of chunk𝐶𝑢, through cross-attention layers; this allows the\nrepresentations of the retrieval encoder to be modulated by the retrieving chunk in a differentiable\nway. More precisely, the encoding of the𝑗th neighbour of the𝑢th chunk, R/e.sc/t.sc¹𝐶𝑢º𝑗, depends on the\nattended activation 𝐻𝑢 , ¹ℎ¹𝑢\01º𝑚¸𝑖º𝑖2»1\x94𝑚¼ 2ℝ𝑚\x02𝑑 of chunk𝐶𝑢 at layermin¹𝑃º. All neighbours for\nall chunks are encoded in parallel, yielding a full encoded set𝐸 , ¹𝐸𝑗\n𝑢º𝑢2»1\x94𝑙¼\x94𝑗2»1\x94𝑘¼ 2ℝ𝑙\x02𝑘\x02𝑟\x02𝑑0\n. We\ndenote 𝐸𝑢 2ℝ𝑘\x02𝑟\x02𝑑0\nas the encoded neighbours for chunk𝑢 2»1\x94𝑙¼.\nChunked cross-attention. To perform theC/c.sc/a.scoperation, we first split a given intermediate activation 𝐻 2ℝ𝑛\x02𝑑 into 𝑙\01 attending chunks\n\x10\n𝐻¸\n𝑢 , ¹ℎ𝑢𝑚¸𝑖\01º𝑖2»1\x94𝑚¼ 2ℝ𝑚\x02𝑑\n\x11\n𝑢2»1\x94𝑙\01¼\n, as depicted on the\nright of Fig. 2.𝐻¸\n𝑢 holds the intermediary embeddings of the last token in chunk𝐶𝑢 and of the first\n𝑚\01 tokens in𝐶𝑢¸1 2. We compute the cross-attention between𝐻¸\n𝑢 and 𝐸𝑢—the encoded retrieval\nset obtained from chunk𝐶𝑢. Attention is computed across time and across neighbours simultaneously,\nas we merge the neighbour and time dimensions of𝐸𝑢 before applying cross-attention. Since there\nis a notion of alignment between data chunks and retrieval neighbours, we use relative positional\nencodings as described in §B.1.2.\nWe concatenate the𝑙\01 outputs of the per-chunk cross-attentions (each of shape𝑚\x02𝑑) across\ntime, and properly pad the result; we thus form the output activationC/c.sc/a.sc¹𝐻\x94𝐸 º2 ℝ𝑛\x02𝑑. Formally,\nfor each chunk𝐶𝑢 and for each token𝑖 2»1\x94𝑚¼we set\nC/c.sc/a.sc¹𝐻\x94𝐸 º𝑢𝑚¸𝑖\01 , C/a.sc¹ℎ𝑢𝑚¸𝑖\01\x94𝐸𝑢º\x94 (3)\n2The last token of chunk𝐶𝑢 is the first to be able to access the retrieved content𝐸𝑢 while maintaining autoregressivity\nin (1). 
Hence, there is a one token overlap between chunk𝐶𝑢 =\n\x10\n𝑥¹𝑢\01º𝑚¸𝑖\n\x11\n𝑖2»1\x94𝑚¼\nand the corresponding attending chunk\n𝐶¸\n𝑢 , ¹𝑥𝑢𝑚¸𝑖\01º𝑖2»1\x94𝑚¼.\n\nImproving language models by retrieving from trillions of tokens\nAlgorithm 1: Overview ofR/e.sc/t.sc/r.sc/o.scmodel architecture.\nHyperparam: 𝑃 and 𝑃enc, indices of layers with cross-attention in the decoder and encoder\nrespectively\nHyperparam: 𝐿and 𝐿enc, number of decoder layers and number of encoder layers.\nInput: 𝑋 2𝕍𝑛: sequence of tokens.¹R/e.sc/t.sc¹𝐶𝑢ºº16𝑢6𝑙: the retrieved neighbours\nOutput: 𝑂 2ℝ𝑛\x02j𝕍j: the output logits\ndef E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º16𝑢6𝑙\x94𝐻 º:\n¹𝐻𝑢º𝑢2»1\x94𝑙¼ S/p.sc/l.sc/i.sc/t.sc¹𝐻º\nfor 𝑗 2»1\x94𝑘¼\x94𝑢 2»1\x94𝑙¼do // Encoder shared across neighbours and chunks\n𝐸𝑗\n𝑢 = E/m.sc/b.scenc¹R/e.sc/t.sc¹𝐶𝑢º𝑗º // May be shared with the decoder E M B\nfor 𝑝02»1\x94𝐿enc¼do\n𝐸𝑗\n𝑢 A/t.sc/t.sc/n.scenc¹𝐸𝑗\n𝑢º // Bi-directional attention\nif 𝑝02𝑃enc then\n𝐸𝑗\n𝑢 C/a.scenc¹𝐸𝑗\n𝑢\x94𝐻𝑢º\n𝐸𝑗\n𝑢 F/f.sc/w.scenc¹𝐸𝑗\n𝑢º\nreturn 𝐸\n𝐻 E/m.sc/b.sc¹𝑋º\nfor 𝑝 2»1\x94𝐿¼do\n𝐻 A/t.sc/t.sc/n.sc¹𝐻º // Causal attention\nif 𝑝= min¹𝑃ºthen\n// The neighbour E N C O D E Ris conditioned with the decoder activations of\nthe last layer before the first cross-attention\n𝐸 = E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢º16𝑢6𝑙\x94𝐻 º\nif 𝑝 2𝑃 then\n𝐻 C/c.sc/a.sc¹𝐻\x94𝐸 º\n𝐻 F/f.sc/w.sc¹𝐻º\n𝑂 R/e.sc/a.sc/d.sc¹𝐻º\nwhere C/a.scis the cross-attention residual operator over time-concatenated encoded neighbours. We\nrecall that this operator is defined in its simplest version by three parameter matrices𝐾 2ℝ𝑑\x02𝑐\x94𝑄 2\nℝ𝑑\x02𝑐 and𝑉 2ℝ𝑑\x02𝑑. For allℎ 2ℝ𝑑 and𝑌 2ℝ𝑇\x02𝑑, we define\nC/a.sc¹ℎ\x94𝑌º, softmax¹𝑌𝐾𝑄𝑇ℎº𝑌𝑉\x94 (4)\nwhere the softmax is performed on the second dimension and all products are matrix products. We\nuse multi-head cross-attention, and add positional encodings to the softmax(see §B.1.2).\nThe first𝑚\01 tokens cannot attend to any neighbour of a previous chunk; at these positions, we\ndefine C/c.sc/a.scas the identity, settingC/c.sc/a.sc¹𝐻\x94𝐸 º𝑗 , ℎ𝑗 for all tokens𝑗 2»1\x94𝑚 \01¼. Finally, the last token\nℎ𝑙𝑚 attends to the last retrieval set𝐸𝑙 and we setℎ𝑙𝑚 , C/a.sc¹ℎ𝑙𝑚\x94𝐸𝑙º(not shown in Fig. 2). Listing 1\ncontains a simplified implementation ofC/c.sc/a.sc. Note that chunked cross-attention is autoregressive:\nthe output ofC/c.sc/a.scat position𝑖depends on the sequence from tokens from0 to 𝑖that is input toC/c.sc/a.sc.\nWith R/e.sc/t.sc/r.sc/o.scmodels, even though eachC/c.sc/a.sccross-attention attends only to the neighbours of\nthe preceding chunkR/e.sc/t.sc¹𝐶𝑢\01º, the dependencies over previous neighbours are propagated via the\nself-attentionoperations. Theactivationsofthe 𝑖th tokeninthe 𝑢th chunkthereforepotentiallydepend\nupon the set ofallprevious neighboursR/e.sc/t.sc¹𝐶𝑢0º𝑢0\x9D𝑢, without incurring the quadratic cost of cross\nattending to that set.\n\nImproving language models by retrieving from trillions of tokens\nSampling. Whensampling,attheendofachunk 𝐶𝑢,weuseSCaNNtoretrieveneighbours R/e.sc/t.sc¹𝐶𝑢º,\nbased on the embeddingB/e.sc/r.sc/t.sc¹𝐶𝑢º. The encoded neighbours𝐸𝑢 = E/n.sc/c.sc/o.sc/d.sc/e.sc/r.sc¹R/e.sc/t.sc¹𝐶𝑢ººare then\nused to condition the generation of the next chunk𝐶𝑢¸1, which we do incrementally: overall the\ncost of sampling is thus quadratic in the size of the sampled sequence, as when sampling from\nregular Transformers; the added cost of retrieval is linear in the number of chunks𝑙, and is negligible\ncompared to the token sampling cost in practice.\n\n[2.5. 
Baseline Transformer Architecture]\nWe use a transformer (Vaswani et al., 2017) similar to the one described in (Radford et al., 2019),\nwith some minimal changes: we replace LayerNorm with RMSNorm (Zhang and Sennrich, 2019) and\nuse relative position encodings (Dai et al., 2019). As baselines, we train retrieval-free transformers\nwith 132M, 368M, 1.3B and 7.0B parameters (embedding matrices are excluded from parameter\ncounts). The hyperparameters we used are detailed in Table 2. All retrieval models use the same\nsize encoder for the retrieval data, with𝑑0= 896 and 2 layers, which roughly adds19𝑀 parameters.\nThe encoder uses relative positional encodings. The retrieval models contain oneR/e.sc/t.sc/r.sc/o.sc-block every\n3 blocks, starting from layer 6. For our smallest model,C/c.sc/a.scis applied in layers 6, 9 and 12 of the\nmain pathway and also once for query conditioning in the encoder, which adds an additional12𝑀\nparameters. The relative number of extra parameters reduces as we increase the baseline model size.\nAll models are implemented using JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020).\n\n[5. Conclusion]\nWe present Retrieval-Enhanced Transformers (R/e.sc/t.sc/r.sc/o.sc), a method for modelling arbitrary text sequences whilst retrieving from databases with trillions of tokens—scaling the data available to models\nby an order of magnitude compared to what is typically consumed during training.R/e.sc/t.sc/r.sc/o.scmodels\n\nImproving language models by retrieving from trillions of tokens\ngains do not diminish for models with up to at least 7B parameters, and correspond to non-retrieval\nmodels with 10\x02more parameters on certain datasets. On Wikitext103 and the Pile,R/e.sc/t.sc/r.sc/o.scoutperforms previous models trained on large scale datasets. We also show thatR/e.sc/t.sc/r.sc/o.scis competitive on\nretrieval-intensive downstream tasks such as question answering.\nR/e.sc/t.sc/r.sc/o.scmodels are flexible and can be used without retrieval at evaluation and still achieve\ncomparable performance to baseline models. Conversely, baseline models can be rapidly fine-tuned\ninto R/e.sc/t.sc/r.sc/o.scmodelstoobtainnearlythesameperformanceasiftrainedfromscratch. Carefulanalysis\nshows that only a modest fraction of the gains obtained byR/e.sc/t.sc/r.sc/o.scare due to test set leakage. In\ngeneral, we caution for such leakage in large-scale language datasets and suggest further work in\nbetter understanding the role of test set leakage in the performance of large-scale language models.\nOverall, our work demonstrates at an unprecedented scale that semi-parametric approaches can\nprovide an orthogonal, more efficient approach than raw parameter scaling as we seek to build more\npowerful language models.\nAcknowledgements\nWe would like to thank Nikolai Grigorev, Marc’aurelio Ranzato, Cyprien de Masson d’Autume, Po-Sen\nHuang, JohannesWelbl, LisaAnneHendricks, EthanPerez, JeffStanway, EricNoland, GregoryWayne,\nJohn Jumper, Julian Schrittwieser, Lorrayne Bennett, Devang Agrawal, Dani Yogatama, Susannah\nYoung, Nando de Freitas, Demis Hassabis, and Koray Kavukcuoglu for their help, advice and reviews.\nAdditionally, we would like to thank Zonglin Li, David Simcha, and the ScaNN developers for their\nhelp.\n\nImproving language models by retrieving from trillions of tokens\nTable 6jSample - Beavers are interesting animals. 
TheR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]sample quickly diverges to other\nanimalswhilethe R/e.sc/t.sc/r.sc/o.sc[O/n.sc]sampletendstostayfocusedonthebeavertopicduetoneighbourconditioning.\nPrompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]Prompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/n.sc]»𝑁1𝑢\x94𝐹1𝑢¼colored by LCP with𝐶𝑢¸1 »𝑁2𝑢\x94𝐹2𝑢¼colored by LCP with𝐶𝑢¸1\ncolored by LCP withR/e.sc/t.sc¹𝐶𝑢\01º\nLCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5\nBeavers are interesting animals that Beavers are interesting animals that .Beaversbuildtheir lodges in pon naw them into smaller sectionsand d\nlive near rivers. They build live near rivers. They build ds they have createdin wooded areas rag them intothe water.Engineers\n.Like many thingsin nature, there ofthe PondBeaversare interesting\nis a connection between creaturesi animals because they changethe hab\nn the wild.Beaverponds causetree itat in which theylive.Beaversdo\ns todrown, butthe dead trees attra this by blockingup streams to creat\nct the great blue heron, which often e ponds. Then theybuildtheir homes\nreturn year after year. Over time, , calledlodges, inthese ponds. Bea\na beaverpond can attract more than vers’ bodies make them well-suited f\n50 nestsin a colony, calleda rooke or underwater building Special muscl\nry.An example of this canbe found es close offtheir noses, ears, and\nin the large pond off Bradford Road throats to keep the water out. Beave\nat Carter Fields nearthe Boxford l rs’ broadtails act like rudders for\nine.Chris Leahy, an expert withth steering. Their two very large, ora\ne Massachusetts Audubon Society who nge frontteeth are usedto gnaw dow\nwrote n trees. They begin buildingtheir d\nam\ndams to create ponds. Frogs are am their houses called beaverdams in , then they mean that youare veryb ar-like tail, andtwo protrudingtee\nphibians, so they can live in both l the riverbeds. They alsoliveon lan usy. Beaversswim easily in streams, th that are strongenoughto gnaw do\nand and water. They have great camou d.Beaversuse their strong teeth an picking uprocks and sticks to buil wn trees. Thebeaveruses trees, bra\nflage to hide from predators. The G d strong jawsto cut downtrees and d their dams. They gnaw at trees wit nches, andmudto builddams across\nolden Retriever, or Golden, is a ver branches to buildtheir homes. They h their big frontteeth to cut them riversand streams. These dams creat\ny big dog. They are very strong, and also usetheir teeth andjawsto che down. Then they useparts of the tre e a deep poolofwater in whichthe\nhave a thick coat to help them live w up food. Beaversuse their big, fl es tobuildtheir houses.Beavers ar beaverbuilds its home. A beaverhom\nin at tails toswim. They use e clever builders. They know exactly e is calleda lodge. A babybeavero\nwhat they need to buildtheir beave r “kit” remains inthe family lodge\nr dams. They use mud fromthe stream until the age oftwo.Beaverfur, kn\nto maketheir dams stay together. T own as pelt, was once highly popular\nhey use their tails to pat downthe asa trim for hatsand coats. How m\nmud.Beaversput a snug roomat the ight the popularity of beaverfur co\ntop of theirdams fortheir babies. ntributedto thecolonizationofNew\nThey store their food underwater. Be York?wwwWw\naverseat the bark fromthe\ncold areas. A kangaroo is an anima their sharp teeth to gnaw atrocks , thenthey mean that you are very b w attrees with their big frontteet\nl that is very large. It can jump ve and cut trees. Theyuse their flat t usy. 
Beaversswim easily in streams, h to cutthem down. Then they use pa\nry high. They can run very fast. A ails to steer whenthey areswimming picking up rocks and sticks to buil rts of the trees tobuildtheir hous\nmeerkat is a small animal that lives .In the winter, beavershaveto bui d their dams. They gnaw attrees wit es.Beavers are clever builders. The\nin Africa. They live in colonies of ld a house calleda beaverlodge. Th h their big frontteeth to cutthem y know exactly whatthey need to bui\nup to a hundred of them. They can c ey build ahouse that is very strong down. Then they use parts of the tre ld their beaver dams. They use mud f\nlimb trees very easily. . The wallsare made oftwigs. The r es tobuildtheir houses.Beavers ar rom the stream to maketheir dams st\noofis made e clever builders. They know exactly ay together. They use their tails to\nwhat they need to buildtheir beave pat downthe mud.Beavers put a snu\nr dams. They use mud fromthe stream g room atthe top oftheir dams for\nto maketheir dams stay together. T their babies. They store their food\nhey use their tails to pat downthe underwater. Beavers eatthe bark fro\nmud.Beavers put a snug room atthe m the treesthat they cutdown!1. W\ntop oftheir dams fortheir babies. hat isthe main ideaof thefirst pa\nThey store their food underwater. Be ragraph?.2. What isthe main ideao\navers eatthe bark fromthe f thesecond paragraph?\nA mouse is a small mammal that lives ofbranches and other treeparts. T\non land. It is a very good climber hey also use their strong jawsto cu\nand it can run very fast. Penguins t trees. They bring them to theirho\nare birds that live on Antarctica. T use. They alsouse their sharp teeth\nhey have a thick coat to keep them w to chew up thetree parts. They use\narm. Rabbits are small animals that their flat tails to swim to thetop\nlive in the ground. They oftheir house. Then they use their\nteeth andjawsto chew up thetree\n\nImproving language models by retrieving from trillions of tokens\nTable 7jSample - Hamlet, Act 1, Scene 1.The R/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]sample has correct syntax but is hallucinated,\nand ends with repetition of one character (FRANCISCO Approach me not). TheR/e.sc/t.sc/r.sc/o.sc[O/n.sc]sample is the\ncorrect continuation of the original text, and is robust to formatting differences between our prompt and the\nretrieved data.\nPrompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/f.sc/f.sc]Prompt and sample ofR/e.sc/t.sc/r.sc/o.sc[O/n.sc]»𝑁1𝑢\x94𝐹1𝑢¼colored by LCP with𝐶𝑢¸1 »𝑁2𝑢\x94𝐹2𝑢¼colored by LCP with𝐶𝑢¸1\ncolored by LCP withR/e.sc/t.sc¹𝐶𝑢\01º\nLCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5 LCP = 0, 1, 2, 3,4,> 5\nACT I SCENE I. Elsinore. A pla ACT I SCENE I. Elsinore. A pla ACTI SCENE I.Elsinore. A pla ><TEXT>ACTI SCENE I.Elsin\ntform before the castle. FRANC tform before the castle. FRANC tform beforethe castle. (FRAN ore. A platform beforethe cas\nISCO at his post. Enter to him ISCO at his post. Enter to him CISCO athis post. Enter to hi tle. FRANCISCO at his post. E\nBERNARDO BERNARDO Who’s there BERNARDO BERNARDO Who’s there m BERNARDO) BERNARDO Who’s the nter to him BERNARDO BERNARDO\n? FRANCISCO Nay, answer me: st ? FRANCISCO Nay, answer me: st re? FRANCISCO Nay, answer me: Who’s there? FRANCISCO Nay, an\nand, and unfold yourself. BERN and, and unfold yourself. BERN stand, and unfold yourself. BE swer me: stand, and unfold you\nARDO ARDO RNARDO Long livethe king! FRA rself. BERNARDO Long livethe\nNCISCO Bernardo? BERNARDO He. king! FRANCISCO Bernardo? 
BERN\nFRANCISCO You come most carefu ARDO He. FRANCISCO You come mo\nlly upon your hour. BERNARDO ’ st carefully upon your hour. B\nTis now struck twelve; get the ERNARDO ’Tis now struck twelve\ne to bed, Francisco. FRANCISCO ; get thee to bed, Francisco.\nFor this relief much thanks: FRANCISCO For this relief much\n’tis bitter cold, And I am sic thanks: ’tis bitter cold, And\nk at heart. BERNARDO Have you I am sick at heart.\nWho calls ? FRANCISCO I am th Long livethe king! FRANCISCO Long live the king! FRANCISCO live the king! FRANCISCO Bern\ne lord here; I, Francisco, tha Bernardo? BERNARDO He. FRANCI Bernardo? BERNARDO He. FRANCI ardo? BERNARDO He. FRANCISCO Y\nt am sick of grief. [ Aside. B SCO You come most carefully up SCO You come most carefully up ou come most carefully upon yo\nERNARDO The king ! FRANCISCO I on your hour. BERNARDO ’Tis no on your hour. BERNARDO ’Tis no ur hour. BERNARDO ’Tis now str\nam sick of that also. BERNARD w struck twelve; get thee to b w struck twelve; get thee to b uck twelve: get thee to bed, F\nO My lord ? FRANCISCO Do not a ed, Francisco. FRANCISCO For t ed, Francisco. FRANCISCO For t rancisco. FRANCISCO For this r\npproach me. BERNARDO his relief much thanks: ’tis b his relief much thanks: ’tis b elief much thanks: ’tis bitter\nitter cold, And I am sick at h itter cold, And I am sick at h cold, And I am sick at heart.\neart. B eart.</TEXT></DOC><DOC><DO BERNARDO Haveyou had quiet g\nCNO>romeo</DOCNO><TEXT>ACTI uard? FRANCISCO Not a mouse st\nPROLOGUE Two households, bo irring. BERNARDO Well, good ni\nth alike in dignity, In fair V ght. Ifyoudo meet Horatio and\nerona, where we lay our scene, Marcellus, The rivals2 ofmy\nFrom ancient grudge break to watch, bid them make haste. FR\nnew mutiny, ANCISCO I think I hear them.—\nStand, ho! who is there?EN\nFrancisco, I would speak with ERNARDO Haveyou had quiet gua had quiet guard? FRANCISCO No ARDO Haveyouhad quiet guard?\nyou. FRANCISCO Approach me not rd? FRANCISCO Not a mouse stir t a mouse stirring. BERNARDO W FRANCISCO Not a mouse stirrin\n, but speak. BERNARDO Your han ring. BERNARDO Well, good nigh ell, goodnight. If youdo mee g. BERNARDO Well, goodnight.\nd, your voice FRANCISCO I will t. Ifyou domeet Horatio and t Horatioand Marcellus, The r Ifyou do meet Horatioand Marc\nnot hear thee speak. BERNARDO Marcellus, The rivalsofmy wa ivals ofmy watch, bid them ma ellus, The rivals2 ofmy watch\nFrancisco, your hand, I entre tch, bid them make haste. FRAN ke haste. FRANCISCO I think I , bid them make haste. FRANCIS\nat thee. FRANCISCO Approach me CISCO I think I hear them. Sta hear them. Stand, ho! Who’s th CO I think I hear them.— Stand\nnot. BERNARDO Francisco FRANC nd, ho! who is there?Enter ere? (Enter HORATIO and MARCEL , ho! who is there? ENTER HORA\nLUS) HORATIO Friends to this g TIO AND MARCELLUS. HORATIO Fri\nround. MARCELLUS And liegemen ends to this ground. MARCELLUS\nto the Dane. FRANCISCO Give yo And liegemen to the Dane.3 FR\nu good night. MARCELLUS O, far ANCISCO Giveyougood night. M\newell, honest soldier: Who hat ARCELLUS O, farewell, honest s\nh relieved you? FRANCISCO Bern oldier: Who hath relieved you?\nardo hasmy place. Giveyou go FRANCISCO Bernardo hath my pl\nod night.(Exit ace. Give you good night\nISCO Approach me not. BERNARDO HORATIO and MARCELLUSHORATIO\nI have a letter FRANCISCO App Friends to this ground. MARCE\nroach me not. BERNARDO For the LLUS And liegemen to the Dane.\nking. FRANCISCO Approach me n FRANCISCO Give you good night\not. BERNARDO There’s no treaso . 
MARCELLUS O, farewell, hones\nn in’t. FRANCISCO Approach me t soldier: Who hath relieved y\nnot. BERNARDO I will ou? FRANCISCO Bernardo hath my\nplace. Give you good night.\n\nImproving language models by retrieving from trillions of tokens"
- >-
[ABSTRACT]
We propose a novel attention gate (AG) model for medical imaging that
automatically learns to focus on target structures of varying shapes and
sizes. Models
trained with AGs implicitly learn to suppress irrelevant regions in an
input image
while highlighting salient features useful for a specific task. This
enables us to
eliminate the necessity of using explicit external tissue/organ
localisation modules
of cascaded convolutional neural networks (CNNs). AGs can be easily
integrated
into standard CNN architectures such as the U-Net model with minimal
computational overhead while increasing the model sensitivity and
prediction accuracy.
The proposed Attention U-Net architecture is evaluated on two large CT
abdominal
datasets for multi-class image segmentation. Experimental results show
that AGs
consistently improve the prediction performance of U-Net across
different datasets
and training sizes while preserving computational efficiency. The source
code for
the proposed architecture is publicly available.
[1 Introduction]
Automated medical image segmentation has been extensively studied in the
image analysis community
because manual, dense labelling of large amounts of medical
images is a tedious and
error-prone task. Accurate and reliable solutions are desired to
increase clinical workflow efficiency
and support decision making through fast and automatic extraction of
quantitative measurements.
With the advent of convolutional neural networks (CNNs),
near-radiologist level performance can
be achieved in automated medical image analysis tasks including cardiac
MR segmentation [3] and
cancerous lung nodule detection [17]. High representation power, fast
inference, and filter sharing
properties have made CNNs the de facto standard for image segmentation.
Fully convolutional
networks (FCNs) [ 18] and the U-Net [ 24] are two commonly used
architectures. Despite their
good representational power, these architectures rely on multi-stage
cascaded CNNs when the target
organs show large inter-patient variation in terms of shape and size.
Cascaded frameworks extract a
region of interest (ROI) and make dense predictions on that particular
ROI. The application areas
include cardiac MRI [ 14], cardiac CT [ 23], abdominal CT [ 26, 27]
segmentation, and lung CT
nodule detection [17]. However, this approach leads to excessive and
redundant use of computational
resources and model parameters; for instance, similar low-level features
are repeatedly extracted by
all models within the cascade. To address this general problem, we
propose a simple and yet effective
solution, namely attention gates (AGs). CNN models with AGs can be
trained from scratch in a
standard way similar to the training of an FCN model, and AGs
automatically learn to focus on target
1st Conference on Medical Imaging with Deep Learning (MIDL 2018),
Amsterdam, The Netherlands.
arXiv:1804.03999v3 [cs.CV] 20 May 2018
structures without additional supervision. At test time, these gates
generate soft region proposals
implicitly on-the-fly and highlight salient features useful for a specific
task. Moreover, they do not
introduce significant computational overhead and do not require a large
number of model parameters
as in the case of multi-model frameworks. In return, the proposed AGs
improve model sensitivity and
accuracy for dense label predictions by suppressing feature activations
in irrelevant regions. In this
way, the necessity of using an external organ localisation model can be
eliminated while maintaining
the high prediction accuracy. Similar attention mechanisms have been
proposed for natural image
classification [11] and captioning [1] to perform adaptive feature
pooling, where model predictions
are conditioned only on a subset of selected image regions. In this
paper, we generalise this design
and propose image-grid based gating that allows attention coefficients to
be specific to local regions.
Moreover, our approach can be used for attention-based dense
predictions.
We demonstrate the implementation of AG in a standard U-Net architecture
(Attention U-Net) and
apply it to medical images. We choose the challenging CT pancreas
segmentation problem to provide
experimental evidence for our proposed contributions. This problem
constitutes a difficult task due to
low tissue contrast and large variability in organ shape and size. We
evaluate our implementation on
two commonly used benchmarks: TCIA Pancreas CT-82 [25] and multi-class
abdominal CT-150.
The results show that AGs consistently improve prediction accuracy across
different datasets and
training sizes while achieving state-of-the-art performance without
requiring multiple CNN models.
[2 Methodology]
Fully Convolutional Network (FCN): Convolutional neural networks (CNNs)
outperform traditional approaches in medical image analysis on public
benchmark datasets [14, 17] while being an
order of magnitude faster than, e.g., graph-cut and multi-atlas
segmentation techniques [34]. This
is mainly attributed to the fact that (I) domain specific image features
are learnt using stochastic
gradient descent (SGD) optimisation, (II) learnt kernels are shared
across all pixels, and (III) image
convolution operations exploit the structural information in medical
images well. In particular, fully
convolutional networks (FCN) [ 18] such as U-Net [ 24], DeepMedic [ 13]
and holistically nested
networks [16, 35] have been shown to achieve robust and accurate
performance in various tasks
including cardiac MR [3], brain tumours [12] and abdominal CT [26, 27]
image segmentation tasks.
Convolutional layers progressively extract higher-dimensional image representations ($x^l$) by processing local information layer by layer. Eventually, this separates pixels in a high-dimensional space according to their semantics. Through this sequential process, model predictions are conditioned on information collected from a large receptive field. Hence, the feature-map $x^l$ is obtained at the output of layer $l$ by sequentially applying a linear transformation followed by a non-linear activation function, often chosen as the rectified linear unit: $\sigma_1(x^l_{i,c}) = \max(0, x^l_{i,c})$, where $i$ and $c$ denote the spatial and channel dimensions respectively. Feature activations can be formulated as
$$x^l_c = \sigma_1\Big(\sum_{c' \in F_l} x^{l-1}_{c'} * k_{c',c}\Big),$$
where $*$ denotes the convolution operation and the spatial subscript $i$ is omitted for notational clarity. The function $f(x^l; \Phi^l) = x^{l+1}$ applied in convolution layer $l$ is characterised by trainable kernel parameters $\Phi^l$. The parameters are learnt by minimising a training objective, e.g. a cross-entropy loss, using stochastic gradient descent (SGD).
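As a minimal illustration of this formulation (not code from the paper), a single 3D convolution followed by a ReLU in PyTorch computes exactly $x^l = \sigma_1\big(\sum_{c'} x^{l-1}_{c'} * k_{c',c}\big)$; the channel counts and input size below are placeholder assumptions.

import torch
import torch.nn as nn

# One convolutional layer l: its kernels play the role of k_{c',c}; ReLU is sigma_1.
conv_l = nn.Conv3d(in_channels=32, out_channels=64, kernel_size=3, padding=1)

x_prev = torch.randn(1, 32, 16, 64, 64)   # x^{l-1}: (batch, channels, depth, height, width)
x_l = torch.relu(conv_l(x_prev))          # x^l = sigma_1(conv(x^{l-1})), shape (1, 64, 16, 64, 64)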
Figure 2: Schematic of the proposed additive attention gate (AG). Input
features ( xl) are scaled
with attention coefficients (α) computed in AG. Spatial regions are
selected by analysing both the
activations and contextual information provided by the gating signal (g)
which is collected from a
coarser scale. Grid resampling of attention coefficients is done using
trilinear interpolation.
In this paper, we build our attention model on top of a standard U-Net
architecture. U-Nets are
commonly used for image segmentation tasks because of their good
performance and efficient use
of GPU memory. The latter advantage is mainly linked to extraction of
image features at multiple
image scales. Coarse feature-maps capture contextual information and
highlight the category and
location of foreground objects. Feature-maps extracted at multiple
scales are later merged through
skip connections to combine coarse- and fine-level dense predictions as
shown in Figure 1.
Attention Gates for Image Analysis: To capture a sufficiently large
receptive field and thus, semantic contextual information, the
feature-map grid is gradually downsampled in standard CNN
architectures. In this way, features on the coarse spatial grid level
model location and relationship
between tissues at global scale. However, it remains difficult to reduce
false-positive predictions for
small objects that show large shape variability. In order to improve the
accuracy, current segmentation
frameworks [14, 26, 27] rely on additional preceding object localisation
models to simplify the task
into separate localisation and subsequent segmentation steps. Here, we
demonstrate that the same
objective can be achieved by integrating attention gates (AGs) in a
standard CNN model. This does
not require the training of multiple models and a large number of extra
model parameters. In contrast
to the localisation model in multi-stage CNNs, AGs progressively
suppress feature responses in
irrelevant background regions without the requirement to crop a ROI
between networks.
Attention coefficients, $\alpha_i \in [0, 1]$, identify salient image regions and prune feature responses to preserve only the activations relevant to the specific task, as shown in Figure 3a. The output of AGs is the element-wise multiplication of input feature-maps and attention coefficients: $\hat{x}^l_{i,c} = x^l_{i,c} \cdot \alpha^l_i$. In a default setting, a single scalar attention value is computed for each pixel vector $x^l_i \in \mathbb{R}^{F_l}$, where $F_l$ corresponds to the number of feature-maps in layer $l$. In case of
multiple semantic classes, we
propose to learn multi-dimensional attention coefficients. This is
inspired by [ 29], where multidimensional attention coefficients are used
to learn sentence embeddings. Thus, each AG learns to
focus on a subset of target structures. As shown in Figure 2, a gating
vector $g_i \in \mathbb{R}^{F_g}$ is used for each pixel $i$ to determine focus regions. The gating vector contains
contextual information to prune
lower-level feature responses as suggested in [32], which uses AGs for
natural image classification.
We use additive attention [2] to obtain the gating coefficient. Although
this is computationally more
expensive, it has experimentally been shown to achieve higher accuracy than
multiplicative attention [19].
Additive attention is formulated as follows:
$$q^l_{att} = \psi^T\big(\sigma_1(W_x^T x^l_i + W_g^T g_i + b_g)\big) + b_\psi \qquad (1)$$
$$\alpha^l_i = \sigma_2\big(q^l_{att}(x^l_i, g_i;\, \Theta_{att})\big), \qquad (2)$$
where $\sigma_2(x_{i,c}) = \frac{1}{1 + \exp(-x_{i,c})}$ corresponds to the sigmoid activation function. AG is characterised by a set of parameters $\Theta_{att}$ containing: linear transformations $W_x \in \mathbb{R}^{F_l \times F_{int}}$, $W_g \in \mathbb{R}^{F_g \times F_{int}}$, $\psi \in \mathbb{R}^{F_{int} \times 1}$ and bias terms $b_\psi \in \mathbb{R}$, $b_g \in \mathbb{R}^{F_{int}}$. The linear transformations
are computed using
channel-wise 1x1x1 convolutions for the input tensors. In other contexts
[33], this is referred to as
vector concatenation-based attention, where the concatenated features $x^l$ and $g$ are linearly mapped to an $\mathbb{R}^{F_{int}}$-dimensional intermediate space. In image captioning [1] and
classification [11] tasks, the softmax activation function is used to normalise the attention coefficients ($\sigma_2$); however, sequential use of softmax yields sparser activations at the output. For this reason, we choose a sigmoid activation function. This results experimentally in better training convergence for the AG parameters.
Figure 3(a): From left to right (a-e, f-j): Axial and sagittal views of a 3D abdominal CT scan, attention coefficients, feature activations of a skip connection before and after gating. Similarly, (k-n) visualise the gating on a coarse scale skip connection. The filtered feature activations (d-e, i-j) are collected from multiple AGs, where a subset of organs is selected by each gate. Activations shown in (d-e, i-j) consistently correspond to specific structures across different scans.
Figure 3(b): The ground-truth pancreas segmentation (a) is highlighted in blue (b). Similarly, U-Net model prediction (c) and the predictions obtained with Attention U-Net (d) are shown. The missed dense predictions by U-Net are highlighted with red arrows.
In contrast
to [11], we propose a grid-attention technique. In this case, the gating signal is not a single global vector for all image pixels but a grid signal conditioned on image spatial information. More importantly, the
gating signal for each skip connection aggregates information from
multiple imaging scales, as shown
in Figure 1, which increases the grid-resolution of the query signal and achieves better performance.
Lastly, we would like to note that AG parameters can be trained with the
standard back-propagation
updates without a need for sampling based update methods used in
hard-attention [21].
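To make Eqs. (1)-(2) concrete, the following is a minimal PyTorch sketch of an additive attention gate. It is an illustrative implementation under stated assumptions, not the authors' released code: the strided 1x1x1 convolution used to bring $x^l$ to the gating resolution, the trilinear resampling, and the layer names (W_x, W_g, psi) are choices made for this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Additive attention gate sketch: alpha = sigmoid(psi(relu(W_x x + W_g g + b_g)))."""
    def __init__(self, in_channels, gating_channels, inter_channels):
        super().__init__()
        # Channel-wise 1x1x1 convolutions play the role of the linear maps W_x, W_g and psi.
        self.W_x = nn.Conv3d(in_channels, inter_channels, kernel_size=1, stride=2, bias=False)
        self.W_g = nn.Conv3d(gating_channels, inter_channels, kernel_size=1, bias=True)
        self.psi = nn.Conv3d(inter_channels, 1, kernel_size=1, bias=True)

    def forward(self, x, g):
        # x: skip-connection features (B, F_l, D, H, W); g: coarser gating signal (B, F_g, D', H', W').
        theta_x = self.W_x(x)                      # downsample x towards the gating resolution (assumed stride 2)
        g_up = F.interpolate(self.W_g(g), size=theta_x.shape[2:], mode="trilinear", align_corners=False)
        q_att = self.psi(F.relu(theta_x + g_up))   # Eq. (1), additive attention
        alpha = torch.sigmoid(q_att)               # Eq. (2), sigmoid rather than softmax
        # Resample the coefficient grid back to the resolution of x (trilinear interpolation).
        alpha = F.interpolate(alpha, size=x.shape[2:], mode="trilinear", align_corners=False)
        return x * alpha                           # element-wise gating: x_hat = x * alpha

In the multi-class setting described above, psi could output one coefficient per target structure instead of a single channel, giving the multi-dimensional attention coefficients mentioned in the text.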
Attention Gates in U-Net Model: The proposed AGs are incorporated into
the standard U-Net
architecture to highlight salient features that are passed through the
skip connections, see Figure
1. Information extracted from coarse scale is used in gating to
disambiguate irrelevant and noisy
responses in skip connections. This is performed right before the
concatenation operation to merge
only relevant activations. Additionally, AGs filter the neuron
activations during the forward pass as
well as during the backward pass. Gradients originating from background
regions are down-weighted
during the backward pass. This allows model parameters in shallower
layers to be updated mostly
based on spatial regions that are relevant to a given task. The update
rule for convolution parameters
in layer $l-1$ can be formulated as follows:
$$\frac{\partial \hat{x}^l_i}{\partial \Phi^{l-1}} = \frac{\partial\big(\alpha^l_i\, f(x^{l-1}_i; \Phi^{l-1})\big)}{\partial \Phi^{l-1}} = \alpha^l_i\, \frac{\partial f(x^{l-1}_i; \Phi^{l-1})}{\partial \Phi^{l-1}} + \frac{\partial \alpha^l_i}{\partial \Phi^{l-1}}\, x^l_i \qquad (3)$$
The first gradient term on the right-hand side is scaled with $\alpha^l_i$. In case of multi-dimensional AGs, $\alpha^l_i$
corresponds to a vector at each grid scale. In each sub-AG,
complementary information is extracted
and fused to define the output of the skip connection. To reduce the number
of trainable parameters
and computational complexity of AGs, the linear transformations are
performed without any spatial
support (1x1x1 convolutions) and input feature-maps are downsampled to
the resolution of the gating signal, similar to non-local blocks [33]. The corresponding linear
transformations decouple the
feature-maps and map them to lower dimensional space for the gating
operation. As suggested in
[11], low-level feature-maps, i.e. the first skip connections, are not
used in the gating function since
they do not represent the input data in a high dimensional space. We use
deep-supervision [16] to
force the intermediate feature-maps to be semantically discriminative at
each image scale. This helps
to ensure that attention units, at different scales, have an ability to
influence the responses to a large
range of image foreground content. We therefore prevent dense
predictions from being reconstructed
from small subsets of skip connections.
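As a rough sketch of how such a gate sits on a skip connection (again an assumption-laden illustration rather than the paper's exact configuration), a decoder step can gate the skip features with the coarser decoder activations right before concatenation, reusing the AttentionGate module from the previous sketch; the channel counts and transposed-convolution upsampling here are placeholder choices.

import torch
import torch.nn as nn

class GatedUpBlock(nn.Module):
    """One Attention U-Net style decoder step: gate the skip connection, upsample, concatenate."""
    def __init__(self, dec_channels, skip_channels, out_channels):
        super().__init__()
        self.up = nn.ConvTranspose3d(dec_channels, dec_channels, kernel_size=2, stride=2)
        # AttentionGate is the sketch shown above; the intermediate channel choice is illustrative.
        self.gate = AttentionGate(in_channels=skip_channels,
                                  gating_channels=dec_channels,
                                  inter_channels=max(skip_channels // 2, 1))
        self.conv = nn.Sequential(
            nn.Conv3d(dec_channels + skip_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, dec, skip):
        # The coarser decoder activations act as the gating signal g for the finer skip features.
        gated_skip = self.gate(skip, g=dec)
        dec = self.up(dec)                                     # bring decoder features to the skip resolution
        return self.conv(torch.cat([dec, gated_skip], dim=1))  # merge only the relevant (gated) activations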
[4 Discussion And Conclusion]
In this paper, we presented a novel attention gate model applied to
medical image segmentation. Our
approach eliminates the necessity of applying an external object
localisation model. The proposed
approach is generic and modular; as such, it can be easily applied to
image classification and regression
problems as in the examples of natural image analysis and machine
translation. Experimental
results demonstrate that the proposed AGs are highly beneficial for
tissue/organ identification and
localisation. This is particularly true for small organs of variable size,
such as the pancreas, and similar
behaviour is expected for global classification tasks.
Training behaviour of the AGs can benefit from transfer learning and
multi-stage training schemes.
For instance, pre-trained U-Net weights can be used to initialise the
attention network, and gates can
be trained accordingly in the fine-tuning stage. Similarly, there is a
vast body of literature in machine
learning exploring different gating architectures. For example, highway
networks [7] make use of
residual connections around the gate block to allow better gradient
backpropagation and slightly
softer attention mechanisms. Although our experiments with residual
connections have not provided
any significant performance improvement, future research will focus on
this aspect to obtain a better
training behaviour. Lastly, we note that with the advent of improved GPU
computation power and
memory, larger capacity 3D models can be trained with larger batch sizes
without the need for image
downsampling. In this way, we would not need to utilise ad-hoc
post-processing techniques to further
improve the state-of-the-art results. Similarly, the performance of
Attention U-Net can be further
enhanced by utilising fine resolution input batches without additional
heuristics. Finally, we would like to thank Salim Arslan and Dan Busbridge for their helpful
comments on this work.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2
This is a sentence-transformers model finetuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: sentence-transformers/all-MiniLM-L6-v2
- Maximum Sequence Length: 256 tokens
- Output Dimensionality: 384 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
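For illustration, the snippet below reproduces what the Transformer → Pooling → Normalize stack above computes, using plain transformers and PyTorch: mean pooling over non-padding token embeddings followed by L2 normalisation. It is a sketch of the pipeline's semantics, not an alternative to the SentenceTransformer API shown in the Usage section below, and it loads the base all-MiniLM-L6-v2 checkpoint purely as a stand-in.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Mean pooling over valid tokens, matching pooling_mode_mean_tokens=True in the Pooling module above.
def mean_pool(token_embeddings, attention_mask):
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # sum of non-padding token vectors
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of non-padding tokens per sentence
    return summed / counts

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

batch = tokenizer(["A query about image segmentation.",
                   "A passage about attention gates."],
                  padding=True, truncation=True, max_length=256, return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state   # (batch, seq_len, 384)

embeddings = F.normalize(mean_pool(token_embeddings, batch["attention_mask"]), p=2, dim=1)
score = embeddings[0] @ embeddings[1]   # cosine similarity, since the vectors are unit-norm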
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Pravallika6/cross-domain-embeddings")
# Run inference
sentences = [
'Are there any frameworks that adapt to different types of image segmentation tasks?',
'[ABSTRACT]\nWe propose a novel attention gate (AG) model for medical imaging that automatically learns to focus on target structures of varying shapes and sizes. Models\ntrained with AGs implicitly learn to suppress irrelevant regions in an input image\nwhile highlighting salient features useful for a specific task. This enables us to\neliminate the necessity of using explicit external tissue/organ localisation modules\nof cascaded convolutional neural networks (CNNs). AGs can be easily integrated\ninto standard CNN architectures such as the U-Net model with minimal computational overhead while increasing the model sensitivity and prediction accuracy.\nThe proposed Attention U-Net architecture is evaluated on two large CT abdominal\ndatasets for multi-class image segmentation. Experimental results show that AGs\nconsistently improve the prediction performance of U-Net across different datasets\nand training sizes while preserving computational efficiency. The source code for\nthe proposed architecture is publicly available.\n\n[1 Introduction]\nAutomated medical image segmentation has been extensively studied in the image analysis community\ndue to the fact that manual, dense labelling of large amounts of medical images is a tedious and\nerror-prone task. Accurate and reliable solutions are desired to increase clinical work flow efficiency\nand support decision making through fast and automatic extraction of quantitative measurements.\nWith the advent of convolutional neural networks (CNNs), near-radiologist level performance can\nbe achieved in automated medical image analysis tasks including cardiac MR segmentation [3] and\ncancerous lung nodule detection [17]. High representation power, fast inference, and filter sharing\nproperties have made CNNs the de facto standard for image segmentation. Fully convolutional\nnetworks (FCNs) [ 18] and the U-Net [ 24] are two commonly used architectures. Despite their\ngood representational power, these architectures rely on multi-stage cascaded CNNs when the target\norgans show large inter-patient variation in terms of shape and size. Cascaded frameworks extract a\nregion of interest (ROI) and make dense predictions on that particular ROI. The application areas\ninclude cardiac MRI [ 14], cardiac CT [ 23], abdominal CT [ 26, 27] segmentation, and lung CT\nnodule detection [17]. However, this approach leads to excessive and redundant use of computational\nresources and model parameters; for instance, similar low-level features are repeatedly extracted by\nall models within the cascade. To address this general problem, we propose a simple and yet effective\nsolution, namely attention gates(AGs). CNN models with AGs can be trained from scratch in a\nstandard way similar to the training of a FCN model, and AGs automatically learn to focus on target\n1st Conference on Medical Imaging with Deep Learning (MIDL 2018), Amsterdam, The Netherlands.\narXiv:1804.03999v3 [cs.CV] 20 May 2018\nstructures without additional supervision. At test time, these gates generate soft region proposals\nimplicitly on-the-fly and highlight salient features useful for a specific task. Moreover, they do not\nintroduce significant computational overhead and do not require a large number of model parameters\nas in the case of multi-model frameworks. In return, the proposed AGs improve model sensitivity and\naccuracy for dense label predictions by suppressing feature activations in irrelevant regions. 
In this\nway, the necessity of using an external organ localisation model can be eliminated while maintaining\nthe high prediction accuracy. Similar attention mechanisms have been proposed for natural image\nclassification [11] and captioning [1] to perform adaptive feature pooling, where model predictions\nare conditioned only on a subset of selected image regions. In this paper, we generalise this design\nand propose image-grid based gating that allows attention coefficients to be specific to local regions.\nMoreover, our approach can be used for attention-based dense predictions.\nWe demonstrate the implementation of AG in a standard U-Net architecture (Attention U-Net) and\napply it to medical images. We choose the challenging CT pancreas segmentation problem to provide\nexperimental evidence for our proposed contributions. This problem constitutes a difficult task due to\nlow tissue contrast and large variability in organ shape and size. We evaluate our implementation on\ntwo commonly used benchmarks: TCIA Pancreas CT-82 [25] and multi-class abdominal CT-150.\nThe results show that AGs consistenly improve prediction accuracy across different datasets and\ntraining sizes while achieving state-of-the-art performance without requiring multiple CNN models.\n\n[2 Methodology]\nFully Convolutional Network (FCN):Convolutional neural networks (CNNs) outperform traditional approaches in medical image analysis on public benchmark datasets [14, 17] while being an\norder of magnitude faster than, e.g., graph-cut and multi-atlas segmentation techniques [34]. This\nis mainly attributed to the fact that (I) domain specific image features are learnt using stochastic\ngradient descent (SGD) optimisation, (II) learnt kernels are shared across all pixels, and (III) image\nconvolution operations exploit the structural information in medical images well. In particular, fully\nconvolutional networks (FCN) [ 18] such as U-Net [ 24], DeepMedic [ 13] and holistically nested\nnetworks [16, 35] have been shown to achieve robust and accurate performance in various tasks\nincluding cardiac MR [3], brain tumours [12] and abdominal CT [26, 27] image segmentation tasks.\nConvolutional layers progressively extract higher dimensional image representations ( xl) by processing local information layer by layer. Eventually, this separates pixels in a high dimensional\nspace according to their semantics. Through this sequential process, model predictions are conditioned on information collected from a large receptive field. Hence, feature-map xl is obtained\nat the output of layer lby sequentially applying a linear transformation followed by a non-linear\nactivation function. It is often chosen as rectified linear unit: σ1 ( xl\ni,c) = max(0 ,xl\ni,c) where iand\nc denote spatial and channel dimensions respectively. Feature activations can be formulated as:\nxl\nc = σ1\n(∑\nc′∈Fl\nxl−1\nc′ ∗kc′,c\n)\nwhere ∗denotes the convolution operation, and the spatial subscript\n(i) is omitted in the formulation for notational clarity. The function f( xl; Φ l) = x(l+1) applied in\nconvolution layer lis characterised by trainable kernel parameters Φ l. The parameters are learnt by\n\nAttention GateLorem ipsum dolor sit amet,consectetur adipisicing elit, seddo eiusmod tempor incididunt utlabore et dolore magna aliqua.\n x x x \n x x x \nReLU \nx \n \nSigmoid Resampler\nx x \nFigure 2: Schematic of the proposed additive attention gate (AG). Input features ( xl) are scaled\nwith attention coefficients (α) computed in AG. 
Spatial regions are selected by analysing both the\nactivations and contextual information provided by the gating signal (g) which is collected from a\ncoarser scale. Grid resampling of attention coefficients is done using trilinear interpolation.\nminimising a training objective, e.g. cross-entropy loss, using stochastic gradient descent (SGD).\nIn this paper, we build our attention model on top of a standard U-Net architecture. U-Nets are\ncommonly used for image segmentation tasks because of their good performance and efficient use\nof GPU memory. The latter advantage is mainly linked to extraction of image features at multiple\nimage scales. Coarse feature-maps capture contextual information and highlight the category and\nlocation of foreground objects. Feature-maps extracted at multiple scales are later merged through\nskip connections to combine coarse- and fine-level dense predictions as shown in Figure 1.\nAttention Gates for Image Analysis:To capture a sufficiently large receptive field and thus, semantic contextual information, the feature-map grid is gradually downsampled in standard CNN\narchitectures. In this way, features on the coarse spatial grid level model location and relationship\nbetween tissues at global scale. However, it remains difficult to reduce false-positive predictions for\nsmall objects that show large shape variability. In order to improve the accuracy, current segmentation\nframeworks [14, 26, 27] rely on additional preceding object localisation models to simplify the task\ninto separate localisation and subsequent segmentation steps. Here, we demonstrate that the same\nobjective can be achieved by integrating attention gates (AGs) in a standard CNN model. This does\nnot require the training of multiple models and a large number of extra model parameters. In contrast\nto the localisation model in multi-stage CNNs, AGs progressively suppress feature responses in\nirrelevant background regions without the requirement to crop a ROI between networks.\nAttention coefficients, αi ∈[0,1], identify salient image regions and prune feature responses to\npreserve only the activations relevant to the specific task as shown in Figure 3a. The output of AGs is\nthe element-wise multiplication of input feature-maps and attention coefficients: ˆxl\ni,c = xl\ni,c ·αl\ni. In\na default setting, a single scalar attention value is computed for each pixel vector xl\ni ∈RFl where\nFl corresponds to the number of feature-maps in layer l. In case of multiple semantic classes, we\npropose to learn multi-dimensional attention coefficients. This is inspired by [ 29], where multidimensional attention coefficients are used to learn sentence embeddings. Thus, each AG learns to\nfocus on a subset of target structures. As shown in Figure 2, a gating vector gi ∈RFg is used for\neach pixel ito determine focus regions. The gating vector contains contextual information to prune\nlower-level feature responses as suggested in [32], which uses AGs for natural image classification.\nWe use additive attention [2] to obtain the gating coefficient. Although this is computationally more\nexpensive, it has experimentally shown to achieve higher accuracy than multiplicative attention [19].\nAdditive attention is formulated as follows:\nql\natt = ψT (\nσ1 ( WT\nx xl\ni + WT\ng gi + bg)\n)\n+ bψ (1)\nαl\ni = σ2( ql\natt(xl\ni, gi; Θatt) ), (2)\nwhere σ2(xi,c) = 1\n1+exp(−xi,c) correspond to sigmoid activation function. 
AG is characterised\nby a set of parameters Θatt containing: linear transformations Wx ∈RFl×Fint, Wg ∈RFg×Fint,\nψ ∈RFint×1 and bias terms bψ ∈R , bg ∈RFint. The linear transformations are computed using\nchannel-wise 1x1x1 convolutions for the input tensors. In other contexts [33], this is referred to as\nvector concatenation-based attention, where the concatenated features xl and gare linearly mapped\nto a RFint dimensional intermediate space. In image captioning [1] and classification [11] tasks, the\n\nFigure 3(a): From left to right (a-e, f-j): Axial and sagittal views of a\n3D abdominal CT scan, attention coefficients, feature activations of\na skip connection before and after gating. Similarly, (k-n) visualise\nthe gating on a coarse scale skip connection. The filtered feature\nactivations (d-e, i-j) are collected from multiple AGs, where a subset\nof organs is selected by each gate. Activations shown in (d-e, i-j)\nconsistently correspond to specific structures across different scans.\nFigure 3(b): The ground-truth pancreas\nsegmentation (a) is highlighted in blue\n(b). Similarly, U-Net model prediction\n(c) and the predictions obtained with Attention U-Net (d) are shown. The missed\ndense predictions by U-Net are highlighted with red arrows.\nsoftmax activation function is used to normalise the attention coefficients (σ2); however, sequential\nuse of softmax yields sparser activations at the output. For this reason, we choose a sigmoid activation\nfunction. This results experimentally in better training convergence for the AG parameters. In contrast\nto [11] we propose a grid-attention technique. In this case, gating signal is not a global single vector\nfor all image pixels but a grid signal conditioned to image spatial information. More importantly, the\ngating signal for each skip connection aggregates information from multiple imaging scales, as shown\nin Figure 1, which increases the grid-resolution of the query signal and achieve better performance.\nLastly, we would like to note that AG parameters can be trained with the standard back-propagation\nupdates without a need for sampling based update methods used in hard-attention [21].\nAttention Gates in U-Net Model:The proposed AGs are incorporated into the standard U-Net\narchitecture to highlight salient features that are passed through the skip connections, see Figure\n1. Information extracted from coarse scale is used in gating to disambiguate irrelevant and noisy\nresponses in skip connections. This is performed right before the concatenation operation to merge\nonly relevant activations. Additionally, AGs filter the neuron activations during the forward pass as\nwell as during the backward pass. Gradients originating from background regions are down weighted\nduring the backward pass. This allows model parameters in shallower layers to be updated mostly\nbased on spatial regions that are relevant to a given task. The update rule for convolution parameters\nin layer l−1 can be formulated as follows:\n∂(ˆxl\ni)\n∂(Φl−1) = ∂\n(\nαl\nif(xl−1\ni ; Φl−1)\n)\n∂(Φl−1) = αl\ni\n∂(f(xl−1\ni ; Φl−1))\n∂(Φl−1) + ∂(αl\ni)\n∂(Φl−1) xl\ni (3)\nThe first gradient term on the right-hand side is scaled with αl\ni. In case of multi-dimensional AGs, αl\ni\ncorresponds to a vector at each grid scale. In each sub-AG, complementary information is extracted\nand fused to define the output of skip connection. 
To reduce the number of trainable parameters\nand computational complexity of AGs, the linear transformations are performed without any spatial\nsupport (1x1x1 convolutions) and input feature-maps are downsampled to the resolution of gating\nsignal, similar to non-local blocks [ 33]. The corresponding linear transformations decouple the\nfeature-maps and map them to lower dimensional space for the gating operation. As suggested in\n[11], low-level feature-maps, i.e. the first skip connections, are not used in the gating function since\nthey do not represent the input data in a high dimensional space. We use deep-supervision [16] to\nforce the intermediate feature-maps to be semantically discriminative at each image scale. This helps\nto ensure that attention units, at different scales, have an ability to influence the responses to a large\nrange of image foreground content. We therefore prevent dense predictions from being reconstructed\nfrom small subsets of skip connections.\n\n[4 Discussion And Conclusion]\nIn this paper, we presented a novel attention gate model applied to medical image segmentation. Our\napproach eliminates the necessity of applying an external object localisation model. The proposed\napproach is generic and modular as such it can be easily applied to image classification and regression\nproblems as in the examples of natural image analysis and machine translation. Experimental\nresults demonstrate that the proposed AGs are highly beneficial for tissue/organ identification and\nlocalisation. This is particularly true for variable small size organs such as the pancreas, and similar\nbehaviour is expected for global classification tasks.\nTraining behaviour of the AGs can benefit from transfer learning and multi-stage training schemes.\nFor instance, pre-trained U-Net weights can be used to initialise the attention network, and gates can\nbe trained accordingly in the fine-tuning stage. Similarly, there is a vast body of literature in machine\nlearning exploring different gating architectures. For example, highway networks [7] make use of\nresidual connections around the gate block to allow better gradient backpropagation and slightly\nsofter attention mechanisms. Although our experiments with residual connections have not provided\nany significant performance improvement, future research will focus on this aspect to obtain a better\ntraining behaviour. Lastly, we note that with the advent of improved GPU computation power and\nmemory, larger capacity 3D models can be trained with larger batch sizes without the need for image\ndownsampling. In this way, we would not need to utilise ad-hoc post-processing techniques to further\nimprove the state-of-the-art results. Similarly, the performance of Attention U-Net can be further\nenhanced by utilising fine resolution input batches without additional heuristics. Lastly, we would\nlike to thank to Salim Arslan and Dan Busbridge for their helpful comments on this work.',
'[Preamble]\nImproving language models by retrieving\nfrom trillions of tokens\nSebastian Borgeaudy, Arthur Menschy, Jordan Hoffmanny, Trevor Cai, Eliza Rutherford, Katie Millican,\nGeorge van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas,\nAurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones,\nAlbin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero,\nKaren Simonyan, Jack W. Raez, Erich Elsenzand Laurent Sifrey,z\nAll authors from DeepMind,yEqual contributions,zEqual senior authorship\nWe enhance auto-regressive language models by conditioning on document chunks retrieved from a\nlarge corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our\nRetrieval-Enhanced Transformer (R/e.sc/t.sc/r.sc/o.sc) obtains comparable performance to GPT-3 and Jurassic-1\non the Pile, despite using 25\x02fewer parameters. After fine-tuning,R/e.sc/t.sc/r.sc/o.scperformance translates to\ndownstream knowledge-intensive tasks such as question answering.R/e.sc/t.sc/r.sc/o.sccombines a frozenB/e.sc/r.sc/t.sc\nretriever, adifferentiableencoderandachunkedcross-attentionmechanismtopredicttokensbasedon\nan order of magnitude more data than what is typically consumed during training. We typically train\nR/e.sc/t.sc/r.sc/o.scfrom scratch, yet can also rapidlyR/e.sc/t.sc/r.sc/o.scfit pre-trained transformers with retrieval and still\nachieve good performance. Our work opens up new avenues for improving language models through\nexplicit memory at unprecedented scale.\n\n[1. Introduction]\nLanguage modelling (LM) is an unsupervised task that consists of modelling the probability of text,\nusually by factorising it into conditional next-token predictions𝑝¹𝑥1\x94\x93\x93\x93\x94𝑥 𝑛º= Î\n𝑖 𝑝¹𝑥𝑖j𝑥\x9d𝑖º. Neural\nnetworks have proven to be powerful language models, first in the form of recurrent architectures\n(Graves, 2013; Jozefowicz et al., 2016; Mikolov et al., 2010) and more recently in the form of\nTransformers (Vaswani et al., 2017), that use attention to contextualise the past. Large performance\nimprovementshavecomefromincreasingtheamountofdata, trainingcompute, ormodelparameters.\nTransformers have been scaled from100 million parameter models in seminal work to over hundred\nbillion parameters (Brown et al., 2020; Radford et al., 2019) in the last two years which has led to\nmodels that do very well on a wide array of tasks in a zero or few-shot formulation. Increasing model\nsize predictably improves performance on a wide range of downstream tasks (Kaplan et al., 2020).\nThe benefits of increasing the number of parameters come from two factors: additional computations\nat training and inference time, and increased memorization of the training data.\nIn this work, we endeavor to decouple these, by exploring efficient means of augmenting language\nmodels with a massive-scale memory without significantly increasing computations. Specifically, we\nsuggest retrieval from a large text database as a complementary path to scaling language models.\nInstead of increasing the size of the model and training on more data, we equip models with the\nability to directly access a large database to perform predictions—a semi-parametric approach. At\na high level, our Retrieval Transformer (R/e.sc/t.sc/r.sc/o.sc) model splits the input sequence into chunks and\nretrieves text similar to the previous chunk to improve the predictions in the current chunk. 
Existing retrieval for language modelling work only considers small transformers (100 million parameters) and databases of limited size (up to billions of tokens) (Guu et al., 2020; Khandelwal et al., 2020; Lewis et al., 2020; Yogatama et al., 2021). To our knowledge, our work is the first to show the benefits of scaling the retrieval database to trillions of tokens for large parametric language models.\n
Figure 1 | Scaling of RETRO. The performance gain of our retrieval models remains constant with model scale (left), and is comparable to multiplying the parametric model size by ~10x. The gain increases with the size of the retrieval database (middle) and the number of retrieved neighbours (right) on the C4 validation set, when using up to 40 neighbours. Past this, performance begins to degrade, perhaps due to the reduced quality. At evaluation RETRO can be used without retrieval data (RETRO [OFF]), bringing limited performance degradation compared to baseline transformers.\n
Our main contributions are the following.\n
• We introduce RETRO, a retrieval-enhanced autoregressive language model (§2.2). We use a chunked cross-attention module to incorporate the retrieved text (§2.4), with time complexity linear in the amount of retrieved data. We show that retrieving based on a pre-trained frozen BERT model (§2.3) works at scale, removing the need for training and updating a retriever network.\n
• We show that our method scales well with model size and database size (Fig. 1): RETRO provides a constant gain for models ranging from 150M to 7B parameters, and RETRO can be improved at evaluation time by increasing the database size and the number of retrieved neighbours. Our largest model obtains state-of-the-art results on a range of downstream evaluation datasets including Wikitext103 (Merity et al., 2017) and the Pile (Gao et al., 2020) (§4). We show that RETRO can be fine-tuned to achieve competitive performance on downstream tasks such as question answering (§4.3).\n
• We propose an evaluation aware of proximity of test documents with the training set (§2.6), addressing the problem of test set leakage (Lee et al., 2021). This is relevant for all language models, and especially for retrieval-enhanced models since they have direct access to the training dataset during evaluation. Using this methodology, we show that the performance of RETRO comes from both explicit neighbour copying and general knowledge extraction (§4.4).\n
[2. Method]\n
We design our retrieval-enhanced architecture to be capable of retrieving from a database with trillions of tokens. For this purpose, we retrieve at the level of contiguous token chunks instead of individual tokens, which reduces storage and computation requirements by a large linear factor. Our method first constructs a key-value database, where values store raw chunks of text tokens and keys are frozen BERT embeddings (Devlin et al., 2019). We use a frozen model to avoid having to periodically re-compute embeddings over the entire database during training. Each training sequence is then split into chunks, which are augmented with their k-nearest neighbours retrieved from the database. An encoder-decoder architecture integrates retrieval chunks into the model's predictions. We summarize the RETRO architecture in Fig. 2, and detail it in this section. We end the section by introducing a new methodology to evaluate language models when an evaluation set is partially present in the training set.\n
Figure 2 | RETRO architecture. Left: simplified version where a sequence of length n = 12 is split into l = 3 chunks of size m = 4. For each chunk, we retrieve k = 2 neighbours of r = 5 tokens each. The retrieval pathway is shown on top. Right: details of the interactions in the CCA operator. Causality is maintained as neighbours of the first chunk only affect the last token of the first chunk and tokens from the second chunk.\n
The model receives the corresponding values Ret(C) = ([N^1, F^1], ..., [N^k, F^k]). Both neighbour chunks and their continuations provide meaningful improvements, as illustrated in our ablation study (Appendix D). We use a length of 64 for both N^j and F^j, thus Ret(C) has a shape of k x r with r = 128. To avoid retrieving the chunk C_{u+1} in the retrieval set Ret(C_u), which would break causality during training, we filter out neighbours originating from the same document as the training sequence X.\n
For a database of T elements, we can query the approximate nearest neighbours in O(log T) time. We use the SCaNN library (Guo et al., 2020) to achieve this. This means that we can query our 2 trillion token database in 10 ms whilst evaluating or sampling from the model; this expense is amortized over a chunk length. Performing retrieval on-the-fly is too slow to keep up with the training calculations - we leverage the frozen aspect of the embedding operator BERT to precompute all approximate nearest neighbours and save the results as part of the data. In Fig. 9 in the Appendix, we show results where we only retrieve neighbours within Wikipedia. We find that neighbours tend to come from 2-3 links away from a given article whereas random articles are more than 5 links apart.\n
Table 1 | MassiveText. The last column indicates the sampling weight during training. The multilingual subsets include documents in 10 languages. The full breakdown is given in §A.1.\n
Source | Token count (M) | Documents (M) | Multilingual | Sampling frequency\n
Web | 977,563 | 1,208 | Yes | 55%\n
Books | 3,423,740 | 20 | No | 25%\n
News | 236,918 | 398 | No | 10%\n
Wikipedia | 13,288 | 23 | Yes | 5%\n
GitHub | 374,952 | 143 | No | 5%\n
[2.4. RETRO model architecture]\n
Our model relies on an encoder-decoder transformer architecture, integrating the retrieved data through a cross-attention mechanism as introduced in Vaswani et al. (2017). First, the retrieved tokens Ret(C) are fed into an encoder Transformer, which computes the encoded neighbours set E. Denoting the intermediate activations by H, our transformer decoder then interleaves RETRO-blocks Retro(H, E) and standard Transformer blocks Lm(H) (the hyperparameter P ⊆ [1, L] determines at which layers we use a RETRO-block). These blocks are built from three different residual operators with signature R^{n x d} -> R^{n x d}: a fully-connected layer Ffw, the standard sequence-level self-attention layer Attn, and a chunked cross-attention layer Cca(·, E) that incorporates information from the retrieval encoder:\n
Retro(H, E) := Ffw(Cca(Attn(H), E)), and Lm(H) := Ffw(Attn(H)).   (2)\n
Since Ffw, Attn and Cca are all autoregressive operators whose output at position i only depends on (h_j)_{j <= i}, any succession of RETRO and LM layers, followed by a token classification head, defines an autoregressive log-likelihood (1). An overview of the model architecture is given in Algorithm 1 and in Fig. 2. We next describe the retrieval encoder and the chunked cross-attention layer in more detail, and explain how to sample from RETRO.\n
Encoding retrieval neighbours. For each chunk C_u, the k retrieval neighbours Ret(C_u) are fed into a bi-directional transformer Encoder, yielding the outputs E_u^j := Encoder(Ret(C_u)^j, H_u) in R^{r x d'}, where j in [1, k] indexes each neighbour. The retrieval encoder is a non-causal transformer. It is conditioned on H_u, the activations of chunk C_u, through cross-attention layers; this allows the representations of the retrieval encoder to be modulated by the retrieving chunk in a differentiable way. More precisely, the encoding of the j-th neighbour of the u-th chunk, Ret(C_u)^j, depends on the attended activation H_u := (h_{(u-1)m+i})_{i in [1, m]} in R^{m x d} of chunk C_u at layer min(P). All neighbours for all chunks are encoded in parallel, yielding a full encoded set E := (E_u^j)_{u in [1, l], j in [1, k]} in R^{l x k x r x d'}. We denote E_u in R^{k x r x d'} as the encoded neighbours for chunk u in [1, l].\n
Chunked cross-attention. To perform the Cca operation, we first split a given intermediate activation H in R^{n x d} into l - 1 attending chunks H_u^+ := (h_{um+i-1})_{i in [1, m]} in R^{m x d}, for u in [1, l - 1], as depicted on the right of Fig. 2. H_u^+ holds the intermediary embeddings of the last token in chunk C_u and of the first m - 1 tokens in C_{u+1}. We compute the cross-attention between H_u^+ and E_u - the encoded retrieval set obtained from chunk C_u. Attention is computed across time and across neighbours simultaneously, as we merge the neighbour and time dimensions of E_u before applying cross-attention. Since there is a notion of alignment between data chunks and retrieval neighbours, we use relative positional encodings as described in §B.1.2.\n
We concatenate the l - 1 outputs of the per-chunk cross-attentions (each of shape m x d) across time, and properly pad the result; we thus form the output activation Cca(H, E) in R^{n x d}. Formally, for each chunk C_u and for each token i in [1, m] we set\n
Cca(H, E)_{um+i-1} := Ca(h_{um+i-1}, E_u),   (3)\n
where Ca is the cross-attention residual operator over time-concatenated encoded neighbours. We recall that this operator is defined in its simplest version by three parameter matrices K in R^{d x c}, Q in R^{d x c} and V in R^{d x d}. For all h in R^d and Y in R^{T x d}, we define\n
Ca(h, Y) := softmax(Y K Q^T h) Y V,   (4)\n
where the softmax is performed on the second dimension and all products are matrix products. We use multi-head cross-attention, and add positional encodings to the softmax (see §B.1.2). The last token of chunk C_u is the first to be able to access the retrieved content E_u while maintaining autoregressivity in (1); hence, there is a one token overlap between chunk C_u and the corresponding attending chunk C_u^+.\n
The first m - 1 tokens cannot attend to any neighbour of a previous chunk; at these positions, we define Cca as the identity, setting Cca(H, E)_j := h_j for all tokens j in [1, m - 1]. Finally, the last token h_{lm} attends to the last retrieval set E_l and we set h_{lm} := Ca(h_{lm}, E_l) (not shown in Fig. 2). Listing 1 contains a simplified implementation of Cca. Note that chunked cross-attention is autoregressive: the output of Cca at position i depends on the sequence of tokens from 0 to i that is input to Cca.\n
With RETRO models, even though each Cca cross-attention attends only to the neighbours of the preceding chunk Ret(C_{u-1}), the dependencies over previous neighbours are propagated via the self-attention operations. The activations of the i-th token in the u-th chunk therefore potentially depend upon the set of all previous neighbours Ret(C_{u'})_{u' < u}, without incurring the quadratic cost of cross attending to that set.\n
[Algorithm 1: Overview of RETRO model architecture - encoder and decoder loops with causal attention, chunked cross-attention at layers P, and feed-forward blocks.]\n
Sampling. When sampling, at the end of a chunk C_u, we use SCaNN to retrieve neighbours Ret(C_u), based on the embedding BERT(C_u). The encoded neighbours E_u = Encoder(Ret(C_u)) are then used to condition the generation of the next chunk C_{u+1}, which we do incrementally: overall the cost of sampling is thus quadratic in the size of the sampled sequence, as when sampling from regular Transformers; the added cost of retrieval is linear in the number of chunks l, and is negligible compared to the token sampling cost in practice.\n
[2.5. Baseline Transformer architecture]\n
We use a transformer (Vaswani et al., 2017) similar to the one described in (Radford et al., 2019), with some minimal changes: we replace LayerNorm with RMSNorm (Zhang and Sennrich, 2019) and use relative position encodings (Dai et al., 2019). As baselines, we train retrieval-free transformers with 132M, 368M, 1.3B and 7.0B parameters (embedding matrices are excluded from parameter counts). The hyperparameters we used are detailed in Table 2. All retrieval models use the same size encoder for the retrieval data, with d' = 896 and 2 layers, which roughly adds 19M parameters. The encoder uses relative positional encodings. The retrieval models contain one RETRO-block every 3 blocks, starting from layer 6. For our smallest model, CCA is applied in layers 6, 9 and 12 of the main pathway and also once for query conditioning in the encoder, which adds an additional 12M parameters. The relative number of extra parameters reduces as we increase the baseline model size. All models are implemented using JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020).\n
[5. Conclusion]\n
We present Retrieval-Enhanced Transformers (RETRO), a method for modelling arbitrary text sequences whilst retrieving from databases with trillions of tokens - scaling the data available to models by an order of magnitude compared to what is typically consumed during training. RETRO model gains do not diminish for models with up to at least 7B parameters, and correspond to non-retrieval models with 10x more parameters on certain datasets. On Wikitext103 and the Pile, RETRO outperforms previous models trained on large scale datasets. We also show that RETRO is competitive on retrieval-intensive downstream tasks such as question answering.\n
RETRO models are flexible and can be used without retrieval at evaluation and still achieve comparable performance to baseline models. Conversely, baseline models can be rapidly fine-tuned into RETRO models to obtain nearly the same performance as if trained from scratch. Careful analysis shows that only a modest fraction of the gains obtained by RETRO are due to test set leakage. In general, we caution for such leakage in large-scale language datasets and suggest further work in better understanding the role of test set leakage in the performance of large-scale language models.\n
Overall, our work demonstrates at an unprecedented scale that semi-parametric approaches can provide an orthogonal, more efficient approach than raw parameter scaling as we seek to build more powerful language models.\n
Table 6 | Sample - Beavers are interesting animals. The RETRO[OFF] sample quickly diverges to other animals while the RETRO[ON] sample tends to stay focused on the beaver topic due to neighbour conditioning. [Qualitative sample omitted.]\n
Table 7 | Sample - Hamlet, Act 1, Scene 1. The RETRO[OFF] sample has correct syntax but is hallucinated, and ends with repetition of one character (FRANCISCO Approach me not). The RETRO[ON] sample is the correct continuation of the original text, and is robust to formatting differences between our prompt and the retrieved data. [Qualitative sample omitted.]',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000, 0.4898, -0.0854],
# [ 0.4898, 1.0000, -0.0389],
# [-0.0854, -0.0389, 1.0000]])
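Because the training pairs map a short question (`sentence_0`) to a paper excerpt (`sentence_1`), a typical use of this model is ranking candidate documents for a query. The snippet below is a minimal sketch using the public sentence-transformers API; the model id `"sentence_transformers_model_id"`, the query and the corpus strings are placeholders, not values taken from this card.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder id: replace with the actual repository name of this model.
model = SentenceTransformer("sentence_transformers_model_id")

query = "How can I efficiently model long sequences in machine learning?"
corpus = [
    "Informer: an efficient transformer-based model for long sequence time-series forecasting.",
    "GROVER learns molecular representations with self-supervised message passing transformers.",
    "NL-Augmenter is a participatory framework for natural language data augmentation.",
]

# Encode query and corpus into 384-dimensional embeddings (all-MiniLM-L6-v2 backbone).
query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Rank corpus entries by cosine similarity to the query.
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```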
Training Details
Training Dataset
Unnamed Dataset
- Size: 2,970 training samples
- Columns: `sentence_0` and `sentence_1`
- Approximate statistics based on the first 1000 samples:

  |         | sentence_0 | sentence_1 |
  |:--------|:-----------|:-----------|
  | type    | string     | string     |
  | details | min: 8 tokens, mean: 25.32 tokens, max: 80 tokens | min: 170 tokens, mean: 252.85 tokens, max: 256 tokens |

- Samples:
  | sentence_0 | sentence_1 |
  |:-----------|:-----------|
  | What methods do language models use to predict mutation effects on proteins? | [ABSTRACT] Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards and robustness analysis results are available publicly on the NL-Augmenter repository (https://github.com/GEM-benchmark/NL-Augmenter). [1 Introduction] Data augmentation, the act of creating new datapoints by slightly modifying copies or creating synthetic da... |
  | How can I efficiently model long sequences in machine learning? | [ABSTRACT] Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, including quadratic time complexity, high memory usage, and inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a ProbSparse self-attention mechanism, which achieves O(L log L) in time complexity and memory usage, and has comparable performance on sequences' dependency alignment. (ii) the self-attention distill... |
  | What methods exist for learning from interconnected datasets? | [ABSTRACT] How to obtain informative representations of molecules is a crucial prerequisite in AI-driven drug design and discovery. Recent researches abstract molecules as graphs and employ Graph Neural Networks (GNNs) for molecular representation learning. Nevertheless, two issues impede the usage of GNNs in real scenarios: (1) insufficient labeled molecules for supervised training; (2) poor generalization capability to new-synthesized molecules. To address them both, we propose a novel framework, GROVER, which stands for Graph Representation frOm selfsuperVised mEssage passing tRansformer. With carefully designed self-supervised tasks in node-, edge- and graph-level, GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data. Rather, to encode such complex information, GROVER integrates Message Passing Networks into the Transformer-style architecture to deliver a class of more expressive encoders of molecules. The flexibility of GROVER... |

- Loss: MultipleNegativesRankingLoss with these parameters: `{"scale": 20.0, "similarity_fct": "cos_sim", "gather_across_devices": false, "directions": ["query_to_doc"], "partition_mode": "joint", "hardness_mode": null, "hardness_strength": 0.0}`
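For reference, MultipleNegativesRankingLoss treats each `sentence_1` as the positive for its paired `sentence_0` and every other `sentence_1` in the batch as a negative, applying a cross-entropy over cosine similarities scaled by the factor of 20 listed above. The following is a minimal standalone sketch of that objective, not the library's internal implementation:

```python
import torch
import torch.nn.functional as F

def mnr_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """In-batch-negatives ranking loss: query i should match document i.

    query_emb, doc_emb: (batch, dim) embeddings of sentence_0 / sentence_1.
    """
    # Cosine similarity matrix between every query and every document in the batch.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    scores = scale * q @ d.T  # (batch, batch)
    # The matching document for query i sits on the diagonal.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# Example with random embeddings standing in for model outputs (batch 32, dim 384).
loss = mnr_loss(torch.randn(32, 384), torch.randn(32, 384))
```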
Training Hyperparameters
Non-Default Hyperparameters
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- multi_dataset_batch_sampler: round_robin
All Hyperparameters
Click to expand
- per_device_train_batch_size: 32
- num_train_epochs: 3
- max_steps: -1
- learning_rate: 5e-05
- lr_scheduler_type: linear
- lr_scheduler_kwargs: None
- warmup_steps: 0
- optim: adamw_torch_fused
- optim_args: None
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- optim_target_modules: None
- gradient_accumulation_steps: 1
- average_tokens_across_devices: True
- max_grad_norm: 1
- label_smoothing_factor: 0.0
- bf16: False
- fp16: False
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- use_liger_kernel: False
- liger_kernel_config: None
- use_cache: False
- neftune_noise_alpha: None
- torch_empty_cache_steps: None
- auto_find_batch_size: False
- log_on_each_node: True
- logging_nan_inf_filter: True
- include_num_input_tokens_seen: no
- log_level: passive
- log_level_replica: warning
- disable_tqdm: False
- project: huggingface
- trackio_space_id: trackio
- eval_strategy: no
- per_device_eval_batch_size: 32
- prediction_loss_only: True
- eval_on_start: False
- eval_do_concat_batches: True
- eval_use_gather_object: False
- eval_accumulation_steps: None
- include_for_metrics: []
- batch_eval_metrics: False
- save_only_model: False
- save_on_each_node: False
- enable_jit_checkpoint: False
- push_to_hub: False
- hub_private_repo: None
- hub_model_id: None
- hub_strategy: every_save
- hub_always_push: False
- hub_revision: None
- load_best_model_at_end: False
- ignore_data_skip: False
- restore_callback_states_from_checkpoint: False
- full_determinism: False
- seed: 42
- data_seed: None
- use_cpu: False
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- dataloader_prefetch_factor: None
- remove_unused_columns: True
- label_names: None
- train_sampling_strategy: random
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- ddp_backend: None
- ddp_timeout: 1800
- fsdp: []
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- deepspeed: None
- debug: []
- skip_memory_metrics: True
- do_predict: False
- resume_from_checkpoint: None
- warmup_ratio: None
- local_rank: -1
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
- router_mapping: {}
- learning_rate_mapping: {}
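Putting the dataset, loss and the non-default hyperparameters above together, a run of this kind can be reproduced approximately with the standard SentenceTransformerTrainer API. The sketch below makes assumptions: the single-row dataset stands in for the unnamed 2,970-pair training set, and only the documented hyperparameters (3 epochs, batch size 32, learning rate 5e-5, scale 20.0) are set explicitly.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Placeholder for the unnamed training set of (sentence_0, sentence_1) pairs.
train_dataset = Dataset.from_dict({
    "sentence_0": ["How can I efficiently model long sequences in machine learning?"],
    "sentence_1": ["[ABSTRACT] Many real-world applications require the prediction of long sequence time-series..."],
})

loss = MultipleNegativesRankingLoss(model, scale=20.0)  # cosine similarity by default

args = SentenceTransformerTrainingArguments(
    output_dir="outputs",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```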
Framework Versions
- Python: 3.13.11
- Sentence Transformers: 5.3.0
- Transformers: 5.5.0
- PyTorch: 2.11.0+cu130
- Accelerate: 1.13.0
- Datasets: 4.8.4
- Tokenizers: 0.22.2
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{oord2019representationlearningcontrastivepredictive,
title={Representation Learning with Contrastive Predictive Coding},
author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
year={2019},
eprint={1807.03748},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1807.03748},
}