Order of the LayerNorm in T5 Model

#28

by dkarthikeyan1 - opened Jul 14, 2024

Discussion

dkarthikeyan1

Jul 14, 2024

edited Jul 14, 2024

Hi all,

I was going through the T5 paper and noticed the authors mention that their LayerNorm placement differs from Vaswani et al. 2017 (AAYN). The AAYN paper applies LayerNorm to the outputs of the multi-headed attention (MHA) and FFN sublayers, giving LayerNorm(x + SubLayer(x)), whereas T5 applies it to the inputs of the MHA and FFN, so the residual connection becomes x + SubLayer(LayerNorm(x)). However, when I looked at the T5 model, I noticed that the T5LayerNorm comes after the T5Attention. Is this just how the model architecture is printed, or a potential deviation from the paper?
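To make the two orderings concrete, here is a minimal sketch in plain Python. It is not the Transformers code: `sublayer` is a hypothetical stand-in for MHA or the FFN, and `layer_norm` is a plain mean/variance normalization with no learned parameters, which is an assumption for illustration (the paper describes T5's own normalization as a simplified variant).

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x):
    """Hypothetical stand-in for MHA or the FFN: an arbitrary element-wise map."""
    return [2.0 * v + 1.0 for v in x]


def post_ln_block(x):
    """Ordering in Vaswani et al. 2017: LayerNorm(x + SubLayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x):
    """Ordering described in the T5 paper: x + SubLayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


x = [1.0, 2.0, 4.0]
post = post_ln_block(x)  # output is normalized, so the residual path is rescaled
pre = pre_ln_block(x)    # x passes through the residual path untouched
```

In the pre-LN form the input `x` is added back unnormalized, so the block output is generally not zero-mean, while the post-LN output always is. As far as I understand, printing a PyTorch module lists its submodules in the order they were registered in `__init__`, not the order they are called in `forward`, so the printout on its own may not show which of the two forms is actually computed.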

Thanks!

dkarthikeyan1 changed discussion status to closed Jul 17, 2024
