Use full fp16 transformer model
Use Split-Scaling to avoid fp16 overflow:
Overflow Pattern:
` SiLU(w1) × w3 → [Inf!] → w2 → SLN`
Split Scaling Fix:
SiLU(w1) × (1/8) ──┐
                   ├→ Mul_1 → w2 → SLN (no overflow)
w3 × (1/16) ───────┘
Math: Mul_1 = SiLU(w1)/8 × w3/16 = SiLU(w1)×w3/128
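
The arithmetic can be checked with a minimal sketch that emulates fp16 rounding using only the Python standard library. The helper names (`to_fp16`, `gated_product`, `gated_product_split`) and the sample activations 600 and 500 are illustrative assumptions, not values from the model. How the leftover 1/128 factor is handled downstream is not stated in this message and is left out of the sketch.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (overflow -> +/-inf)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range (max finite 65504).
        return math.copysign(math.inf, x)

def silu(x: float) -> float:
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

def gated_product(gate: float, up: float) -> float:
    """Unscaled pattern: SiLU(w1) x w3, with every intermediate rounded to fp16."""
    return to_fp16(to_fp16(silu(gate)) * to_fp16(up))

def gated_product_split(gate: float, up: float) -> float:
    """Split scaling: scale each branch before the multiply (1/8 and 1/16)."""
    a = to_fp16(to_fp16(silu(gate)) * 0.125)   # SiLU(w1) x (1/8)
    b = to_fp16(to_fp16(up) * 0.0625)          # w3 x (1/16)
    return to_fp16(a * b)                      # Mul_1 = SiLU(w1) x w3 / 128

# Hypothetical activations (outputs of w1 and w3) whose product exceeds 65504.
gate, up = 600.0, 500.0
print(gated_product(gate, up))        # overflows to inf
print(gated_product_split(gate, up))  # finite, close to 600 * 500 / 128
```

Both scale factors are powers of two, so applying them changes only the fp16 exponent and adds no rounding error unless a value falls into the subnormal range.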