attention sinks & backward

by acforvs - opened Aug 5, 2025

Aug 5, 2025

Hi, thanks for all the amazing work!

I noticed that a new s_aux parameter has been added to the fwd pass to support attention sinks. However, I wasn't able to find any related changes in the backward pass.

Does the current implementation support training as well? If not, are there any plans to add support for attention sinks to the bwd pass?

Many thanks,
Vlad

pcuenq

Aug 6, 2025

Good question! Are there any plans for it, @danieldk?

liujingcs

Aug 7, 2025 (edited Aug 7, 2025)

Same question here! Is support for this planned?

KirillShmilovich

Oct 2, 2025

@pcuenq @acforvs Any updates here? Also, just to follow up: does this mean that any training using this kernel (SFT, GRPO) should be avoided?

yinzhangyue

Nov 7, 2025

Hi~ Is there any support for the backward pass now? Or should we still avoid any training with it?
