Gemma 2's Flash attention 2 implementation is strange...

#23

by GPT007 - opened Jul 11, 2024

Jul 11, 2024

I tested with torch.manual_seed(0).

eager attention => normal result
flash attention 2 => 1's not. The 2's "to be's for.3' for4. 2 That 4 2 the 4 that 4 for. 4's 4' to 4''' the 4'' to. 4' 4 4 4to lose to. 4 the' 4 4 4' 4' 4 the 4 the 4 4 4 ...
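
For anyone trying to reproduce this comparison, a minimal sketch might look like the following. The helper names, the prompt, the `max_new_tokens` value, and the choice of bfloat16 on a CUDA device are assumptions for illustration and are not taken from the post; only the fixed seed and the `attn_implementation` values come from this thread.

```python
KNOWN_IMPLEMENTATIONS = ("eager", "sdpa", "flash_attention_2")


def check_attn_implementation(name):
    """Return name unchanged if it is one of the backends discussed here."""
    if name not in KNOWN_IMPLEMENTATIONS:
        raise ValueError(f"unknown attn_implementation: {name!r}")
    return name


def generate_once(model_id, attn_implementation, prompt, max_new_tokens=64):
    """Load the model with one attention backend and generate from a fixed seed."""
    # Imported lazily so the check above can be used without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    torch.manual_seed(0)  # same seed as in the post above
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        attn_implementation=check_attn_implementation(attn_implementation),
        torch_dtype=torch.bfloat16,  # assumption: half precision on a GPU
    ).to("cuda")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_once(model_id, "eager", prompt)` and `generate_once(model_id, "flash_attention_2", prompt)` with the same Gemma 2 checkpoint (for example `google/gemma-2-9b-it`, which is an assumption here) and comparing the two strings should show whether the outputs diverge.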

rsdfsfas

Jul 12, 2024

It is almost the same without any attention

GPT007

Jul 12, 2024

With "eager", it works well.

zokica

Jul 12, 2024

Yes, it should be fixed when you install the new version of flash attention from source.

GPT007

Jul 12, 2024

I installed it yesterday 😅
And on Windows, so it took a few hours 😨

GPT007

Jul 12, 2024

pip freeze | findstr flash-attn
flash-attn==2.5.9.post1

GPT007

Jul 12, 2024

edited Jul 12, 2024

OH NO

GPT007

Jul 12, 2024

Took 2 hours, but finally installed flash-attention >= 2.6.0
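
A quick way to check the installed version against the 2.6.0 threshold mentioned above, without reading `pip freeze` by hand, could be sketched as follows. The function names are hypothetical, and the parser only looks at the numeric release part, so suffixes such as `.post1` are ignored.

```python
import re
from importlib import metadata

MIN_FLASH_ATTN = (2, 6, 0)  # threshold taken from the post above


def release_tuple(version):
    """Extract the numeric release part, e.g. '2.5.9.post1' -> (2, 5, 9)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    # Pad short versions such as '2.6' to three components before comparing.
    return parts + (0,) * (3 - len(parts))


def is_new_enough(version, minimum=MIN_FLASH_ATTN):
    """True if the release part of version is at least minimum."""
    return release_tuple(version) >= minimum


def installed_flash_attn_version():
    """Return the installed flash-attn version string, or None if absent."""
    try:
        return metadata.version("flash-attn")
    except metadata.PackageNotFoundError:
        return None
```

With the version reported earlier in this thread, `is_new_enough("2.5.9.post1")` evaluates to `False`, which matches the broken output seen before the upgrade.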

GPT007 changed discussion status to closed Jul 12, 2024

GPT007 changed discussion status to open Jul 12, 2024

62 hidden messages

rsdfsfas

Jul 28, 2024

It works just by changing these lines, but it is a bit slower than without flash attention and it uses the same amount of memory.

Maybe there is still something broken.

It does output a good response.

GPT007

Jul 28, 2024

Started process with the eager attn_implementation.
The eager attn_implementation took 15.17s to infer {tokens} tokens.
Started process with the sdpa attn_implementation.
The sdpa attn_implementation took 21.51s to infer {tokens} tokens.
Started process with the flash_attention_2 attn_implementation.
The flash_attention_2 attn_implementation took 30.53s to infer {tokens} tokens.
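
The literal `{tokens}` in the log above suggests the message was printed without f-string interpolation, though that is only a guess. A minimal sketch of a timing loop that produces the same messages with the token count filled in might look like this. `time_implementations` and `run_fn` are hypothetical names; `run_fn` stands for whatever loads the model with the given backend, generates, and returns the number of new tokens.

```python
import time


def time_implementations(run_fn, implementations=("eager", "sdpa", "flash_attention_2")):
    """Time run_fn(impl) for each backend; run_fn must return the token count."""
    results = {}
    for impl in implementations:
        print(f"Started process with the {impl} attn_implementation.")
        start = time.perf_counter()
        tokens = run_fn(impl)
        elapsed = time.perf_counter() - start
        # f-string, so {tokens} is replaced by the actual count
        print(f"The {impl} attn_implementation took {elapsed:.2f}s to infer {tokens} tokens.")
        results[impl] = {
            "seconds": elapsed,
            "tokens": tokens,
            # Guard against a zero-length interval on very fast dummy runs.
            "tokens_per_second": tokens / elapsed if elapsed > 0 else float("inf"),
        }
    return results
```

Note that a first call often includes one-time setup costs, so running each backend once as a warm-up before timing it is probably worthwhile, and tokens per second is a fairer number to compare than raw seconds if the runs generate different numbers of tokens.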

rsdfsfas

Jul 28, 2024

Yes, something is very wrong. Probably won't be fixed.

zokica

Aug 7, 2024

This might fix it, at least the memory part: https://github.com/huggingface/transformers/pull/31292

GPT007

Aug 7, 2024

I know, but we need to ask for it to be applied to Gemma 2, not only to Gemma (1).

GPT007 changed discussion status to open Aug 7, 2024

zokica

Aug 7, 2024

All Gemmas are included, as far as I know.

GPT007

Aug 8, 2024

I looked at the commits, and it changed the global generation utils AND "the most used models", which include Gemma (1) but not Gemma 2.

GPT007

Aug 8, 2024

Gemma2 was not released yet when I started this, but don't worry I will add it as well, it's on the roadmap 🤗
