NotImplementedError: self._attn_implementation='flash_attention_2'

#18
by phamvantoan - opened

Hi,

Thank you for your valuable contribution with the Locate Anything model.

When testing this model, I got an error as follows:

First, I ran the model in an environment without the flash-attn package. In that case, attention automatically fell back to "sdpa" and the code ran normally.

However, after installing flash-attn in my environment from the wheel "flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"
(my torch version is 2.7.0+cu126 and my Python version is 3.11), I got the following error when running the same code:

File "/home/quadrep/.cache/huggingface/modules/transformers_modules/weights/modeling_qwen2.py", line 1335, in forward
raise NotImplementedError(f'{self._attn_implementation=}')
NotImplementedError: self._attn_implementation='flash_attention_2'

Because I am focused on computation time and VRAM usage, I would like to use flash-attn if it helps.

Could you please let me know what my problem is and how to fix it?

Hi there!

Thank you so much for your interest and for testing out the Locate Anything model!

To answer your question: the hybrid and fast modes in Locate Anything currently only support Magi Attention and do not support Flash Attention (specifically, only the MoonViT component can use Flash Attention, but the LLM component cannot).

If your hardware supports Magi Attention, please definitely give it a try!

I completely understand your concerns about computation time and VRAM, as sdpa and eager modes do consume quite a lot of memory. To help solve this, I am currently working on adding Flex Attention support to the language model. I really hope this update will be helpful for your VRAM usage in the near future!
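
In the meantime, one possible workaround when flash-attn is installed is to request "sdpa" explicitly, so the language model does not pick up "flash_attention_2". The snippet below is only a rough sketch, not the official loading code: `LLM_SUPPORTED`, `choose_llm_attention`, `flash_attn_installed`, `load_model`, and `model_path` are hypothetical names, the supported list is taken from the modes mentioned in this thread, and it is an assumption that the model's remote code honors the standard `attn_implementation` argument of `from_pretrained` for the LLM component.

```python
import importlib.util

# Assumption: backends the LLM component accepts, based on this thread
# ("sdpa" and "eager" work; "flash_attention_2" raises NotImplementedError).
LLM_SUPPORTED = ("sdpa", "eager")


def choose_llm_attention(requested: str, supported=LLM_SUPPORTED) -> str:
    """Return `requested` if the LLM supports it, otherwise the first supported backend."""
    if requested in supported:
        return requested
    return supported[0]


def flash_attn_installed() -> bool:
    """Report whether the flash_attn package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def load_model(model_path: str):
    """Hypothetical loader: force a supported backend even if flash-attn is installed."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModel

    backend = choose_llm_attention("flash_attention_2")  # falls back to "sdpa"
    return AutoModel.from_pretrained(
        model_path,
        trust_remote_code=True,
        attn_implementation=backend,  # assumption: honored by the remote code
    )
```

If this argument is not honored by the remote code, running in an environment without flash-attn (as in the original report) gives the same "sdpa" fallback.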

Please let me know if you have any other questions. Happy coding!
@phamvantoan
