Update modeling_Llamoe.py
modeling_Llamoe.py CHANGED (+2 / -3)
```diff
@@ -660,9 +660,8 @@ class LlamoeSdpaAttention(LlamoeAttention):
         key_states = repeat_kv(key_states, self.num_key_value_groups)
         value_states = repeat_kv(value_states, self.num_key_value_groups)

-
-
-        causal_mask = torch.tril(torch.ones((bsz, q_len, q_len), device=query_states.device))
+
+        causal_mask = torch.tril(torch.ones((bsz, q_len, q_len), device=query_states.device))

         # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
         # Reference: https://github.com/pytorch/pytorch/issues/112577.
```
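For context, the hunk's two context lines call `repeat_kv`, which expands grouped key/value heads so that grouped-query attention tensors match the query head count before attention is computed. The sketch below follows the standard Llama-style implementation of this helper in `transformers`; it is illustrative and not copied from this repository.

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand (batch, num_kv_heads, seq_len, head_dim) to
    (batch, num_kv_heads * n_rep, seq_len, head_dim); equivalent to
    torch.repeat_interleave(hidden_states, n_rep, dim=1)."""
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    # Insert a broadcast dimension and expand (no data copy), then flatten
    # the (num_kv_heads, n_rep) pair back into a single head axis.
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, slen, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)
```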
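The trailing comment in the hunk points at a torch==2.1.2 bug with non-contiguous inputs and a custom `attn_mask`. Assuming the new `causal_mask` feeds `torch.nn.functional.scaled_dot_product_attention` (the enclosing class is `LlamoeSdpaAttention`), a minimal sketch of how it would typically be consumed, including the usual contiguity workaround, is below. One caveat worth noting: SDPA *adds* a float mask to the attention scores, so the 0/1 mask that `torch.tril` produces only masks correctly once cast to `bool` (True = attend). All shapes here are hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes, for illustration only.
bsz, num_heads, q_len, head_dim = 2, 8, 16, 64
query_states = torch.randn(bsz, num_heads, q_len, head_dim)
key_states = torch.randn(bsz, num_heads, q_len, head_dim)
value_states = torch.randn(bsz, num_heads, q_len, head_dim)

# Same construction as in the diff: lower-triangular 0/1 float mask.
causal_mask = torch.tril(torch.ones((bsz, q_len, q_len), device=query_states.device))
# Cast to bool (True = attend) and add a head dimension so the mask
# broadcasts over num_heads; a 0/1 float mask would not mask anything.
attn_mask = causal_mask.bool()[:, None, :, :]

# Workaround for https://github.com/pytorch/pytorch/issues/112577:
# torch==2.1.2's memory-efficient backend mishandles non-contiguous
# q/k/v on CUDA when a custom attn_mask is supplied.
if query_states.device.type == "cuda" and attn_mask is not None:
    query_states = query_states.contiguous()
    key_states = key_states.contiguous()
    value_states = value_states.contiguous()

attn_output = F.scaled_dot_product_attention(
    query_states, key_states, value_states, attn_mask=attn_mask
)
print(attn_output.shape)  # torch.Size([2, 8, 16, 64])
```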