add attention return + support eager attention or triton FA2 via config.use_flash_attn
- README.md +24 -3
- bert_layers.py +92 -39
- flash_attn_triton.py +3 -3
README.md
CHANGED

@@ -7,12 +7,33 @@ tags:
 - medical
 - genomics
 ---
+
+# Note:
+This model is a copied version of DNABERT-2-117M that fixes the FlashAttention integration with Triton (specifically integrating the solution in https://github.com/Dao-AILab/flash-attention/issues/508) and also fixes the return of attention weights and hidden states in the model's forward function. The original DNABERT-2-117M model can be found at https://huggingface.co/zhihan1996/DNABERT-2-117M. If you use this model, please provide attribution to the original authors of DNABERT-2 and to the MosaicML team for their implementation.
+
+The only changes made were in `flash_attn_triton.py` and `bert_layers.py`.
+
+In `flash_attn_triton.py`, the change was:
+
+1. ```qk += tl.dot(q, k, trans_b=True)``` was replaced with ```qk += tl.dot(q, tl.trans(k))```, following the solution provided in the flash-attention issue above. The two other uses of the ```trans_b=True``` argument in the file were changed in the same way.
+
+In `bert_layers.py`, the changes were:
+
+1. **`use_flash_attn` config flag** (`BertUnpadSelfAttention`): Added `self.use_flash_attn = getattr(config, 'use_flash_attn', True)`. Setting `use_flash_attn: false` in the model config forces the PyTorch eager attention path, enabling attention weight extraction without requiring Triton.
+
+2. **Attention weight return** (`BertUnpadSelfAttention`, `BertUnpadAttention`, `BertLayer`): Added a `return_attn_weights: bool = False` parameter threaded through the call chain. When enabled, the eager path returns the `(B, H, T, T)` attention probability tensor alongside the hidden states.
+
+3. **HF-compatible encoder output** (`BertEncoder`): Added `output_attentions: bool = False`. When `output_all_encoded_layers=True`, each layer's hidden states are now padded back to `(B, T, D)` before collection (previously unpadded `(nnz, D)`), and the embedding output is prepended as index 0 to match the HuggingFace `hidden_states` convention.
+
+4. **Standard HuggingFace output objects** (`BertModel`, `BertForMaskedLM`, `BertForSequenceClassification`): `BertModel.forward` now accepts `output_hidden_states` and `output_attentions` keyword arguments and returns a `BaseModelOutputWithPooling` object with `.last_hidden_state`, `.pooler_output`, `.hidden_states`, and `.attentions` fields. `BertForMaskedLM` and `BertForSequenceClassification` were updated accordingly to read from these named fields.
+
+# Original README:
 This is the official pre-trained model introduced in [DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
 ](https://arxiv.org/pdf/2306.15006.pdf).

-We sincerely appreciate the MosaicML team for the [MosaicBERT](https://openreview.net/forum?id=5zipcfLC2Z) implementation, which serves as the base of DNABERT-2 development.
+We sincerely appreciate the MosaicML team for the [MosaicBERT](https://openreview.net/forum?id=5zipcfLC2Z) implementation, which serves as the base of DNABERT-2 development.

-DNABERT-2 is a transformer-based genome foundation model trained on multi-species genome.
+DNABERT-2 is a transformer-based genome foundation model trained on multi-species genome.

 To load the model from huggingface:
 ```

@@ -36,4 +57,4 @@ print(embedding_mean.shape) # expect to be 768
 # embedding with max pooling
 embedding_max = torch.max(hidden_states[0], dim=0)[0]
 print(embedding_max.shape) # expect to be 768
-```
+```
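
Taken together, the changes described in the note above let attention weights and per-layer hidden states be requested directly from the model. The snippet below is a minimal usage sketch rather than an official example from this repository: the repository id is a placeholder, the DNA sequence is an arbitrary example, the loading pattern mirrors the original DNABERT-2 README, and the `use_flash_attn` flag and output fields are the ones introduced by this commit.

```python
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

repo_id = "<this-model-repo>"  # placeholder: substitute the id of this repository

# use_flash_attn is read at model construction time, so set it on the config first.
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
config.use_flash_attn = False  # force the eager PyTorch attention path (no Triton needed)

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, config=config, trust_remote_code=True)

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"  # arbitrary example sequence
inputs = tokenizer(dna, return_tensors="pt")

with torch.no_grad():
    outputs = model(inputs["input_ids"],
                    attention_mask=inputs["attention_mask"],
                    output_hidden_states=True,
                    output_attentions=True)

print(outputs.last_hidden_state.shape)  # (1, T, 768)
print(len(outputs.hidden_states))       # embedding output + one entry per layer
print(outputs.attentions[0].shape)      # (1, num_heads, T, T)
```

Note that requesting `output_attentions=True` falls back to the eager path on its own as well, since the Triton kernel does not materialize the attention probability matrix.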
bert_layers.py
CHANGED

@@ -16,7 +16,8 @@ import torch.nn as nn
 from einops import rearrange
 from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present
 from transformers.activations import ACT2FN
-from transformers.modeling_outputs import (
+from transformers.modeling_outputs import (BaseModelOutputWithPooling,
+                                           MaskedLMOutput,
                                            SequenceClassifierOutput)
 from transformers.models.bert.modeling_bert import BertPreTrainedModel
 from transformers.modeling_utils import PreTrainedModel

@@ -120,6 +121,7 @@ class BertUnpadSelfAttention(nn.Module):
         self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
         self.p_dropout = config.attention_probs_dropout_prob
         self.Wqkv = nn.Linear(self.all_head_size, 3 * config.hidden_size)
+        self.use_flash_attn = getattr(config, 'use_flash_attn', True)

         # Warn if defaulting to pytorch because of import issues
         if flash_attn_qkvpacked_func is None:

@@ -129,7 +131,8 @@ class BertUnpadSelfAttention(nn.Module):

     def forward(self, hidden_states: torch.Tensor, cu_seqlens: torch.Tensor,
                 max_seqlen_in_batch: int, indices: torch.Tensor,
-                attn_mask: torch.Tensor, bias: torch.Tensor
+                attn_mask: torch.Tensor, bias: torch.Tensor,
+                return_attn_weights: bool = False) -> torch.Tensor:
         """Perform self-attention.

         If dropout is zero, then we can use the Triton kernel, so we do that. However, if not, we send through a standard PyTorch

@@ -158,7 +161,7 @@ class BertUnpadSelfAttention(nn.Module):
                         'b s (t h d) -> b s t h d',
                         t=3,
                         h=self.num_attention_heads)
-        if self.p_dropout or flash_attn_qkvpacked_func is None:
+        if self.p_dropout or flash_attn_qkvpacked_func is None or not self.use_flash_attn or return_attn_weights:
             # if we have nonzero attention dropout (e.g. during fine-tuning) or no Triton, compute attention in PyTorch
             q = qkv[:, :, 0, :, :].permute(0, 2, 1, 3)  # b h s d
             k = qkv[:, :, 1, :, :].permute(0, 2, 3, 1)  # b h d s

@@ -172,6 +175,7 @@ class BertUnpadSelfAttention(nn.Module):
                 3)  # b s h d
         else:
             # Triton implementation only supports 0 attention dropout
+            attention_probs = None
             convert_dtype = qkv.dtype not in [torch.float16, torch.bfloat16]
             if convert_dtype:
                 # Triton implementation only supports fp16 and bf16

@@ -187,7 +191,10 @@ class BertUnpadSelfAttention(nn.Module):

         # attn_mask is 1 for attend and 0 for don't
         attention = unpad_input_only(attention, torch.squeeze(attn_mask) == 1)
+        out = rearrange(attention, 'nnz h d -> nnz (h d)')
+        if return_attn_weights:
+            return out, attention_probs  # (nnz, D), (B, H, T, T)
+        return out


 # Copy of transformer's library BertSelfOutput that will not be caught by surgery methods looking for HF BERT modules.

@@ -225,6 +232,7 @@ class BertUnpadAttention(nn.Module):
         indices: Optional[torch.Tensor] = None,
         attn_mask: Optional[torch.Tensor] = None,
         bias: Optional[torch.Tensor] = None,
+        return_attn_weights: bool = False,
     ) -> torch.Tensor:
         """Forward pass for scaled self-attention without padding.

@@ -237,14 +245,24 @@
             indices: None or (total_nnz,)
             attn_mask: None or (batch, max_seqlen_in_batch)
             bias: None or (batch, heads, max_seqlen_in_batch, max_seqlen_in_batch)
+            return_attn_weights: If True, return attention probabilities alongside output.
         """
+        if return_attn_weights:
+            self_output, attn_probs = self.self(
+                input_tensor, cu_seqlens, max_s, indices, attn_mask, bias,
+                return_attn_weights=True)
+        else:
+            self_output = self.self(input_tensor, cu_seqlens, max_s, indices,
+                                    attn_mask, bias)
+            attn_probs = None
         if subset_idx is not None:
+            output = self.output(index_first_axis(self_output, subset_idx),
+                                 index_first_axis(input_tensor, subset_idx))
         else:
+            output = self.output(self_output, input_tensor)
+        if return_attn_weights:
+            return output, attn_probs
+        return output


 class BertGatedLinearUnitMLP(nn.Module):

@@ -312,6 +330,7 @@ class BertLayer(nn.Module):
         indices: Optional[torch.Tensor] = None,
         attn_mask: Optional[torch.Tensor] = None,
         bias: Optional[torch.Tensor] = None,
+        return_attn_weights: bool = False,
     ) -> torch.Tensor:
         """Forward pass for a BERT layer, including both attention and MLP.

@@ -324,10 +343,19 @@
             indices: None or (total_nnz,)
             attn_mask: None or (batch, max_seqlen_in_batch)
             bias: None or (batch, heads, max_seqlen_in_batch, max_seqlen_in_batch)
+            return_attn_weights: If True, return attention probabilities alongside output.
         """
+        if return_attn_weights:
+            attention_output, attn_probs = self.attention(
+                hidden_states, cu_seqlens, seqlen, subset_idx, indices,
+                attn_mask, bias, return_attn_weights=True)
+        else:
+            attention_output = self.attention(hidden_states, cu_seqlens, seqlen,
+                                              subset_idx, indices, attn_mask, bias)
+            attn_probs = None
         layer_output = self.mlp(attention_output)
+        if return_attn_weights:
+            return layer_output, attn_probs
         return layer_output


@@ -410,7 +438,8 @@ class BertEncoder(nn.Module):
         attention_mask: torch.Tensor,
         output_all_encoded_layers: Optional[bool] = True,
         subset_mask: Optional[torch.Tensor] = None,
+        output_attentions: bool = False,
+    ) -> Tuple[List[torch.Tensor], Optional[Tuple[torch.Tensor, ...]]]:

         extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
         extended_attention_mask = extended_attention_mask.to(

@@ -419,6 +448,12 @@

         attention_mask_bool = attention_mask.bool()
         batch, seqlen = hidden_states.shape[:2]
+
+        # Capture padded embedding output (B, T, D) before unpadding, so it
+        # can be prepended to all_encoder_layers as hidden_states index 0 in
+        # the HF convention (embedding = index 0, layer i = index i+1).
+        padded_embedding = hidden_states
+
         # Unpad inputs and mask. It will remove tokens that are padded.
         # Assume ntokens is total number of tokens (padded and non-padded)
         # and ntokens_unpad is total number of non-padded tokens.

@@ -442,17 +477,27 @@
         alibi_attn_mask = attn_bias + alibi_bias

         all_encoder_layers = []
+        all_attention_probs: List[torch.Tensor] = []
         if subset_mask is None:
             for layer_module in self.layer:
+                if output_attentions:
+                    hidden_states, attn_probs = layer_module(
+                        hidden_states, cu_seqlens, seqlen, None, indices,
+                        attn_mask=attention_mask, bias=alibi_attn_mask,
+                        return_attn_weights=True)
+                    all_attention_probs.append(attn_probs)
+                else:
+                    hidden_states = layer_module(hidden_states,
+                                                 cu_seqlens,
+                                                 seqlen,
+                                                 None,
+                                                 indices,
+                                                 attn_mask=attention_mask,
+                                                 bias=alibi_attn_mask)
                 if output_all_encoded_layers:
+                    # Pad back to (B, T, D) so callers get consistent shapes.
+                    all_encoder_layers.append(
+                        pad_input(hidden_states, indices, batch, seqlen))
             # Pad inputs and mask. It will insert back zero-padded tokens.
             # Assume ntokens is total number of tokens (padded and non-padded)
             # and ntokens_unpad is total number of non-padded tokens.

@@ -483,7 +528,13 @@

         if not output_all_encoded_layers:
             all_encoder_layers.append(hidden_states)
+        else:
+            # Prepend padded embedding as index 0 to match HF convention:
+            # hidden_states[0] = embedding, hidden_states[i+1] = layer i output.
+            all_encoder_layers.insert(0, padded_embedding)
+
+        attn_out = tuple(all_attention_probs) if output_attentions else None
+        return all_encoder_layers, attn_out


 class BertPooler(nn.Module):

@@ -586,8 +637,10 @@ class BertModel(BertPreTrainedModel):
         position_ids: Optional[torch.Tensor] = None,
         output_all_encoded_layers: Optional[bool] = False,
         masked_tokens_mask: Optional[torch.Tensor] = None,
+        output_hidden_states: bool = False,
+        output_attentions: bool = False,
         **kwargs
-    ) ->
+    ) -> BaseModelOutputWithPooling:
         if attention_mask is None:
             attention_mask = torch.ones_like(input_ids)
         if token_type_ids is None:

@@ -606,11 +659,12 @@
         first_col_mask[:, 0] = True
         subset_mask = masked_tokens_mask | first_col_mask

-        encoder_outputs = self.encoder(
+        encoder_outputs, all_attentions = self.encoder(
             embedding_output,
             attention_mask,
-            output_all_encoded_layers=
-            subset_mask=subset_mask
+            output_all_encoded_layers=output_hidden_states,
+            subset_mask=subset_mask,
+            output_attentions=output_attentions)

         if masked_tokens_mask is None:
             sequence_output = encoder_outputs[-1]

@@ -629,13 +683,12 @@
         else:
             pooled_output = None

-        return encoder_outputs, None
+        return BaseModelOutputWithPooling(
+            last_hidden_state=sequence_output,
+            pooler_output=pooled_output,
+            hidden_states=tuple(encoder_outputs) if output_hidden_states else None,
+            attentions=all_attentions,
+        )


 ###################

@@ -755,8 +808,8 @@ class BertForMaskedLM(BertPreTrainedModel):
             return_dict=return_dict,
             masked_tokens_mask=masked_tokens_mask,
         )
-        sequence_output = outputs
+
+        sequence_output = outputs.last_hidden_state
         prediction_scores = self.cls(sequence_output)

         loss = None

@@ -782,8 +835,8 @@
         return MaskedLMOutput(
             loss=loss,
             logits=prediction_scores,
-            hidden_states=outputs
-            attentions=
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
         )

     def prepare_inputs_for_generation(self, input_ids: torch.Tensor,

@@ -868,7 +921,7 @@ class BertForSequenceClassification(BertPreTrainedModel):
             return_dict=return_dict,
         )

-        pooled_output = outputs
+        pooled_output = outputs.pooler_output

         pooled_output = self.dropout(pooled_output)
         logits = self.classifier(pooled_output)

@@ -906,7 +959,7 @@
         return SequenceClassifierOutput(
             loss=loss,
             logits=logits,
-            hidden_states=outputs
-            attentions=
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
         )
flash_attn_triton.py
CHANGED

@@ -188,7 +188,7 @@ def _fwd_kernel(
                     (offs_d[None, :] < headdim),
                     other=0.0)
         qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
-        qk += tl.dot(q, k, trans_b=True)
+        qk += tl.dot(q, tl.trans(k))  # see issue: https://github.com/Dao-AILab/flash-attention/issues/508
         # Trying to combine the two masks seem to make the result wrong
         if not EVEN_N:  # Need to mask out otherwise the softmax is wrong
             qk += tl.where((start_n + offs_n)[None, :] < seqlen_k, 0,

@@ -431,7 +431,7 @@ def _bwd_kernel_one_col_block(
                     (offs_d[None, :] < headdim),
                     other=0.0)
         # recompute p = softmax(qk, dim=-1).T
-        qk = tl.dot(q, k, trans_b=True)
+        qk = tl.dot(q, tl.trans(k))  # see issue: https://github.com/Dao-AILab/flash-attention/issues/508
         # Trying to combine the two masks seem to make the result wrong
         if not EVEN_N:  # Need to mask out otherwise the softmax is wrong
             qk = tl.where(offs_n[None, :] < seqlen_k, qk, float('-inf'))

@@ -498,7 +498,7 @@
         # Also wrong for headdim=64, seqlen=(1023, 1024), and ATOMIC_ADD=False
         if not (EVEN_M & EVEN_HEADDIM):
             tl.debug_barrier()
-        dp = tl.dot(do, v, trans_b=True)
+        dp = tl.dot(do, tl.trans(v))  # see issue: https://github.com/Dao-AILab/flash-attention/issues/508
         # There's a race condition for headdim=48
         if not EVEN_HEADDIM:
             tl.debug_barrier()