Pavel Rykov committed

Commit a942c12 · 1 Parent(s): 511ef2e

Sparse attention fixed
Files changed (5):
  1. README.md +47 -0
  2. config.json +5 -0
  3. configuration_rugpt3xl.py +10 -0
  4. modeling_rugpt3xl.py +133 -4
  5. tokenizer.json +0 -0
README.md CHANGED
@@ -38,6 +38,7 @@ Details in "[A family of pretrained transformer language models for Russian](htt
 | Activation | GELU |
 | Normalization | Pre-LayerNorm |
 | Position encoding | Learned absolute |
+| Attention | Alternating sparse/dense (see [Sparse Attention](#sparse-attention)) |
 | Precision | float16 |
 | Training data | 80B tokens of Russian text (4 epochs) |
 | Test perplexity | 12.05 |
@@ -225,6 +226,46 @@ trainer.train()
 
 **LoRA target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `up_proj`, `down_proj`
 
+## Sparse Attention
+
+This model was originally trained with DeepSpeed's
+[SparseSelfAttention](https://www.deepspeed.ai/tutorials/sparse-attention/) using the
+**alternating** pattern: even layers (0, 2, 4, ...) use block-sparse attention, odd layers
+(1, 3, 5, ...) use standard dense causal attention. The sparse layers use a
+`FixedSparsityConfig` derived from the "Generating Long Sequences with Sparse Transformers"
+paper (Child et al., 2019).
+
+This is a **critical** architectural detail. The model weights were optimized for this specific
+attention pattern during training. Running the model with all-dense attention degrades
+perplexity from ~12 to ~50.
+
+The converted model **fully replicates** this sparse attention pattern without any DeepSpeed
+dependency, using a precomputed block-sparse mask applied to standard dense attention.
+
+| Attention mode | Test PPL (Gazeta) |
+|---|---|
+| Sparse alternating (original training regime) | **11.68** |
+| All-dense (no sparse mask) | ~50.1 |
+
+### Sparse attention parameters
+
+The sparse pattern is controlled by `config.json` fields:
+
+| Parameter | Value | Description |
+|---|---|---|
+| `sparse_mode` | `"alternating"` | Even layers sparse, odd layers dense |
+| `sparse_block_size` | `16` | Token block size for the sparse layout |
+| `sparse_num_local_blocks` | `8` | Local attention window (8 blocks = 128 tokens) |
+| `sparse_num_global_blocks` | `1` | Global blocks per window |
+| `sparse_num_different_global_patterns` | `8` | Different heads use different global positions |
+
+Each sparse layer applies a per-head block-sparse mask. Within each window of 128 tokens,
+attention is causal (lower-triangular). Across windows, only designated "global" blocks are
+visible, with each attention head using a different global block position within the window.
+
+To disable sparse attention (e.g. for experiments), set `sparse_mode` to `"none"` in
+`config.json`. This will make all layers use standard dense causal attention.
+
 ## Architecture Details
 
 The model implements a custom `RuGPT3XLForCausalLM` class (loaded via `trust_remote_code=True`):
@@ -254,6 +295,10 @@ RuGPT3XLForCausalLM
 └── lm_head (Linear: 2048 -> 50264, no bias)
 ```
 
+Even-numbered decoder layers (0, 2, 4, ...) apply block-sparse attention masks. Odd-numbered
+layers use full causal attention. The sparse layout is precomputed at model initialization from
+the config parameters and stored as a non-persistent buffer.
+
 ## Conversion
 
 This model was converted from the original Megatron-LM checkpoint using a custom script.
@@ -291,5 +336,7 @@ For full conversion details and the script, see the
 ## Links
 
 - [A family of pretrained transformer language models for Russian](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=yPayeJIAAAAJ&citation_for_view=yPayeJIAAAAJ:Se3iqnhoufwC) - paper on Google Scholar
+- [Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509) - sparse attention paper (Child et al., 2019)
 - [ai-forever/rugpt3xl](https://huggingface.co/ai-forever/rugpt3xl) - original model
 - [ai-forever/ru-gpts](https://github.com/ai-forever/ru-gpts) - original training codebase
+- [DeepSpeed Sparse Attention](https://www.deepspeed.ai/tutorials/sparse-attention/) - original sparse attention implementation
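The window arithmetic behind the sparse parameters can be sketched in a few lines. This is an illustrative sketch, not code from the repo: `sparse_block_size * sparse_num_local_blocks` gives the 128-token local window, and a query attends locally only within its own window, causally.

```python
# Illustrative sketch (not part of the repo): token-level view of the
# *local* component of the sparse pattern, using the config.json values.
sparse_block_size = 16
sparse_num_local_blocks = 8

# 8 blocks of 16 tokens = 128-token local windows
window_tokens = sparse_block_size * sparse_num_local_blocks

def local_window(query_pos: int) -> range:
    """Key positions visible to `query_pos` via local (causal) attention."""
    window_start = (query_pos // window_tokens) * window_tokens
    return range(window_start, query_pos + 1)

# A query at position 130 lies in the second window [128, 256), so it can
# locally attend only to positions 128..130; anything earlier is reachable
# only through a head's global block.
print(list(local_window(130)))  # -> [128, 129, 130]
```

Cross-window visibility comes solely from the per-head global blocks, which is why each head using a different global position matters: together the heads cover several distinct "summary" columns per window.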
config.json CHANGED
@@ -25,6 +25,11 @@
   "eos_token_id": 1,
   "pad_token_id": 0,
   "tie_word_embeddings": false,
+  "sparse_mode": "alternating",
+  "sparse_block_size": 16,
+  "sparse_num_local_blocks": 8,
+  "sparse_num_global_blocks": 1,
+  "sparse_num_different_global_patterns": 8,
   "torch_dtype": "float16",
   "transformers_version": "5.3.0"
 }
configuration_rugpt3xl.py CHANGED
@@ -35,6 +35,11 @@ class RuGPT3XLConfig(PretrainedConfig):
         eos_token_id=1,
         pad_token_id=0,
         tie_word_embeddings=False,
+        sparse_mode="none",
+        sparse_block_size=16,
+        sparse_num_local_blocks=8,
+        sparse_num_global_blocks=1,
+        sparse_num_different_global_patterns=8,
         **kwargs,
     ):
         self.vocab_size = vocab_size
@@ -50,6 +55,11 @@ class RuGPT3XLConfig(PretrainedConfig):
         self.attention_dropout = attention_dropout
         self.output_dropout = output_dropout
         self.use_cache = use_cache
+        self.sparse_mode = sparse_mode
+        self.sparse_block_size = sparse_block_size
+        self.sparse_num_local_blocks = sparse_num_local_blocks
+        self.sparse_num_global_blocks = sparse_num_global_blocks
+        self.sparse_num_different_global_patterns = sparse_num_different_global_patterns
 
         super().__init__(
             bos_token_id=bos_token_id,
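Note that the class default is `sparse_mode="none"` while the shipped `config.json` sets `"alternating"`, so older configs without the new fields still load as all-dense. The sketch below mirrors that layer-selection logic; the 24-layer count is an assumption for illustration only (the real value comes from the config's `num_hidden_layers`):

```python
import json
from types import SimpleNamespace

# Hypothetical config snippet; num_hidden_layers=24 is assumed for illustration.
cfg = SimpleNamespace(**json.loads(
    '{"sparse_mode": "alternating", "num_hidden_layers": 24}'
))

# Mirrors the getattr-based resolution used at model init:
mode = getattr(cfg, "sparse_mode", "none")  # missing field -> dense everywhere
if mode == "alternating":
    sparse_layers = {i for i in range(cfg.num_hidden_layers) if i % 2 == 0}
elif mode == "all":
    sparse_layers = set(range(cfg.num_hidden_layers))
else:
    sparse_layers = set()

print(sorted(sparse_layers))  # even layers: 0, 2, ..., 22
```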
modeling_rugpt3xl.py CHANGED
@@ -27,6 +27,47 @@ from .configuration_rugpt3xl import RuGPT3XLConfig
 logger = logging.get_logger(__name__)
 
 
+def build_fixed_sparse_layout(
+    num_heads: int,
+    num_blocks: int,
+    num_local_blocks: int,
+    num_global_blocks: int,
+    num_different_global_patterns: int,
+) -> torch.Tensor:
+    """Replicate DeepSpeed FixedSparsityConfig.make_layout() for unidirectional attention.
+
+    Returns a boolean tensor of shape [num_heads, num_blocks, num_blocks] where
+    True means the query block can attend to the key block.
+    """
+    layout = torch.zeros((num_heads, num_blocks, num_blocks), dtype=torch.bool)
+
+    # Local attention within fixed-size windows (identical for all heads)
+    for window_start in range(0, num_blocks, num_local_blocks):
+        window_end = min(window_start + num_local_blocks, num_blocks)
+        window_size = window_end - window_start
+        layout[:, window_start:window_end, window_start:window_end] = torch.tril(
+            torch.ones(window_size, window_size, dtype=torch.bool)
+        )
+
+    # Global attention (per-head: different heads use different global block positions)
+    for h in range(num_heads):
+        first_global = num_local_blocks - (
+            1 + h % num_different_global_patterns
+        ) * num_global_blocks
+        regular_end = num_blocks - (num_blocks % num_local_blocks)
+
+        for gi in range(first_global, regular_end, num_local_blocks):
+            layout[h, gi:, gi : gi + num_global_blocks] = True
+
+        if regular_end < num_blocks:
+            start = min(
+                regular_end + first_global, num_blocks - num_global_blocks
+            )
+            layout[h, start:, start : start + num_global_blocks] = True
+
+    return layout
+
+
 class RuGPT3XLAttention(nn.Module):
     def __init__(self, config: RuGPT3XLConfig, layer_idx: int):
         super().__init__()
@@ -201,6 +242,27 @@ class RuGPT3XLModel(RuGPT3XLPreTrainedModel):
         )
         self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
 
+        self._sparse_layers: set = set()
+        if getattr(config, "sparse_mode", "none") == "alternating":
+            self._sparse_layers = {
+                i for i in range(config.num_hidden_layers) if i % 2 == 0
+            }
+        elif getattr(config, "sparse_mode", "none") == "all":
+            self._sparse_layers = set(range(config.num_hidden_layers))
+
+        if self._sparse_layers:
+            num_blocks = config.max_position_embeddings // config.sparse_block_size
+            layout = build_fixed_sparse_layout(
+                num_heads=config.num_attention_heads,
+                num_blocks=num_blocks,
+                num_local_blocks=config.sparse_num_local_blocks,
+                num_global_blocks=config.sparse_num_global_blocks,
+                num_different_global_patterns=config.sparse_num_different_global_patterns,
+            )
+            self.register_buffer("_sparse_layout", layout, persistent=False)
+        else:
+            self._sparse_layout = None
+
         self.gradient_checkpointing = False
         self.post_init()
 
@@ -278,7 +340,7 @@ class RuGPT3XLModel(RuGPT3XLPreTrainedModel):
         position_embeds = self.embed_positions(position_ids)
         hidden_states = self.embed_dropout(inputs_embeds + position_embeds)
 
-        # Build causal 4D attention mask
+        # Build causal 4D attention mask (dense layers)
         causal_mask = self._build_causal_mask(
             batch_size,
             seq_length,
@@ -288,19 +350,38 @@ class RuGPT3XLModel(RuGPT3XLPreTrainedModel):
             attention_mask,
         )
 
+        # Build sparse causal mask (sparse layers)
+        sparse_mask = None
+        if self._sparse_layout is not None and self._sparse_layers:
+            sparse_mask = self._build_sparse_causal_mask(
+                seq_length,
+                past_key_values_length,
+                hidden_states.dtype,
+                hidden_states.device,
+                self._sparse_layout,
+                self.config.sparse_block_size,
+                attention_mask,
+            )
+
         all_hidden_states = () if output_hidden_states else None
         all_self_attns = () if output_attentions else None
         next_decoder_cache = None
 
-        for decoder_layer in self.layers:
+        for layer_idx, decoder_layer in enumerate(self.layers):
             if output_hidden_states:
                 all_hidden_states += (hidden_states,)
 
+            layer_mask = (
+                sparse_mask
+                if (layer_idx in self._sparse_layers and sparse_mask is not None)
+                else causal_mask
+            )
+
             if self.gradient_checkpointing and self.training:
                 layer_outputs = self._gradient_checkpointing_func(
                     decoder_layer.__call__,
                     hidden_states,
-                    causal_mask,
+                    layer_mask,
                     position_ids,
                     past_key_values,
                     output_attentions,
@@ -309,7 +390,7 @@ class RuGPT3XLModel(RuGPT3XLPreTrainedModel):
             else:
                 layer_outputs = decoder_layer(
                     hidden_states,
-                    attention_mask=causal_mask,
+                    attention_mask=layer_mask,
                     position_ids=position_ids,
                     past_key_value=past_key_values,
                     output_attentions=output_attentions,
@@ -372,6 +453,54 @@ class RuGPT3XLModel(RuGPT3XLPreTrainedModel):
 
         return causal
 
+    @staticmethod
+    def _build_sparse_causal_mask(
+        seq_length: int,
+        past_length: int,
+        dtype: torch.dtype,
+        device: torch.device,
+        sparse_layout: torch.Tensor,
+        block_size: int,
+        attention_mask: Optional[torch.Tensor] = None,
+    ) -> torch.Tensor:
+        """Build block-sparse causal mask from precomputed layout.
+
+        Returns additive mask of shape [1, num_heads, seq_length, total_length].
+        """
+        total_length = past_length + seq_length
+        num_blocks = sparse_layout.shape[1]
+
+        q_block = (
+            torch.arange(past_length, past_length + seq_length, device=device)
+            // block_size
+        ).clamp(max=num_blocks - 1)
+        k_block = (
+            torch.arange(total_length, device=device) // block_size
+        ).clamp(max=num_blocks - 1)
+
+        layout_dev = sparse_layout.to(device)
+        block_ok = layout_dev[:, q_block][:, :, k_block]
+
+        q_pos = torch.arange(
+            past_length, past_length + seq_length, device=device
+        ).unsqueeze(1)
+        k_pos = torch.arange(total_length, device=device).unsqueeze(0)
+        causal_ok = k_pos <= q_pos
+
+        allowed = block_ok & causal_ok.unsqueeze(0)
+
+        min_val = torch.finfo(dtype).min
+        mask = torch.where(allowed, 0.0, min_val).to(dtype)
+        mask = mask.unsqueeze(0)
+
+        if attention_mask is not None:
+            pad_mask = (
+                (1 - attention_mask[:, None, None, :].to(dtype)) * min_val
+            )
+            mask = mask + pad_mask
+
+        return mask
+
 
 class RuGPT3XLForCausalLM(RuGPT3XLPreTrainedModel):
     _tied_weights_keys = {"lm_head.weight": "model.embed_tokens.weight"}
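As a sanity check on the layout logic, the same block-level rules can be reproduced in a dependency-free sketch (pure Python, toy dimensions; the real `build_fixed_sparse_layout` returns a `torch.Tensor` and is combined with a token-level causal mask at runtime):

```python
# Pure-Python sketch of the block layout built by build_fixed_sparse_layout,
# on toy dimensions. layout[h][q][k] == True means "query block q of head h
# may attend to key block k".
def fixed_sparse_layout(num_heads, num_blocks, num_local, num_global, num_patterns):
    layout = [[[False] * num_blocks for _ in range(num_blocks)]
              for _ in range(num_heads)]

    # Local: lower-triangular within each window of `num_local` blocks
    for h in range(num_heads):
        for ws in range(0, num_blocks, num_local):
            we = min(ws + num_local, num_blocks)
            for q in range(ws, we):
                for k in range(ws, q + 1):
                    layout[h][q][k] = True

    # Global: head-dependent global columns, visible to all later query blocks
    for h in range(num_heads):
        first_global = num_local - (1 + h % num_patterns) * num_global
        regular_end = num_blocks - (num_blocks % num_local)
        for gi in range(first_global, regular_end, num_local):
            for q in range(gi, num_blocks):
                for k in range(gi, min(gi + num_global, num_blocks)):
                    layout[h][q][k] = True
    return layout

L = fixed_sparse_layout(num_heads=2, num_blocks=16,
                        num_local=4, num_global=1, num_patterns=2)

# Local attention is causal within a window and never crosses windows on its own:
assert L[0][1][0] and not L[0][0][1]
assert not L[0][5][0]
# Heads pick different global columns: head 0 uses block 3, head 1 uses block 2:
assert L[0][5][3] and not L[0][5][2]
assert L[1][5][2]
```

The trailing-window branch of the real function is omitted here because the toy `num_blocks` divides evenly by `num_local`; the asserts illustrate why all-dense inference breaks the model — the weights expect exactly this visibility pattern.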
tokenizer.json ADDED
The diff for this file is too large to render.