Upload ByteGPT-small

Files changed (6)

README.md +199 -0
config.json +19 -0
configuration_bytegpt.py +30 -0
generation_config.json +4 -0
model.safetensors +3 -0
modeling_bytegpt.py +353 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

config.json ADDED

	@@ -0,0 +1,19 @@

+{
+  "architectures": [
+    "ByteGPTForCausalLM"
+  ],
+  "auto_map": {
+    "AutoConfig": "configuration_bytegpt.ByteGPTConfig",
+    "AutoModelForCausalLM": "modeling_bytegpt.ByteGPTForCausalLM"
+  },
+  "block_size": 1024,
+  "dropout": 0.1,
+  "model_type": "ijk_byte_gpt",
+  "n_embd": 768,
+  "n_head": 12,
+  "n_layer": 12,
+  "torch_dtype": "float32",
+  "transformers_version": "4.48.2",
+  "use_flash_attention": false,
+  "vocab_size": 256
+}
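
Note on `vocab_size`: the value 256 suggests that each token is one raw byte. The commit ships no tokenizer, so the following is only a sketch of how text might map to `input_ids` under that assumption. `encode_bytes` and `decode_bytes` are hypothetical helper names, not part of this repository.

```python
def encode_bytes(text: str) -> list[int]:
    """Map text to token IDs by taking its UTF-8 bytes (assumed scheme)."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Inverse of encode_bytes; invalid byte sequences are replaced, not raised."""
    return bytes(ids).decode("utf-8", errors="replace")


ids = encode_bytes("héllo")
print(ids)                # six IDs: "é" takes two bytes in UTF-8
print(decode_bytes(ids))  # round-trips to the original string
```

Every ID produced this way is below 256, so it indexes the embedding table declared by this config without remapping.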

configuration_bytegpt.py ADDED

	@@ -0,0 +1,30 @@

+from transformers import PretrainedConfig
+class ByteGPTConfig(PretrainedConfig):
+    model_type = "ijk_byte_gpt"
+    def __init__(
+        self,
+        vocab_size: int = 259,
+        block_size: int = 128,
+        n_embd: int = 64,
+        n_head: int = 4,
+        n_layer: int = 4,
+        dropout: float = 0.1,
+        use_flash_attention: bool = False,
+        _attn_implementation_autoset: bool = False,
+        **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.auto_map = {
+            "AutoConfig": "configuration_bytegpt.ByteGPTConfig",
+            "AutoModelForCausalLM": "modeling_bytegpt.ByteGPTForCausalLM",
+        }
+        self.vocab_size = vocab_size
+        self.block_size = block_size
+        self.n_embd = n_embd
+        self.n_head = n_head
+        self.n_layer = n_layer
+        self.dropout = dropout
+        self.use_flash_attention = use_flash_attention
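
Note on head sizing: `Block` in `modeling_bytegpt.py` computes `head_size = n_embd // n_head` and then projects the concatenated heads with an `n_embd`-wide linear layer, so a configuration only works when `n_embd` is divisible by `n_head`. The config class does not check this. A minimal sketch of such a check follows. `check_bytegpt_config` is a hypothetical helper and not part of the commit.

```python
def check_bytegpt_config(cfg: dict) -> int:
    """Return the per-head width, or raise if the heads cannot tile n_embd."""
    n_embd, n_head = cfg["n_embd"], cfg["n_head"]
    if n_embd % n_head != 0:
        raise ValueError(f"n_embd={n_embd} is not divisible by n_head={n_head}")
    return n_embd // n_head


# Values from config.json in this commit, then the class defaults.
print(check_bytegpt_config({"n_embd": 768, "n_head": 12}))  # 64
print(check_bytegpt_config({"n_embd": 64, "n_head": 4}))    # 16
```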

generation_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "_from_model_config": true,
+  "transformers_version": "4.48.2"
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:158fa9d9c157e9e64ae65a338a78d81868eb98b86ad361531459f555cd3ccedf
+size 344889712
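
Note on checkpoint size: the LFS pointer reports 344,889,712 bytes. As a sanity check, the parameter count implied by `config.json` and the layer definitions in `modeling_bytegpt.py` can be tallied by hand. The sketch below assumes that every parameter is stored once as float32 (per `torch_dtype`), that `n_embd` is divisible by `n_head`, and that the causal mask `tril` is not saved, since it is a plain attribute rather than a registered buffer. `count_params` is a hypothetical helper.

```python
def count_params(vocab_size: int, block_size: int, n_embd: int, n_layer: int) -> int:
    """Tally parameters of the layers defined in modeling_bytegpt.py."""
    embeddings = vocab_size * n_embd + block_size * n_embd
    qkv = 3 * n_embd * n_embd                  # all heads together, no bias
    proj = n_embd * n_embd + n_embd            # attention output projection
    ffwd = 2 * (4 * n_embd * n_embd) + 4 * n_embd + n_embd
    norms = 2 * (2 * n_embd)                   # ln1 and ln2, weight + bias
    per_block = qkv + proj + ffwd + norms
    final_norm = 2 * n_embd
    lm_head = n_embd * vocab_size + vocab_size
    return embeddings + n_layer * per_block + final_norm + lm_head


n_params = count_params(vocab_size=256, block_size=1024, n_embd=768, n_layer=12)
print(n_params)      # 86208256
print(n_params * 4)  # 344833024 bytes as float32
```

The tally falls 56,688 bytes short of the reported file size. That gap is plausibly the safetensors header, though this is an inference and not something the pointer file states.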

modeling_bytegpt.py ADDED

	@@ -0,0 +1,353 @@

+import torch
+import torch.nn as nn
+from torch.nn import functional as F
+from transformers import PreTrainedModel, PretrainedConfig
+from transformers.modeling_outputs import CausalLMOutput
+from .configuration_bytegpt import ByteGPTConfig
+try:
+    from flash_attn.flash_attention import FlashAttention
+    FLASH_ATTENTION_AVAILABLE = (
+        True and torch.cuda.is_available()
+    )  # Only available on CUDA
+except ImportError:
+    FLASH_ATTENTION_AVAILABLE = False
+class Head(nn.Module):
+    """One head of self-attention.
+    Args:
+        head_size (int): The size of the head.
+        n_embd (int): The embedding dimension.
+        block_size (int): The block size.
+        dropout (float): The dropout rate.
+        use_flash_attention (bool): Whether to use Flash Attention.
+    Attributes:
+        key (nn.Linear): The linear layer for computing the keys.
+        query (nn.Linear): The linear layer for computing the queries.
+        value (nn.Linear): The linear layer for computing the values.
+        tril (torch.Tensor): The lower triangular matrix.
+        dropout (nn.Dropout): The dropout layer.
+        use_flash_attention (bool): Whether to use Flash Attention.
+        flash_attention (FlashAttention): The FlashAttention module.
+    """
+    def __init__(
+        self,
+        head_size: int,
+        n_embd: int,
+        block_size: int,
+        dropout: float,
+        use_flash_attention: bool = False,
+    ) -> None:
+        super().__init__()
+        self.key = nn.Linear(n_embd, head_size, bias=False)
+        self.query = nn.Linear(n_embd, head_size, bias=False)
+        self.value = nn.Linear(n_embd, head_size, bias=False)
+        self.dropout = nn.Dropout(dropout)
+        # Only enable flash attention if we're on CUDA
+        self.use_flash_attention = use_flash_attention and FLASH_ATTENTION_AVAILABLE
+        if self.use_flash_attention:
+            print("Using Flash Attention")
+            self.flash_attention = FlashAttention()
+        else:
+            if use_flash_attention:
+                print(
+                    "Flash Attention requested but not available. Using standard attention."
+                )
+            self.tril = torch.tril(torch.ones(block_size, block_size))
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Perform forward pass through the attention head.
+        Args:
+            x (torch.Tensor): The input tensor of shape (batch_size, sequence_length, embedding_dimension).
+        Returns:
+            torch.Tensor: The output tensor of shape (batch_size, sequence_length, head_size).
+        """
+        B, T, C = x.shape
+        k = self.key(x)  # (B,T,head_size)
+        q = self.query(x)  # (B,T,head_size)
+        v = self.value(x)  # (B,T,head_size)
+        if self.use_flash_attention:
+            # Flash Attention expects shape (B, H, T, D)
+            out = self.flash_attention(q.unsqueeze(1), k.unsqueeze(1), v.unsqueeze(1))[
+                0
+            ].squeeze(1)
+        else:
+            # Regular attention
+            self.tril = self.tril.to(x.device)
+            wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
+            wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # (B, T, T)
+            wei = F.softmax(wei, dim=-1)  # (B, T, T)
+            wei = self.dropout(wei)
+            out = wei @ v  # (B, T, head_size)
+        return out
+class MultiHeadAttention(nn.Module):
+    """Multiple heads of self-attention in parallel.
+    Args:
+        num_heads (int): The number of heads.
+        head_size (int): The size of each head.
+        n_embd (int): The embedding dimension.
+        block_size (int): The block size.
+        dropout (float): The dropout rate.
+        use_flash_attention (bool): Whether to use Flash Attention.
+    Attributes:
+        heads (nn.ModuleList): The list of attention heads.
+        proj (nn.Linear): The linear layer for projecting the concatenated heads.
+        dropout (nn.Dropout): The dropout layer.
+    """
+    def __init__(
+        self,
+        num_heads: int,
+        head_size: int,
+        n_embd: int,
+        block_size: int,
+        dropout: float,
+        use_flash_attention: bool = False,
+    ) -> None:
+        super().__init__()
+        self.heads = nn.ModuleList(
+            [
+                Head(
+                    head_size,
+                    n_embd,
+                    block_size,
+                    dropout,
+                    use_flash_attention=use_flash_attention,
+                )
+                for _ in range(num_heads)
+            ]
+        )
+        self.proj = nn.Linear(n_embd, n_embd)
+        self.dropout = nn.Dropout(dropout)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Perform forward pass through the multi-head attention layer.
+        Args:
+            x (torch.Tensor): The input tensor of shape (batch_size, sequence_length, embedding_dimension).
+        Returns:
+            torch.Tensor: The output tensor of shape (batch_size, sequence_length, embedding_dimension).
+        """
+        out = torch.cat([h(x) for h in self.heads], dim=-1)
+        out = self.dropout(self.proj(out))
+        return out
+class FeedForward(nn.Module):
+    """Position-wise feed-forward network: two linear layers with a ReLU in between, followed by dropout.
+    Args:
+        n_embd (int): The embedding dimension.
+        dropout (float): The dropout rate.
+    Attributes:
+        net (nn.Sequential): The sequential network of linear layers and ReLU activation.
+    """
+    def __init__(self, n_embd: int, dropout: float) -> None:
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(n_embd, 4 * n_embd),
+            nn.ReLU(),
+            nn.Linear(4 * n_embd, n_embd),
+            nn.Dropout(dropout),
+        )
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Perform forward pass through the feedforward layer.
+        Args:
+            x (torch.Tensor): The input tensor of shape (batch_size, sequence_length, embedding_dimension).
+        Returns:
+            torch.Tensor: The output tensor of shape (batch_size, sequence_length, embedding_dimension).
+        """
+        return self.net(x)
+class Block(nn.Module):
+    """Transformer block: communication followed by computation.
+    Args:
+        n_embd (int): The embedding dimension.
+        n_head (int): The number of attention heads.
+        block_size (int): The block size.
+        dropout (float): The dropout rate.
+        use_flash_attention (bool): Whether to use Flash Attention.
+    Attributes:
+        sa (MultiHeadAttention): The multi-head attention layer.
+        ffwd (FeedForward): The feedforward layer.
+        ln1 (nn.LayerNorm): The layer normalization layer for the first sublayer.
+        ln2 (nn.LayerNorm): The layer normalization layer for the second sublayer.
+    """
+    def __init__(
+        self,
+        n_embd: int,
+        n_head: int,
+        block_size: int,
+        dropout: float,
+        use_flash_attention: bool = False,
+    ) -> None:
+        super().__init__()
+        head_size = n_embd // n_head
+        self.sa = MultiHeadAttention(
+            n_head,
+            head_size,
+            n_embd,
+            block_size,
+            dropout,
+            use_flash_attention=use_flash_attention,
+        )
+        self.ffwd = FeedForward(n_embd, dropout)
+        self.ln1 = nn.LayerNorm(n_embd)
+        self.ln2 = nn.LayerNorm(n_embd)
+        # Flash attention and the causal mask are set up in Head; this flag is kept here only for logging.
+        self.use_flash_attention = use_flash_attention and FLASH_ATTENTION_AVAILABLE
+        if self.use_flash_attention:
+            print("Using Flash Attention")
+        elif use_flash_attention:
+            print(
+                "Flash Attention requested but not available. Using standard attention."
+            )
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Perform forward pass through the transformer block.
+        Args:
+            x (torch.Tensor): The input tensor of shape (batch_size, sequence_length, embedding_dimension).
+        Returns:
+            torch.Tensor: The output tensor of shape (batch_size, sequence_length, embedding_dimension).
+        """
+        x = x + self.sa(self.ln1(x))
+        x = x + self.ffwd(self.ln2(x))
+        return x
+class ByteGPTForCausalLM(PreTrainedModel):
+    config_class = ByteGPTConfig
+    def __init__(
+        self,
+        config: ByteGPTConfig,
+    ):
+        super().__init__(config)
+        self.block_size = config.block_size
+        self.token_embedding_table = nn.Embedding(config.vocab_size, config.n_embd)
+        self.position_embedding_table = nn.Embedding(config.block_size, config.n_embd)
+        self.blocks = nn.Sequential(
+            *[
+                Block(
+                    config.n_embd,
+                    config.n_head,
+                    config.block_size,
+                    config.dropout,
+                    config.use_flash_attention,
+                )
+                for _ in range(config.n_layer)
+            ]
+        )
+        self.ln_f = nn.LayerNorm(config.n_embd)
+        self.lm_head = nn.Linear(config.n_embd, config.vocab_size)
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        attention_mask: torch.Tensor = None,
+        return_dict: bool = True,
+        labels: torch.Tensor = None,
+        **kwargs
+    ):
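+        """
+        Forward pass of the model.
+        Args:
+            input_ids: Token IDs of shape (batch_size, sequence_length).
+            attention_mask: Accepted for API compatibility; not used by the model.
+            return_dict: If True, return a CausalLMOutput; otherwise return a tuple.
+            labels: Optional target IDs of shape (batch_size, sequence_length).
+                No shift is applied here, so labels must already be aligned
+                with the logits.
+        Returns:
+            CausalLMOutput, or a (logits, loss) tuple if return_dict is False.
+            When labels are given, logits are flattened to
+            (batch_size * sequence_length, vocab_size).
+        """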
+        B, T = input_ids.shape
+        # Token and position embeddings
+        tok_emb = self.token_embedding_table(input_ids)  # (B,T,C)
+        pos_emb = self.position_embedding_table(
+            torch.arange(T, device=input_ids.device)
+        )  # (T,C)
+        x = tok_emb + pos_emb  # (B,T,C)
+        # Transformer blocks
+        x = self.blocks(x)  # (B,T,C)
+        x = self.ln_f(x)  # (B,T,C)
+        # Language model head
+        logits = self.lm_head(x)  # (B,T,vocab_size)
+        if labels is None:
+            loss = None
+        else:
+            B, T, C = logits.shape
+            logits = logits.view(B * T, C)
+            labels = labels.view(B * T)
+            loss = F.cross_entropy(logits, labels)
+        if not return_dict:
+            return (logits, loss)
+        return CausalLMOutput(logits=logits, loss=loss)
+    def prepare_inputs_for_generation(self, input_ids, **kwargs):
+        # Required for .generate() to work
+        return {
+            "input_ids": input_ids,
+            "attention_mask": torch.ones_like(input_ids),
+        }
+    # def generate(
+    #     self, input_ids: torch.Tensor, max_new_tokens: int, temperature: float = 1.0
+    # ) -> torch.Tensor:
+    #     """
+    #     Generate text tokens autoregressively.
+    #     Args:
+    #         idx: Context tokens
+    #         max_new_tokens: Number of tokens to generate
+    #         temperature: Sampling temperature (higher = more random)
+    #     Returns:
+    #         Generated token sequence
+    #     """
+    #     for _ in range(max_new_tokens):
+    #         # Crop context if needed
+    #         idx_cond = input_ids[:, -self.block_size :]
+    #         # Get predictions
+    #         logits, _ = self(idx_cond)
+    #         # Focus only on the last time step
+    #         logits = logits[:, -1, :] / temperature
+    #         # Apply softmax to get probabilities
+    #         probs = F.softmax(logits, dim=-1)
+    #         # Sample from the distribution
+    #         idx_next = torch.multinomial(probs, num_samples=1)
+    #         # Append sampled index to the running sequence
+    #         idx = torch.cat((idx, idx_next), dim=1)
+    #     return idx
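
Note on labels: `forward` passes `logits` and `labels` straight to `F.cross_entropy` without shifting, so for next-byte prediction the caller has to supply labels already offset by one position. A minimal sketch of building such pairs from raw bytes follows, assuming one token per byte. `make_training_pair` is a hypothetical helper, and `block_size` mirrors the config field of the same name.

```python
def make_training_pair(
    data: bytes, start: int, block_size: int
) -> tuple[list[int], list[int]]:
    """Return (input_ids, labels) where labels[i] is the byte after input_ids[i]."""
    window = data[start : start + block_size + 1]
    if len(window) < 2:
        raise ValueError("need at least two bytes to form an input/label pair")
    input_ids = list(window[:-1])
    labels = list(window[1:])
    return input_ids, labels


inputs, labels = make_training_pair(b"hello world", start=0, block_size=4)
print(inputs)  # bytes of "hell"
print(labels)  # bytes of "ello"
```

Taking `block_size + 1` bytes per window keeps the input length at or below `block_size`, which matters because the position embedding table has only `block_size` rows.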