Upload Seq2SeqCrossFormer

Files changed (5)

README.md +199 -0
config.json +25 -0
generation_config.json +7 -0
hf_transformer.py +379 -0
model.safetensors +3 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

config.json ADDED

	@@ -0,0 +1,25 @@

+{
+  "architectures": [
+    "Seq2SeqCrossFormer"
+  ],
+  "auto_map": {
+    "AutoModel": "hf_transformer.Seq2SeqCrossFormer"
+  },
+  "bos_token_id": 1,
+  "d_ff": 2048,
+  "d_model": 512,
+  "dropout": 0.1,
+  "eos_token_id": 2,
+  "model_type": "custom_code",
+  "n_heads": 8,
+  "n_layers": 6,
+  "pad_token_id": 0,
+  "router_dim": 10,
+  "sequence_length": 8192,
+  "source_sequence_dimension": 70,
+  "target_sequence_dimension": 306,
+  "torch_dtype": "float32",
+  "transformers_version": "4.48.1",
+  "vocab_size_src": 258,
+  "vocab_size_tgt": 258
+}
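
As a quick consistency check on the hyperparameters above, the sketch below parses a copy of the relevant fields, derives the per-head width, and confirms that the special-token IDs fit inside both vocabularies. The helper names `head_dim` and `special_ids_in_vocab` are hypothetical and not part of the uploaded code. The first assumes the usual multi-head convention that `d_model` is split evenly across `n_heads`.

```python
import json

# Subset of the config.json fields shown above, copied verbatim.
CONFIG_JSON = """
{
  "bos_token_id": 1,
  "d_ff": 2048,
  "d_model": 512,
  "eos_token_id": 2,
  "n_heads": 8,
  "n_layers": 6,
  "pad_token_id": 0,
  "vocab_size_src": 258,
  "vocab_size_tgt": 258
}
"""


def head_dim(cfg: dict) -> int:
    """Hypothetical helper: width of each attention head, assuming an even split."""
    if cfg["d_model"] % cfg["n_heads"] != 0:
        raise ValueError("d_model must be divisible by n_heads")
    return cfg["d_model"] // cfg["n_heads"]


def special_ids_in_vocab(cfg: dict) -> bool:
    """Hypothetical helper: are PAD/BOS/EOS valid indices into both vocabularies?"""
    limit = min(cfg["vocab_size_src"], cfg["vocab_size_tgt"])
    keys = ("pad_token_id", "bos_token_id", "eos_token_id")
    return all(0 <= cfg[key] < limit for key in keys)


cfg = json.loads(CONFIG_JSON)
print(head_dim(cfg))              # 512 // 8
print(special_ids_in_vocab(cfg))
```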

generation_config.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "pad_token_id": 0,
+  "transformers_version": "4.48.1"
+}
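
The three special-token IDs above drive the sequence framing in `hf_transformer.py`: EOS is appended to the labels, and the decoder input is the labels shifted right behind a BOS token (teacher forcing). A minimal pure-Python sketch of that framing on a single 1D sequence follows. The function names `add_eos` and `shift_right` are illustrative stand-ins for the tensor helpers, not the uploaded implementation, and the token values `17, 42, 99` are made up for the example.

```python
from typing import List

BOS, EOS, PAD = 1, 2, 0  # values taken from generation_config.json


def add_eos(labels: List[int]) -> List[int]:
    """Append the end-of-stream token to a label sequence."""
    return labels + [EOS]


def shift_right(labels: List[int]) -> List[int]:
    """Build decoder inputs: prepend BOS and drop the last label token."""
    return [BOS] + labels[:-1]


labels = add_eos([17, 42, 99])      # targets the decoder must predict
decoder_in = shift_right(labels)    # what the decoder sees at each step

# At position i the decoder reads decoder_in[i] and is trained to emit labels[i].
for seen, target in zip(decoder_in, labels):
    print(f"input={seen:>3}  target={target:>3}")
```

Both sequences have the same length, so the cross-entropy loss can compare logits and labels position by position.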

hf_transformer.py ADDED

	@@ -0,0 +1,379 @@

+from transformers import PreTrainedModel, PretrainedConfig
+from typing import Optional, Tuple, Union
+import torch
+import torch.nn as nn
+from model.architectures.transformer import EncoderDecoderTransformer
+from model.architectures.crossformer import EncoderDecoderCrossFormer
+from model.hf_configs import Seq2SeqConfig, Seq2SeqCrossConfig
+from einops import rearrange
+class Seq2SeqTransformer(PreTrainedModel):
+    """
+    Custom Transformer for Sequence to Sequence tasks.
+    """
+    config_class = Seq2SeqConfig
+    base_model_prefix = "transformer"
+    def __init__(self, config: PretrainedConfig, device: Optional[str]=None):
+        super().__init__(config)
+        self.softmax = nn.Softmax(dim=-1)
+        self.transformer = EncoderDecoderTransformer(
+            src_vocab_size=config.vocab_size_src,
+            tgt_vocab_size=config.vocab_size_tgt,
+            embed_dim=config.d_model,
+            num_heads=config.n_heads,
+            ff_dim=config.d_ff,
+            num_encoder_layers=config.n_layers,
+            num_decoder_layers=config.n_layers,
+            max_seq_length=config.sequence_length
+        )
+        # Initialize weights
+        self.transformer.apply(self._init_weights)
+    def _init_weights(self, module: nn.Module):
+        if isinstance(module, nn.Linear):
+            nn.init.xavier_uniform_(module.weight)
+            if module.bias is not None:
+                nn.init.constant_(module.bias, 0)
+    def _create_padding_mask(self, ids: torch.LongTensor) -> torch.Tensor:
+        """Creates a float mask (-inf at padding, 1.0 elsewhere) that keeps padded tokens from interfering with attention"""
+        # First create boolean mask where True = padding token
+        is_padding = ids.eq(self.config.pad_token_id)
+        # Convert to float and replace padding positions with -inf, others with 1.0
+        mask = is_padding.float()
+        mask = mask.masked_fill(is_padding, float('-inf'))
+        mask = mask.masked_fill(~is_padding, 1.0)
+        return mask
+    def _shift_right(self, x: torch.LongTensor) -> torch.LongTensor:
+        """Helper method to prepare decoder inputs (teacher forcing) by shifting right label tokens"""
+        shifted = torch.full(
+            (*x.shape[:-1], 1),
+            self.config.bos_token_id,
+            dtype=x.dtype,
+            device=x.device
+        )
+        shifted = torch.cat([shifted, x[:, :-1]], dim=-1)
+        return shifted
+    def _add_beginning_of_stream(self, x: torch.LongTensor) -> torch.LongTensor:
+        """
+        Helper method to add BOS token to the beginning of input sequences
+        """
+        bos = torch.full(
+            (*x.shape[:-1], 1),
+            self.config.bos_token_id,
+            dtype=x.dtype,
+            device=x.device
+        )
+        return torch.cat([bos, x], dim=-1)
+    def _add_end_of_stream(self, x: torch.LongTensor) -> torch.LongTensor:
+        """Helper method to add EOS token to the end of label sequences"""
+        eos = torch.full(
+            (*x.shape[:-1], 1),
+            self.config.eos_token_id,
+            dtype=x.dtype,
+            device=x.device
+        )
+        return torch.cat([x, eos], dim=-1)
+    def forward(
+        self,
+        input_ids: torch.LongTensor,
+        labels: Optional[torch.LongTensor] = None,
+        decoder_input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        decoder_attention_mask: Optional[torch.BoolTensor] = None,
+        **kwargs
+    ) -> Union[Tuple, dict]:
+        # TODO: adding BOS/EOS tokens and right-shifting should take place outside of the model, in the tokenizer
+        # adding beginning of stream tokens to input too
+        input_ids = self._add_beginning_of_stream(input_ids)
+        # adding end of stream tokens to labels (labels are None when called from generate)
+        labels = self._add_end_of_stream(labels) if labels is not None else None
+        # Prepare input for the decoder
+        if decoder_input_ids is None and labels is not None:
+            decoder_input_ids = self._shift_right(labels)
+        src_key_padding_mask = self._create_padding_mask(input_ids)
+        tgt_key_padding_mask = self._create_padding_mask(decoder_input_ids)
+        # Forward pass through your model
+        outputs = self.transformer(
+            src=input_ids,
+            tgt=decoder_input_ids,
+            src_mask=attention_mask,
+            tgt_mask=decoder_attention_mask,
+            src_key_padding_mask=src_key_padding_mask,
+            tgt_key_padding_mask=tgt_key_padding_mask
+        )
+        loss = None
+        if labels is not None:
+            loss_fct = nn.CrossEntropyLoss(ignore_index=self.config.pad_token_id)
+            loss = loss_fct(outputs.view(-1, self.config.vocab_size_tgt), labels.view(-1))
+        return dict(
+            loss=loss,
+            logits=outputs,
+        )
+    def generate(
+        self,
+        input_ids: torch.LongTensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        max_length: Optional[int] = None,
+        temperature: float = 1.0,
+        do_sample: bool = False,
+        **kwargs
+    ) -> torch.LongTensor:
+        batch_size = input_ids.shape[0]
+        max_length = max_length or self.config.max_length or 128
+        decoder_input_ids = torch.full(
+            (batch_size, 1),
+            self.config.bos_token_id,
+            dtype=torch.long,
+            device=input_ids.device
+        )
+        for _ in range(max_length - 1):
+            outputs = self.forward(
+                input_ids=input_ids,
+                decoder_input_ids=decoder_input_ids,
+                attention_mask=attention_mask,
+            )
+            next_token_logits = outputs["logits"][:, -1, :]
+            if do_sample:
+                # Apply temperature scaling
+                scaled_logits = next_token_logits / temperature
+                # Convert to probabilities
+                next_token_probs = self.softmax(scaled_logits)
+                # Sample from the probability distribution
+                next_token = torch.multinomial(
+                    next_token_probs, num_samples=1
+                ).squeeze(-1)
+            else:
+                # Greedy decoding
+                next_token = next_token_logits.argmax(dim=-1)
+            decoder_input_ids = torch.cat(
+                [decoder_input_ids, next_token.unsqueeze(-1)],
+                dim=-1
+            )
+            # Stop if all sequences have generated EOS token
+            if (decoder_input_ids == self.config.eos_token_id).any(dim=-1).all():
+                break
+        return decoder_input_ids
+class Seq2SeqCrossFormer(Seq2SeqTransformer):
+    """CrossFormer wrapper predicting over a discrete vocabulary."""
+    config_class = Seq2SeqCrossConfig
+    def __init__(self, config: PretrainedConfig):
+        super().__init__(config)
+        self.softmax = nn.Softmax(dim=-1)
+        self.transformer = EncoderDecoderCrossFormer(
+            source_sequence_dimension=config.source_sequence_dimension,
+            target_sequence_dimension=config.target_sequence_dimension,
+            router_dim=config.router_dim,
+            src_vocab_size=config.vocab_size_src,
+            tgt_vocab_size=config.vocab_size_tgt,
+            embed_dim=config.d_model,
+            num_heads=config.n_heads,
+            ff_dim=config.d_ff,
+            num_encoder_layers=config.n_layers,
+            num_decoder_layers=config.n_layers,
+            max_seq_length=config.sequence_length
+        )
+        # Initialize weights
+        self.transformer.apply(self._init_weights)
+    def _shift_right(self, x: torch.LongTensor) -> torch.LongTensor:
+        """
+        Helper method to prepare decoder inputs (teacher forcing) by shifting right label tokens.
+        Handles 3D (B, S, C) tensors
+        """
+        # Create shape that matches x's dimensions except for seq_len which will be 1
+        shape = list(x.shape)
+        shape[-2] = 1  # Set sequence dimension to 1
+        shifted = torch.full(
+            shape,
+            self.config.bos_token_id,
+            dtype=x.dtype,
+            device=x.device
+        )
+        shifted = torch.cat([shifted, x[..., :-1, :]], dim=-2)
+        return shifted
+    def _add_beginning_of_stream(self, x: torch.LongTensor) -> torch.LongTensor:
+        """
+        Helper method to add BOS token to the beginning of input sequences.
+        Handles 3D (B, S, C) tensors
+        """
+        shape = list(x.shape)
+        shape[-2] = 1  # Set sequence dimension to 1
+        sos = torch.full(
+            shape,
+            self.config.bos_token_id,
+            dtype=x.dtype,
+            device=x.device
+        )
+        return torch.cat([sos, x], dim=-2)
+    def _add_end_of_stream(self, x: torch.LongTensor) -> torch.LongTensor:
+        """
+        Helper method to add EOS token to the end of label sequences.
+        Handles 3D (B, S, C) tensors
+        """
+        # Create shape that matches x's dimensions except for seq_len which will be 1
+        shape = list(x.shape)
+        shape[-2] = 1  # Set sequence dimension to 1
+        eos = torch.full(
+            shape,
+            self.config.eos_token_id,
+            dtype=x.dtype,
+            device=x.device
+        )
+        return torch.cat([x, eos], dim=-2)
+    def forward(
+            self,
+            input_ids: torch.LongTensor,
+            labels: Optional[torch.LongTensor] = None,
+            decoder_input_ids: Optional[torch.LongTensor] = None,
+            **kwargs
+            ):
+        # FIXME: adding BOS/EOS tokens and right-shifting should take place outside of the model, in the tokenizer
+        # (in tokenizer) adding beginning of stream tokens to input too
+        input_ids = self._add_beginning_of_stream(input_ids)
+        # (in tokenizer) adding end of stream tokens to labels (labels are None when called from generate)
+        labels = self._add_end_of_stream(labels) if labels is not None else None
+        # Prepare input for the decoder
+        if decoder_input_ids is None and labels is not None:
+            decoder_input_ids = self._shift_right(labels)
+        src_src_key_padding_time_mask = rearrange(
+            self._create_padding_mask(
+                input_ids
+            ),
+            'b s c -> (b c) s'
+        )
+        tgt_tgt_key_padding_time_mask = rearrange(
+            self._create_padding_mask(
+                decoder_input_ids
+            ),
+            'b s c -> (b c) s'
+        )
+        # Forward pass through your model
+        outputs = self.transformer(
+            src=input_ids,
+            tgt=decoder_input_ids,
+            src_src_time_mask=kwargs.get("src_src_time_mask"),
+            src_src_dimension_mask=kwargs.get("src_src_dimension_mask"),
+            src_src_key_padding_time_mask=src_src_key_padding_time_mask,
+            tgt_tgt_time_mask=kwargs.get("tgt_tgt_time_mask"),
+            tgt_tgt_dimension_mask=kwargs.get("tgt_tgt_dimension_mask"),
+            tgt_tgt_key_padding_time_mask=tgt_tgt_key_padding_time_mask,
+            tgt_src_dimension_mask=kwargs.get("tgt_src_dimension_mask")
+        )
+        loss = None
+        if labels is not None:
+            loss_fct = nn.CrossEntropyLoss(
+                ignore_index=self.config.pad_token_id
+            )
+            loss = loss_fct(
+                outputs.view(-1, self.config.vocab_size_tgt), labels.view(-1)
+            )
+        return dict(
+            loss=loss,
+            logits=outputs,
+        )
+    def generate(
+        self,
+        input_ids: torch.LongTensor,
+        attention_mask: Optional[torch.Tensor]=None,
+        max_length: Optional[int]=None,
+        temperature: float=1.0,
+        do_sample: bool=False,
+        **kwargs
+    ) -> torch.LongTensor:
+        batch_size, timesteps, channels = input_ids.shape
+        src_key_padding_mask = self._create_padding_mask(input_ids)
+        max_length = max_length or self.config.max_length or 128
+        decoder_input_ids = torch.full(
+            input_ids.shape,
+            self.config.pad_token_id,
+            dtype=torch.long,
+            device=input_ids.device
+        )
+        # Set BOS token at the start
+        decoder_input_ids[:, 0, :] = self.config.bos_token_id
+        for t in range(timesteps + max_length):
+            outputs = self.forward(
+                input_ids=input_ids,
+                decoder_input_ids=decoder_input_ids,
+                attention_mask=attention_mask
+            )
+            # Get predictions for this timestep
+            next_token_logits = outputs["logits"][:, t, :]
+            if do_sample:
+                next_token_probs = self.softmax(next_token_logits / temperature)
+                # Flatten leading dims: torch.multinomial only accepts 1D/2D input
+                next_token = torch.multinomial(
+                    next_token_probs.flatten(0, -2), num_samples=1
+                ).reshape(next_token_probs.shape[:-1])
+            else:
+                next_token = next_token_logits.argmax(dim=-1)
+            # Place the predicted token at position t
+            decoder_input_ids[:, t, :] = next_token
+            # Check if all sequences have generated EOS token
+            if (next_token == self.config.eos_token_id).all():
+                break
+            decoder_input_ids = decoder_input_ids[:, -timesteps:, :]
+        return decoder_input_ids
+# AutoConfig.register("custom_code", Seq2SeqConfig)
+# AutoConfig.register("custom_code", Seq2SeqCrossConfig)
+# AutoModel.register(Seq2SeqConfig, Seq2SeqTransformer)
+# AutoModel.register(Seq2SeqCrossConfig, Seq2SeqCrossFormer)
+# model = AutoModel.from_pretrained("fracapuano/bwaves")
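
`_create_padding_mask` returns `-inf` at padded positions and `1.0` elsewhere. If that mask is added to the attention scores before the softmax (an assumption, since the source of `EncoderDecoderTransformer` is not part of this commit), the constant `1.0` has no effect: softmax is invariant to a uniform shift of the unmasked scores, so the result matches the more conventional `0.0` mask. The pure-Python sketch below illustrates this on one row of made-up scores. `softmax` and `additive_mask` are hypothetical helpers written for the example.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list with at least one finite score."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def additive_mask(ids: List[int], pad_id: int = 0, keep_value: float = 1.0) -> List[float]:
    """-inf where the token is padding, keep_value everywhere else."""
    return [float("-inf") if tok == pad_id else keep_value for tok in ids]


ids = [5, 9, 0, 0]               # last two positions are padding
scores = [0.3, 1.2, 2.0, -0.5]   # made-up attention scores for one query

with_one = softmax([s + m for s, m in zip(scores, additive_mask(ids, keep_value=1.0))])
with_zero = softmax([s + m for s, m in zip(scores, additive_mask(ids, keep_value=0.0))])

print(with_one)
print(with_zero)
```

Padded positions receive zero attention weight in both cases, and the weights on the real tokens agree up to floating-point rounding.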

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4151898e84382b24896b3ff258f289a1d77480c7ab7743d19c7e3d3fce724a98
+size 2393519960
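
The LFS pointer gives only the checkpoint size, but combined with `"torch_dtype": "float32"` from `config.json` it allows a rough parameter estimate. The arithmetic below assumes every stored tensor is float32 (4 bytes per value) and ignores the small safetensors header, so it is a slight overestimate rather than an exact count.

```python
SIZE_BYTES = 2_393_519_960   # "size" field of the LFS pointer above
BYTES_PER_FLOAT32 = 4        # assumption: all tensors are stored as float32

approx_params = SIZE_BYTES // BYTES_PER_FLOAT32
size_gib = SIZE_BYTES / 2**30

print(f"~{approx_params / 1e6:.0f}M parameters, {size_gib:.2f} GiB on disk")
```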