Upload tokenizer
Files changed:
- README.md +199 -0
- special_tokens_map.json +7 -0
- tokenizer.json +103 -0
- tokenizer.py +315 -0
- tokenizer_config.json +81 -0
README.md
ADDED
@@ -0,0 +1,199 @@
---
library_name: transformers
tags: []
---

# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

[More Information Needed]

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[More Information Needed]

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]
special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
{
  "bos_token": "<bos>",
  "eos_token": "<eos>",
  "mask_token": "<mask>",
  "pad_token": "<pad>",
  "unk_token": "<unk>"
}
tokenizer.json
ADDED
@@ -0,0 +1,103 @@
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 0,
      "content": "<pad>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 1,
      "content": "<unk>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 2,
      "content": "<mask>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 3,
      "content": "<bos>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 4,
      "content": "<eos>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }
  ],
  "normalizer": null,
  "pre_tokenizer": {
    "type": "Split",
    "pattern": {
      "String": ""
    },
    "behavior": "Removed",
    "invert": false
  },
  "post_processor": null,
  "decoder": null,
  "model": {
    "type": "WordPiece",
    "unk_token": "<unk>",
    "continuing_subword_prefix": "##",
    "max_input_chars_per_word": 100,
    "vocab": {
      "<pad>": 0,
      "<unk>": 1,
      "<mask>": 2,
      "<bos>": 3,
      "<eos>": 4,
      "|": 5,
      "X": 6,
      "B": 7,
      "O": 8,
      "U": 9,
      "Z": 10,
      "J": 11,
      "L": 12,
      "A": 13,
      "G": 14,
      "V": 15,
      "S": 16,
      "E": 17,
      "R": 18,
      "T": 19,
      "I": 20,
      "D": 21,
      "P": 22,
      "K": 23,
      "Q": 24,
      "N": 25,
      "F": 26,
      "Y": 27,
      "M": 28,
      "H": 29,
      "W": 30,
      "C": 31
    }
  }
}
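With the Split pre-tokenizer using an empty-string pattern, every character becomes its own pre-token, and the character-level WordPiece vocabulary above maps each residue to a single id. A minimal sketch of exercising the serialized file directly with the tokenizers library (the local file path and the peptide sequence are illustrative assumptions):

from tokenizers import Tokenizer

# Load the serialized tokenizer and encode a short peptide character by character.
tok = Tokenizer.from_file("tokenizer.json")
enc = tok.encode("MKTAYIAK")
print(enc.tokens)  # ['M', 'K', 'T', 'A', 'Y', 'I', 'A', 'K']
print(enc.ids)     # [28, 23, 19, 13, 27, 20, 13, 23] per the vocab above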
tokenizer.py
ADDED
@@ -0,0 +1,315 @@
import torch
from typing import List, Optional, Union, Dict
from torch import Tensor

from itertools import compress

# HuggingFace
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast, BatchEncoding
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Split


VOCAB = {
    "<pad>": 0,
    "<unk>": 1,
    "<mask>": 2,
    "<bos>": 3,
    "<eos>": 4,
    "|": 5,
    "X": 6,
    "B": 7,
    "O": 8,
    "U": 9,
    "Z": 10,
    "J": 11,
    "L": 12,
    "A": 13,
    "G": 14,
    "V": 15,
    "S": 16,
    "E": 17,
    "R": 18,
    "T": 19,
    "I": 20,
    "D": 21,
    "P": 22,
    "K": 23,
    "Q": 24,
    "N": 25,
    "F": 26,
    "Y": 27,
    "M": 28,
    "H": 29,
    "W": 30,
    "C": 31,
}


class ProteinTokenizer(PreTrainedTokenizerFast):

    def __init__(
        self,
        pad_token_id: int,
        mask_token_id: int,
        bos_token_id: int,
        eos_token_id: int,
        unk_token_id: int,
        max_length: int,
        other_special_token_ids: Optional[List[int]] = None,
        ambiguous_token_ids: Optional[List[int]] = None,  # ids of "XBOUZJ"
        **kwargs,
    ):
        """Character-level tokenizer over the amino acids and the special tokens <unk>, <bos>, <eos>, <pad> and <mask>.

        Args:
            pad_token_id (int): <pad> token index.
            mask_token_id (int): <mask> token index.
            bos_token_id (int): <bos> token index.
            eos_token_id (int): <eos> token index.
            unk_token_id (int): <unk> token index.
            max_length (int): Maximum sequence length handled by the model.
            other_special_token_ids (Optional[List[int]]): Indices of additional special tokens.
            ambiguous_token_ids (Optional[List[int]]): Indices of ambiguous amino-acid tokens.
        """
        # Build the forward and reverse vocabulary maps from the fixed VOCAB table.
        token_to_id = dict()
        id_to_token = dict()

        for token, token_id in VOCAB.items():
            token = token.strip()
            token_to_id[token] = token_id
            id_to_token[token_id] = token

        # WordPiece model over the fixed vocabulary.
        tokenizer_object = Tokenizer(WordPiece(vocab=token_to_id, unk_token=id_to_token.get(unk_token_id)))

        # Pre-tokenize by splitting into single characters.
        tokenizer_object.pre_tokenizer = Split("", behavior="removed")

        super().__init__(
            pad_token_id=pad_token_id,
            mask_token_id=mask_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            unk_token_id=unk_token_id,
            pad_token=id_to_token.get(pad_token_id),
            bos_token=id_to_token.get(bos_token_id),
            eos_token=id_to_token.get(eos_token_id),
            unk_token=id_to_token.get(unk_token_id),
            mask_token=id_to_token.get(mask_token_id),
            max_length=max_length,
            ambiguous_token_ids=ambiguous_token_ids,
            model_max_length=max_length,
            padding_side="right",
            truncation_side="right",
            model_input_names=["input_ids", "attention_mask", "special_tokens_mask"],
            tokenizer_object=tokenizer_object,
        )

        if other_special_token_ids is not None:
            self.add_special_tokens({"additional_special_tokens": list(id_to_token.get(i) for i in other_special_token_ids)})

        self.ambiguous_token_ids = ambiguous_token_ids

        # Padding value and target dtype for each model input.
        self.key_to_padding = {"input_ids": self.pad_token_id, "attention_mask": 0, "special_tokens_mask": 1, "position_ids": 0}
        self.key_to_dtype = {
            "input_ids": torch.long,
            "attention_mask": torch.bool,
            "special_tokens_mask": torch.bool,
            "position_ids": torch.int,
        }

    def truncate(
        self,
        encoded_inputs: Dict[str, List[List[int]]],
        max_length: Optional[int] = None,
        random_truncate: bool = True,
    ) -> Dict[str, List[List[int]]]:
        """
        Truncate sequences in the encoded inputs to the specified maximum length.

        Args:
            encoded_inputs (Dict[str, List[List[int]]]): Tokenized inputs with keys like 'input_ids'.
            max_length (Optional[int]): Maximum length after truncation. Defaults to the model's max length if None.
            random_truncate (bool): If True, keep a randomly positioned window instead of the prefix.

        Returns:
            Dict[str, List[List[int]]]: Truncated tokenized inputs.
        """
        if max_length is None:
            max_length = self.model_max_length

        for i, sequence in enumerate(encoded_inputs["input_ids"]):
            if len(sequence) > max_length:
                if random_truncate:
                    offset = torch.randint(0, len(sequence) - max_length + 1, (1,)).item()
                else:
                    offset = 0
                # Apply the same window to every key so all inputs stay aligned.
                for key in encoded_inputs:
                    encoded_inputs[key][i] = encoded_inputs[key][i][offset : offset + max_length]

        # TODO: add options for other random truncation strategies.

        return encoded_inputs

    def remove_ambiguous(self, encoded_inputs: Dict[str, List[List[int]]]) -> Dict[str, List[List[int]]]:
        """
        Remove ambiguous amino acids from the input sequences.

        Args:
            encoded_inputs (Dict[str, List[List[int]]]): Tokenized inputs with keys like 'input_ids'.

        Returns:
            Dict[str, List[List[int]]]: Tokenized inputs without ambiguous amino acids.
        """
        filtered_inputs = {key: [] for key in encoded_inputs}

        for i, sequence in enumerate(encoded_inputs["input_ids"]):
            mask = [token not in self.ambiguous_token_ids for token in sequence]

            # Drop the sequence entirely if it contains only ambiguous tokens
            if not any(mask):
                continue

            # Otherwise remove only the ambiguous tokens
            for key in encoded_inputs:
                filtered_inputs[key].append(list(compress(encoded_inputs[key][i], mask)))

        return filtered_inputs

    def _pad(
        self,
        encoded_inputs: Dict[str, List[List[int]]],
        padding: Union[bool, str] = True,
        max_length: Optional[int] = None,
        pad_to_multiple_of: int = 8,
        **kwargs,
    ) -> Dict[str, List[List[int]]]:
        """
        Pad each sequence on the right, using the per-key padding values in self.key_to_padding.

        Args:
            encoded_inputs (Dict[str, List[List[int]]]): Tokenized inputs with keys like 'input_ids'.
            padding (Union[bool, str]): True or "longest" pads to the longest sequence in the batch (capped at max_length).
            max_length (Optional[int]): Target length. Defaults to the model's max length if None.
            pad_to_multiple_of (int): Round the target length up to a multiple of this value.

        Returns:
            Dict[str, List[List[int]]]: Padded tokenized inputs.
        """
        # Accept a list of per-example dicts and convert it to a dict of lists.
        if isinstance(encoded_inputs, list):
            tmp = dict()
            for key in encoded_inputs[0]:
                tmp[key] = [encoded_inputs[i][key] for i in range(len(encoded_inputs))]
            encoded_inputs = tmp

        if max_length is None:
            max_length = self.model_max_length

        sequence_lengths = [len(sequence) for sequence in encoded_inputs["input_ids"]]
        if padding == "longest" or padding is True:
            max_length = min(max_length, max(sequence_lengths))

        if pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
            max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of

        for i, seq_len in enumerate(sequence_lengths):
            if seq_len < max_length:
                for key in encoded_inputs:
                    encoded_inputs[key][i] = encoded_inputs[key][i] + [self.key_to_padding[key]] * (max_length - seq_len)

        return encoded_inputs

    def pad(
        self,
        encoded_inputs: Dict[str, List[List[int]]],
        padding: Union[bool, str] = True,
        max_length: Optional[int] = None,
        pad_to_multiple_of: int = 8,
        return_tensors: Optional[str] = "pt",
        **kwargs,
    ) -> Union[Dict[str, List[List[int]]], BatchEncoding]:
        """
        Pad the encoded inputs and optionally convert them to tensors.

        Args:
            encoded_inputs (Dict[str, List[List[int]]]): Tokenized inputs with keys like 'input_ids'.
            return_tensors (Optional[str]): Tensor type to convert to (e.g. "pt"), or None to keep lists.

        Returns:
            Union[Dict[str, List[List[int]]], BatchEncoding]: Padded tokenized inputs.
        """
        encoded_inputs = self._pad(
            encoded_inputs,
            padding,
            max_length,
            pad_to_multiple_of,
            **kwargs,
        )

        if return_tensors is not None:
            return BatchEncoding(encoded_inputs, tensor_type=return_tensors)

        return encoded_inputs

    def __call__(
        self,
        text: Union[str, List[str]],
        max_length: Optional[int] = None,
        padding: Union[bool, str] = False,
        truncation: bool = False,
        random_truncate: bool = True,
        remove_ambiguous: bool = False,
        return_special_tokens_mask: bool = True,
        return_tensors: Optional[str] = None,
        **kwargs,
    ) -> Dict[str, Tensor]:

        # For a single sequence, tokenize it as a batch of one and unwrap the result.
        if isinstance(text, str):
            encoded_inputs = self.__call__(
                [text],
                max_length,
                padding,
                truncation,
                random_truncate,
                remove_ambiguous,
                return_special_tokens_mask,
                return_tensors,
            )
            for key in encoded_inputs:
                encoded_inputs[key] = encoded_inputs[key][0]
            return encoded_inputs

        # Tokenize without truncation or padding
        encoded_inputs = super().__call__(
            text,
            padding=False,
            truncation=False,
            return_special_tokens_mask=return_special_tokens_mask,
            **kwargs,
        )

        if max_length is None:
            max_length = self.model_max_length

        # Truncate
        if truncation:
            encoded_inputs = self.truncate(
                encoded_inputs,
                max_length=max_length,
                random_truncate=random_truncate,
            )

        # NOTE: Position indexes are assigned after truncation, so they are not offset when random truncation is used.
        encoded_inputs["position_ids"] = [list(range(len(seq))) for seq in encoded_inputs["input_ids"]]

        # Remove ambiguous amino acids
        if remove_ambiguous and self.ambiguous_token_ids is not None:
            encoded_inputs = self.remove_ambiguous(encoded_inputs)

        # Add padding
        if padding:
            encoded_inputs = self._pad(encoded_inputs, padding=padding, max_length=max_length)

        if return_tensors is not None:
            return BatchEncoding(encoded_inputs, tensor_type=return_tensors)

        return encoded_inputs
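A minimal usage sketch of the class above. The token ids come from the VOCAB table; the sequences and the choice of ambiguous ids (<unk> plus X, B, O, U, Z, J) are illustrative:

from tokenizer import ProteinTokenizer

tokenizer = ProteinTokenizer(
    pad_token_id=0,
    mask_token_id=2,
    bos_token_id=3,
    eos_token_id=4,
    unk_token_id=1,
    max_length=2048,
    ambiguous_token_ids=[1, 6, 7, 8, 9, 10, 11],  # <unk>, X, B, O, U, Z, J
)

batch = tokenizer(
    ["MKTAYIAK", "GAVLIX"],
    padding=True,
    remove_ambiguous=True,
    return_tensors="pt",
)
# "X" is dropped from the second sequence, then both are padded
# to the longest length rounded up to a multiple of 8.
print(batch["input_ids"].shape)  # torch.Size([2, 8])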
tokenizer_config.json
ADDED
@@ -0,0 +1,81 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<mask>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<bos>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<eos>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "ambiguous_token_ids": [
    1,
    6,
    7,
    8,
    9,
    10,
    11
  ],
  "auto_map": {
    "AutoTokenizer": [
      "tokenizer.ProteinTokenizer",
      null
    ]
  },
  "bos_token": "<bos>",
  "bos_token_id": 3,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<eos>",
  "eos_token_id": 4,
  "extra_special_tokens": {},
  "mask_token": "<mask>",
  "mask_token_id": 2,
  "max_length": 2048,
  "model_input_names": [
    "input_ids",
    "attention_mask",
    "special_tokens_mask"
  ],
  "model_max_length": 2048,
  "pad_token": "<pad>",
  "pad_token_id": 0,
  "padding_side": "right",
  "tokenizer_class": "ProteinTokenizer",
  "truncation_side": "right",
  "unk_token": "<unk>",
  "unk_token_id": 1
}
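Because tokenizer_class names a custom class and auto_map points AutoTokenizer at tokenizer.ProteinTokenizer, loading this repo from the Hub has to opt in to executing its remote code. A minimal loading sketch; the repo id is a placeholder:

from transformers import AutoTokenizer

# trust_remote_code is required so that tokenizer.py from the repo is executed.
tokenizer = AutoTokenizer.from_pretrained(
    "user/protein-tokenizer",  # placeholder repo id
    trust_remote_code=True,
)
print(tokenizer("MKTAYIAK")["input_ids"])  # [28, 23, 19, 13, 27, 20, 13, 23]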