IlPakoZ committed on
Commit 12cd9ef · 1 Parent(s): c3c533f

Initial upload
README.md ADDED
@@ -0,0 +1,134 @@
+ ---
+ library_name: transformers
+ tags:
+ - chemistry
+ - molecular-property-prediction
+ - selfies
+ - encoder
+ license: apache-2.0
+ ---
+
+ # M5 Encoder
+
+ A SELFIES-based molecular encoder built on a T5 backbone with custom
+ distance-aware relative position encodings. Two classes are available:
+
+ | Class | Description |
+ |---|---|
+ | `M5Encoder` | Bare encoder, outputs `last_hidden_state` |
+ | `M5ModelForRegression` | Encoder + sequence-level and token-level regression heads |
+
+ The model is pretrained on a multi-task regression objective, including quantum chemistry (QC) tasks
+ from the [PubChemQC B3LYP/PM6 dataset](https://nakatamaho.riken.jp/pubchemqc.riken.jp/b3lyp_pm6_datasets.html).
+
+ ## Usage
+
+ ```python
+ from transformers import AutoConfig, AutoModel
+
+ config = AutoConfig.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
+ model = AutoModel.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
+ ```
+
+ To load `M5ModelForRegression` explicitly:
+
+ ```python
+ from transformers import AutoModelForSequenceClassification
+
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "IlPakoZ/m5-encoder", trust_remote_code=True
+ )
+ ```
+
+ ## Architecture
+
+ | Hyper-parameter | Value |
+ |---|---|
+ | `d_model` | 512 |
+ | `d_ff` | 2048 |
+ | `d_kv` | 64 |
+ | `num_layers` | 24 |
+ | `num_heads` | 12 |
+ | `vocab_size` | 1032 |
+ | `feed_forward_proj` | gated-gelu |
+ | `relative_attention_num_buckets` | 48 |
+ | `relative_attention_max_distance` | 128 |
+
+ Position biases are replaced by molecular-graph distances computed
+ with RDKit and binned with a modified version of T5's logarithmic binning algorithm, giving the model awareness of molecular topology without tying it too strictly to precise distances.
+
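For reference, the stock T5 bucketing scheme that this modification builds on can be sketched in pure Python. This is a sketch only: the exact modification applied to graph distances lives in the bundled `prepare_data`/modeling code, and `t5_bucket` is a hypothetical helper name, not part of the repository.

```python
import math

def t5_bucket(rel, num_buckets=48, max_distance=128):
    """Map a signed relative position to a bucket index, T5-style.

    Half the buckets are reserved for positive offsets; small distances get
    exact buckets, larger ones are binned logarithmically up to max_distance.
    """
    bucket = 0
    num = num_buckets // 2          # buckets per direction
    if rel > 0:
        bucket += num               # upper half for positive offsets
    rel = abs(rel)

    max_exact = num // 2            # distances below this get their own bucket
    if rel < max_exact:
        return bucket + rel

    # Logarithmic binning between max_exact and max_distance, clamped at the top
    log_val = max_exact + int(
        math.log(rel / max_exact) / math.log(max_distance / max_exact) * (num - max_exact)
    )
    return bucket + min(log_val, num - 1)
```

With the card's hyper-parameters (48 buckets, max distance 128), distances 0–11 in each direction map to exact buckets and everything beyond is compressed logarithmically.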
+ ## Tasks
+
+ Pretraining consists of up to 1085 tasks across five regression heads. Tasks are grouped by source and prediction target:
+
+ ### Group 0 — General molecular descriptors (RDKit)
+
+ | Task | Description |
+ |---|---|
+ | `MW` | Molecular weight |
+ | `TDM` | Total dipole moment |
+
+ ### Group 1 — Physicochemical properties (RDKit)
+
+ | Task | Description |
+ |---|---|
+ | `MolLogP` | Wildman-Crippen LogP estimate |
+ | `MolMR` | Wildman-Crippen molar refractivity |
+ | `TPSA` | Topological polar surface area |
+ | `FractionCSP3` | Fraction of sp³ carbons |
+
+ ### Group 2 — Frontier orbital energies (PubChemQC B3LYP/PM6)
+
+ Alpha and beta spin-orbital energies from DFT calculations:
+
+ | Task | Description |
+ |---|---|
+ | `energy_alpha_homo` | Alpha HOMO energy |
+ | `energy_alpha_gap` | Alpha HOMO–LUMO gap |
+ | `energy_alpha_lumo` | Alpha LUMO energy |
+ | `energy_beta_homo` | Beta HOMO energy |
+ | `energy_beta_gap` | Beta HOMO–LUMO gap |
+ | `energy_beta_lumo` | Beta LUMO energy |
+
+ ### Group 3 — Orbital energies (PubChemQC B3LYP/PM6)
+
+ 50 linearly sampled energies (`orbital_0` … `orbital_49`) spanning each molecule's full orbital spectrum, predicted at the sequence level.
+
+ ### Group 4 — Atom Löwdin charges (PubChemQC B3LYP/PM6)
+
+ Up to 1023 partial charges (`lowdin_0` … `lowdin_1022`), one per atom, predicted using each atom's corresponding output token embedding. This head covers well beyond the maximum number of atoms observed in the dataset.
+
+ ## Dataset
+
+ The model is pretrained on a processed version of the
+ [PubChemQC B3LYP/PM6 dataset](https://nakatamaho.riken.jp/pubchemqc.riken.jp/b3lyp_pm6_datasets.html).
+ The raw database exposes a `b3lyp_pm6` table (columns: `cid`, `state`, `data` as JSON). The data was extracted,
+ invalid SMILES removed, relevant features selected, and the result saved in compressed HDF5 format. Duplicate
+ SMILES were intentionally retained so that the model encounters molecules with multiple conformers
+ and learns a soft compromise across them. This trades auxiliary-task accuracy for richer structural
+ representations. Molecules incompatible with strict SELFIES encoding were discarded.
+
+ The processed dataset contains **82,686,706 SMILES sequences**, each paired with a full set of labels across all tasks. It is split by scaffold:
+
+ | Split | Sequences | Tokens (approx.) |
+ |---|---|---|
+ | Train | 66,149,364 | ~2.5 B (×2 with augmentation → ~5 B) |
+ | Validation | 8,268,673 | — |
+ | Test | 8,268,669 | ~0.82 B (×2 with augmentation → ~1.64 B) |
+
+ Training augmentation generates randomized SELFIES on the fly from each SMILES. Labels are normalized before training.
+
+ The HDF5 files are available for download below. They are intended to be processed with the bundled `data_processing` library into LMDB datasets optimised for fast training throughput; the resulting LMDB files are too large to distribute directly.
+
+ | Split | Download |
+ |---|---|
+ | Train | [train.h5](#) |
+ | Validation | [validation.h5](#) |
+ | Test | [test.h5](#) |
+
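The split sizes in the table add up exactly to the full dataset and correspond to a scaffold-based 80/10/10 split, which a quick check confirms:

```python
# Sanity check of the scaffold split sizes quoted in the table above.
train, validation, test = 66_149_364, 8_268_673, 8_268_669
total = train + validation + test   # should equal the full dataset size
train_fraction = train / total      # roughly an 80/10/10 split
```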
+ ## Limitations
+
+ - **Token length:** The built-in `prepare_data` helper encodes pairwise molecular-graph distances in an `int16` matrix. Consequently, molecules whose SELFIES tokenization exceeds **32,767 tokens** (`numpy.iinfo(numpy.int16).max`) are not supported. In practice, no molecule in the training dataset approaches this limit.
+ - **Conformer handling:** Duplicate SMILES representing different conformers are kept in the dataset. The model therefore predicts an implicit average over conformers rather than a geometry-specific value, which may reduce accuracy for conformation-sensitive properties.
+ - **Scope:** The model is pretrained on organic molecules present in PubChemQC. Performance on inorganic compounds, organometallics, or very large macromolecules outside the training distribution has not been evaluated.
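The token-length limitation above amounts to a simple pre-flight check; a minimal sketch, with `fits_int16_distance_matrix` as a hypothetical helper name (not part of the repository):

```python
# INT16_MAX mirrors numpy.iinfo(numpy.int16).max without requiring numpy.
INT16_MAX = 2**15 - 1  # 32767

def fits_int16_distance_matrix(selfies_tokens):
    """True if the token list is short enough for an int16-indexed distance matrix."""
    return len(selfies_tokens) <= INT16_MAX
```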
common.py ADDED
@@ -0,0 +1,33 @@
+ import torch
+ import torch.nn.functional as F
+
+ from torch import nn
+ from transformers.models.t5.configuration_t5 import T5Config
+
+
+ class M5Pooler(nn.Module):
+     def __init__(self, config: T5Config):
+         super().__init__()
+         self.pool_weights = nn.Parameter(torch.tensor([0.5, 0.5]))
+         self.pad_token_id = config.pad_token_id
+
+     def forward(self, input_ids: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
+         # Skip the leading CLS token; mask out padding positions
+         mask = (input_ids[:, 1:] != self.pad_token_id).unsqueeze(-1).float()  # [batch, seq_len, 1]
+         atoms = hidden_states[:, 1:, :]
+
+         # Zero out padding token embeddings
+         masked_embedded = atoms * mask  # [batch, seq_len, hidden_dim]
+
+         # Sum and divide by the number of real tokens
+         sum_embedded = masked_embedded.sum(dim=1)  # [batch, hidden_dim]
+         num_real_tokens = mask.sum(dim=1).clamp(min=1e-9)  # [batch, 1], avoid division by zero
+         mean_pool = sum_embedded / num_real_tokens  # [batch, hidden_dim]
+
+         cls_token = hidden_states[:, 0, :]
+
+         # Learned weights for a weighted average between the CLS token and the mean pool
+         weights = F.softmax(self.pool_weights, dim=0)
+
+         pooled = weights[0] * mean_pool + weights[1] * cls_token
+         return pooled
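The weighted combination at the end of `forward` can be illustrated in plain Python, operating on single vectors instead of batched tensors. This is a sketch of the formula only; `softmax2` and `combine` are hypothetical names, not part of the repository.

```python
import math

def softmax2(a, b):
    # Numerically stable softmax over two scalars
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)

def combine(mean_pool, cls_token, pool_weights=(0.5, 0.5)):
    # pooled = w0 * mean_pool + w1 * cls_token, with (w0, w1) = softmax(pool_weights)
    w0, w1 = softmax2(*pool_weights)
    return [w0 * m + w1 * c for m, c in zip(mean_pool, cls_token)]
```

With the initial `pool_weights` of `[0.5, 0.5]`, the softmax yields equal weights, so the pooled vector starts out as the plain average of the mean pool and the CLS embedding; training then learns how to shift the balance.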
config.json ADDED
@@ -0,0 +1,156 @@
+ {
+   "architectures": [
+     "M5ModelForRegression"
+   ],
+   "classifier_dropout": 0,
+   "d_ff": 2048,
+   "d_kv": 64,
+   "d_model": 512,
+   "dense_act_fn": "gelu_new",
+   "dropout_rate": 0,
+   "eos_token_id": 1,
+   "feed_forward_proj": "gated-gelu",
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2",
+     "3": "LABEL_3",
+     "4": "LABEL_4",
+     "5": "LABEL_5",
+     "6": "LABEL_6",
+     "7": "LABEL_7",
+     "8": "LABEL_8",
+     "9": "LABEL_9",
+     "10": "LABEL_10",
+     "11": "LABEL_11",
+     "12": "LABEL_12",
+     "13": "LABEL_13",
+     "14": "LABEL_14",
+     "15": "LABEL_15",
+     "16": "LABEL_16",
+     "17": "LABEL_17",
+     "18": "LABEL_18",
+     "19": "LABEL_19",
+     "20": "LABEL_20",
+     "21": "LABEL_21",
+     "22": "LABEL_22",
+     "23": "LABEL_23",
+     "24": "LABEL_24",
+     "25": "LABEL_25",
+     "26": "LABEL_26",
+     "27": "LABEL_27",
+     "28": "LABEL_28",
+     "29": "LABEL_29",
+     "30": "LABEL_30",
+     "31": "LABEL_31",
+     "32": "LABEL_32",
+     "33": "LABEL_33",
+     "34": "LABEL_34",
+     "35": "LABEL_35",
+     "36": "LABEL_36",
+     "37": "LABEL_37",
+     "38": "LABEL_38",
+     "39": "LABEL_39",
+     "40": "LABEL_40",
+     "41": "LABEL_41",
+     "42": "LABEL_42",
+     "43": "LABEL_43",
+     "44": "LABEL_44",
+     "45": "LABEL_45",
+     "46": "LABEL_46",
+     "47": "LABEL_47",
+     "48": "LABEL_48",
+     "49": "LABEL_49",
+     "50": "LABEL_50",
+     "51": "LABEL_51",
+     "52": "LABEL_52",
+     "53": "LABEL_53",
+     "54": "LABEL_54",
+     "55": "LABEL_55",
+     "56": "LABEL_56",
+     "57": "LABEL_57",
+     "58": "LABEL_58",
+     "59": "LABEL_59",
+     "60": "LABEL_60",
+     "61": "LABEL_61"
+   },
+   "initializer_factor": 1.0,
+   "is_encoder_decoder": false,
+   "is_gated_act": true,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_10": 10,
+     "LABEL_11": 11,
+     "LABEL_12": 12,
+     "LABEL_13": 13,
+     "LABEL_14": 14,
+     "LABEL_15": 15,
+     "LABEL_16": 16,
+     "LABEL_17": 17,
+     "LABEL_18": 18,
+     "LABEL_19": 19,
+     "LABEL_2": 2,
+     "LABEL_20": 20,
+     "LABEL_21": 21,
+     "LABEL_22": 22,
+     "LABEL_23": 23,
+     "LABEL_24": 24,
+     "LABEL_25": 25,
+     "LABEL_26": 26,
+     "LABEL_27": 27,
+     "LABEL_28": 28,
+     "LABEL_29": 29,
+     "LABEL_3": 3,
+     "LABEL_30": 30,
+     "LABEL_31": 31,
+     "LABEL_32": 32,
+     "LABEL_33": 33,
+     "LABEL_34": 34,
+     "LABEL_35": 35,
+     "LABEL_36": 36,
+     "LABEL_37": 37,
+     "LABEL_38": 38,
+     "LABEL_39": 39,
+     "LABEL_4": 4,
+     "LABEL_40": 40,
+     "LABEL_41": 41,
+     "LABEL_42": 42,
+     "LABEL_43": 43,
+     "LABEL_44": 44,
+     "LABEL_45": 45,
+     "LABEL_46": 46,
+     "LABEL_47": 47,
+     "LABEL_48": 48,
+     "LABEL_49": 49,
+     "LABEL_5": 5,
+     "LABEL_50": 50,
+     "LABEL_51": 51,
+     "LABEL_52": 52,
+     "LABEL_53": 53,
+     "LABEL_54": 54,
+     "LABEL_55": 55,
+     "LABEL_56": 56,
+     "LABEL_57": 57,
+     "LABEL_58": 58,
+     "LABEL_59": 59,
+     "LABEL_6": 6,
+     "LABEL_60": 60,
+     "LABEL_61": 61,
+     "LABEL_7": 7,
+     "LABEL_8": 8,
+     "LABEL_9": 9
+   },
+   "layer_norm_epsilon": 1e-06,
+   "model_type": "m5_model",
+   "num_decoder_layers": 24,
+   "num_heads": 12,
+   "num_layers": 24,
+   "pad_token_id": 2,
+   "relative_attention_max_distance": 96,
+   "relative_attention_num_buckets": 32,
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.3",
+   "use_cache": false,
+   "vocab_size": 1032
+ }
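The 62 `id2label` entries correspond to the sequence-level targets from Groups 0–3 of the README (2 descriptors, 4 physicochemical properties, 6 frontier-orbital energies, 50 sampled orbitals); the generic mapping can be reproduced programmatically:

```python
# Groups 0-3 of the model card: 2 descriptors + 4 physicochemical
# properties + 6 frontier-orbital energies + 50 sampled orbitals.
num_labels = 2 + 4 + 6 + 50
id2label = {str(i): f"LABEL_{i}" for i in range(num_labels)}
label2id = {v: int(k) for k, v in id2label.items()}
```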
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eac7062b1d66d0ad63fff0f71e8f86d7cc86397d1c6783ee3099bcaf1237027d
+ size 497310076
modeling_m5_encoder.py ADDED
@@ -0,0 +1,922 @@
+ import torch
+ import numpy as np
+ import math
+ import logging
+
+ from typing import Optional, Union
+ import torch.nn as nn
+ from transformers import PreTrainedModel, T5EncoderModel, T5ForConditionalGeneration, T5ForQuestionAnswering, T5ForTokenClassification, T5Model, load_tf_weights_in_t5
+ from transformers.models.t5.modeling_t5 import T5Attention, T5DenseActDense, T5DenseGatedActDense, T5ClassificationHead, T5LayerNorm, T5Stack, T5Block, T5LayerSelfAttention, T5LayerFF
+ from transformers.cache_utils import DynamicCache, EncoderDecoderCache
+ from transformers.models.t5.configuration_t5 import T5Config
+ from transformers.modeling_outputs import BaseModelOutputWithPastAndCrossAttentions, BaseModelOutput
+ from transformers.utils import DUMMY_INPUTS, DUMMY_MASK, is_torch_fx_proxy, is_torchdynamo_compiling
+ from transformers.utils.deprecation import deprecate_kwarg
+ from .common import M5Pooler
+ from .prepare_data import get_positional_encodings_and_align
+
+ logger = logging.getLogger(__name__)
+
+ class M5EncoderConfig(T5Config):
+     model_type = "m5_model"
+
+     def __init__(
+         self,
+         d_ff=2048,
+         d_kv=64,
+         d_model=512,
+         num_layers=24,
+         num_heads=12,
+         pad_token_id=2,
+         dropout_rate=0,
+         feed_forward_proj="gated-gelu",
+         classifier_dropout=0,
+         relative_attention_max_distance=128,
+         relative_attention_num_buckets=48,
+         vocab_size=1032,
+         **kwargs,
+     ):
+         super().__init__(
+             d_ff=d_ff,
+             d_kv=d_kv,
+             d_model=d_model,
+             num_layers=num_layers,
+             num_heads=num_heads,
+             pad_token_id=pad_token_id,
+             dropout_rate=dropout_rate,
+             feed_forward_proj=feed_forward_proj,
+             classifier_dropout=classifier_dropout,
+             relative_attention_max_distance=relative_attention_max_distance,
+             relative_attention_num_buckets=relative_attention_num_buckets,
+             vocab_size=vocab_size,
+             **kwargs,
+         )
+
+ class M5Encoder(PreTrainedModel):
+     config_class = M5EncoderConfig
+
+     def __init__(self, config):
+         super().__init__(config)
+         self.model = M5EncoderModel(config)
+         # self.model = torch.compile(self.model, mode="max-autotune", fullgraph=True)
+
+     def forward(self, input_ids, attention_mask=None, relative_position=None, **kwargs):
+         return self.model(input_ids=input_ids,
+                           attention_mask=attention_mask,
+                           relative_position=relative_position)
+
+     def get_positional_embeddings_and_align(self, smiles, token_regr, seed):
+         return get_positional_encodings_and_align(smiles, token_regr, seed)
+
+ class M5EncoderModel(T5EncoderModel):
+     def __init__(self, config: T5Config):
+         super().__init__(config)
+
+         encoder_config = config
+         encoder_config.use_cache = False
+         encoder_config.is_encoder_decoder = False
+         self.encoder = M5Stack(encoder_config, self.shared)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.FloatTensor] = None,
+         head_mask: Optional[torch.FloatTensor] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         relative_position: Optional[torch.LongTensor] = None,
+     ) -> Union[tuple[torch.FloatTensor], BaseModelOutput]:
+         r"""
+         input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+             Indices of input sequence tokens in the vocabulary. T5 is a model with relative position embeddings so you
+             should be able to pad the inputs on both the right and the left.
+
+             Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+             [`PreTrainedTokenizer.__call__`] for detail.
+
+             To know more on how to prepare `input_ids` for pretraining take a look at [T5 Training](./t5#training).
+
+         Example:
+
+         ```python
+         >>> from transformers import AutoTokenizer, T5EncoderModel
+
+         >>> tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
+         >>> model = T5EncoderModel.from_pretrained("google-t5/t5-small")
+         >>> input_ids = tokenizer(
+         ...     "Studies have been shown that owning a dog is good for you", return_tensors="pt"
+         ... ).input_ids  # Batch size 1
+         >>> outputs = model(input_ids=input_ids)
+         >>> last_hidden_states = outputs.last_hidden_state
+         ```"""
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         encoder_outputs = self.encoder(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             inputs_embeds=inputs_embeds,
+             head_mask=head_mask,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             relative_position=relative_position,
+         )
+
+         return encoder_outputs
+
+ class M5Stack(T5Stack):
+     def __init__(self, config, embed_tokens=None):
+         super().__init__(config, embed_tokens)
+
+         self.block = nn.ModuleList(
+             [M5Block(config, has_relative_attention_bias=bool(i == 0), layer_idx=i) for i in range(config.num_layers)]
+         )
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids=None,
+         attention_mask=None,
+         encoder_hidden_states=None,
+         encoder_attention_mask=None,
+         inputs_embeds=None,
+         head_mask=None,
+         cross_attn_head_mask=None,
+         past_key_values=None,
+         use_cache=None,
+         output_attentions=None,
+         output_hidden_states=None,
+         return_dict=None,
+         cache_position=None,
+         relative_position=None,
+     ):
+         # Model parallel
+         if self.model_parallel:
+             torch.cuda.set_device(self.first_device)
+             self.embed_tokens = self.embed_tokens.to(self.first_device)
+         use_cache = use_cache if use_cache is not None else self.config.use_cache
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         if input_ids is not None and inputs_embeds is not None:
+             err_msg_prefix = "decoder_" if self.is_decoder else ""
+             raise ValueError(
+                 f"You cannot specify both {err_msg_prefix}input_ids and {err_msg_prefix}inputs_embeds at the same time"
+             )
+         elif input_ids is not None:
+             input_shape = input_ids.size()
+             input_ids = input_ids.view(-1, input_shape[-1])
+         elif inputs_embeds is not None:
+             input_shape = inputs_embeds.size()[:-1]
+         else:
+             err_msg_prefix = "decoder_" if self.is_decoder else ""
+             raise ValueError(f"You have to specify either {err_msg_prefix}input_ids or {err_msg_prefix}inputs_embeds")
+
+         if self.gradient_checkpointing and self.training:
+             if use_cache:
+                 logger.warning_once(
+                     "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                 )
+                 use_cache = False
+
+         if inputs_embeds is None:
+             if self.embed_tokens is None:
+                 raise ValueError("You have to initialize the model with valid token embeddings")
+             inputs_embeds = self.embed_tokens(input_ids)
+
+         batch_size, seq_length = input_shape
+
+         if use_cache is True:
+             if not self.is_decoder:
+                 raise ValueError(f"`use_cache` can only be set to `True` if {self} is used as a decoder")
+
+         if self.is_decoder:
+             if use_cache and past_key_values is None:
+                 if self.config.is_encoder_decoder:
+                     past_key_values = EncoderDecoderCache(
+                         DynamicCache(config=self.config), DynamicCache(config=self.config)
+                     )
+                 else:
+                     past_key_values = DynamicCache(config=self.config)
+         elif not self.is_decoder:
+             # do not pass cache object down the line for encoder stack
+             # it messes indexing later in decoder-stack because cache object is modified in-place
+             past_key_values = None
+
+         past_key_values_length = past_key_values.get_seq_length() if past_key_values is not None else 0
+         if cache_position is None:
+             cache_position = torch.arange(
+                 past_key_values_length, past_key_values_length + seq_length, device=inputs_embeds.device
+             )
+
+         if attention_mask is None and not is_torchdynamo_compiling():
+             # required mask seq length can be calculated via length of past cache
+             mask_seq_length = past_key_values_length + seq_length
+             attention_mask = torch.ones(batch_size, mask_seq_length, device=inputs_embeds.device)
+
+         if self.config.is_decoder:
+             causal_mask = self._update_causal_mask(
+                 attention_mask,
+                 inputs_embeds,
+                 cache_position,
+                 past_key_values.self_attention_cache
+                 if isinstance(past_key_values, EncoderDecoderCache)
+                 else past_key_values,
+                 output_attentions,
+             )
+         elif attention_mask is not None:
+             causal_mask = attention_mask[:, None, None, :]
+             causal_mask = causal_mask.to(dtype=inputs_embeds.dtype)
+             causal_mask = (1.0 - causal_mask) * torch.finfo(inputs_embeds.dtype).min
+         else:
+             causal_mask = None
+
+         # If a 2D or 3D attention mask is provided for the cross-attention
+         # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
+         if self.is_decoder and encoder_hidden_states is not None:
+             encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
+             encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
+             if encoder_attention_mask is None:
+                 encoder_attention_mask = torch.ones(
+                     encoder_hidden_shape, device=inputs_embeds.device, dtype=torch.long
+                 )
+             encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
+         else:
+             encoder_extended_attention_mask = None
+
+         # Prepare head mask if needed
+         head_mask = self.get_head_mask(head_mask, self.config.num_layers)
+         cross_attn_head_mask = self.get_head_mask(cross_attn_head_mask, self.config.num_layers)
+         all_hidden_states = () if output_hidden_states else None
+         all_attentions = () if output_attentions else None
+         all_cross_attentions = () if (output_attentions and self.is_decoder) else None
+         position_bias = None
+         encoder_decoder_position_bias = None
+
+         hidden_states = self.dropout(inputs_embeds)
+
+         for i, layer_module in enumerate(self.block):
+             layer_head_mask = head_mask[i]
+             cross_attn_layer_head_mask = cross_attn_head_mask[i]
+             # Model parallel
+             if self.model_parallel:
+                 torch.cuda.set_device(hidden_states.device)
+                 # Ensure that attention_mask is always on the same device as hidden_states
+                 if causal_mask is not None:
+                     causal_mask = causal_mask.to(hidden_states.device)
+                 if position_bias is not None:
+                     position_bias = position_bias.to(hidden_states.device)
+                 if encoder_hidden_states is not None:
+                     encoder_hidden_states = encoder_hidden_states.to(hidden_states.device)
+                 if encoder_extended_attention_mask is not None:
+                     encoder_extended_attention_mask = encoder_extended_attention_mask.to(hidden_states.device)
+                 if encoder_decoder_position_bias is not None:
+                     encoder_decoder_position_bias = encoder_decoder_position_bias.to(hidden_states.device)
+                 if layer_head_mask is not None:
+                     layer_head_mask = layer_head_mask.to(hidden_states.device)
+                 if cross_attn_layer_head_mask is not None:
+                     cross_attn_layer_head_mask = cross_attn_layer_head_mask.to(hidden_states.device)
+             if output_hidden_states:
+                 all_hidden_states = all_hidden_states + (hidden_states,)
+
+             layer_outputs = layer_module(
+                 hidden_states,
+                 causal_mask,
+                 position_bias,
+                 encoder_hidden_states,
+                 encoder_extended_attention_mask,
+                 encoder_decoder_position_bias,  # as a positional argument for gradient checkpointing
+                 layer_head_mask=layer_head_mask,
+                 cross_attn_layer_head_mask=cross_attn_layer_head_mask,
+                 past_key_values=past_key_values,
+                 use_cache=use_cache,
+                 output_attentions=output_attentions,
+                 return_dict=return_dict,
+                 cache_position=cache_position,
+                 relative_position=relative_position,
+             )
+
+             hidden_states = layer_outputs[0]
+
+             # We share the position biases between the layers - the first layer stores them
+             # layer_outputs = hidden-states, key-value-states, (self-attention position bias), (self-attention weights),
+             # (cross-attention position bias), (cross-attention weights)
+             position_bias = layer_outputs[1]
+             if self.is_decoder and encoder_hidden_states is not None:
+                 encoder_decoder_position_bias = layer_outputs[3 if output_attentions else 2]
+
+             if output_attentions:
+                 all_attentions = all_attentions + (layer_outputs[2],)
+                 if self.is_decoder:
+                     all_cross_attentions = all_cross_attentions + (layer_outputs[4],)
+
+             # Model Parallel: If it's the last layer for that device, put things on the next device
+             if self.model_parallel:
+                 for k, v in self.device_map.items():
+                     if i == v[-1] and "cuda:" + str(k) != self.last_device:
+                         hidden_states = hidden_states.to("cuda:" + str(k + 1))
+
+         hidden_states = self.final_layer_norm(hidden_states)
+         hidden_states = self.dropout(hidden_states)
+
+         # Add last layer
+         if output_hidden_states:
+             all_hidden_states = all_hidden_states + (hidden_states,)
+
+         if not return_dict:
+             return tuple(
+                 v
+                 for v in [
+                     hidden_states,
+                     past_key_values,
+                     all_hidden_states,
+                     all_attentions,
+                     all_cross_attentions,
+                 ]
+                 if v is not None
+             )
+         return BaseModelOutputWithPastAndCrossAttentions(
+             last_hidden_state=hidden_states,
+             past_key_values=past_key_values,
+             hidden_states=all_hidden_states,
+             attentions=all_attentions,
+             cross_attentions=all_cross_attentions,
+         )
+
+ class M5Block(T5Block):
+     def __init__(self, config, has_relative_attention_bias=False, layer_idx: Optional[int] = None):
+         super().__init__(config, has_relative_attention_bias, layer_idx)
+         self.layer = nn.ModuleList()
+         self.layer.append(
+             M5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias, layer_idx=layer_idx)
+         )
+         if self.is_decoder:
+             self.layer.append(M5LayerSelfAttention(config, layer_idx=layer_idx))
+         self.layer.append(T5LayerFF(config))
+
+     @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
+     def forward(
+         self,
+         hidden_states,
+         attention_mask=None,
+         position_bias=None,
+         encoder_hidden_states=None,
+         encoder_attention_mask=None,
+         encoder_decoder_position_bias=None,
+         layer_head_mask=None,
+         cross_attn_layer_head_mask=None,
+         past_key_values=None,
+         use_cache=False,
+         output_attentions=False,
+         return_dict=True,
+         cache_position=None,
+         relative_position=None,
+     ):
+         self_attention_outputs = self.layer[0](
+             hidden_states,
+             attention_mask=attention_mask,
+             position_bias=position_bias,
+             layer_head_mask=layer_head_mask,
+             past_key_values=past_key_values,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             cache_position=cache_position,
+             relative_position=relative_position,
+         )
+         hidden_states = self_attention_outputs[0]
+         attention_outputs = self_attention_outputs[1:]  # Keep self-attention outputs and relative position weights
+
+         # clamp inf values to enable fp16 training
+         if hidden_states.dtype == torch.float16:
+             clamp_value = torch.where(
+                 torch.isinf(hidden_states).any(),
+                 torch.finfo(hidden_states.dtype).max - 1000,
+                 torch.finfo(hidden_states.dtype).max,
+             )
+             hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
+
+         do_cross_attention = self.is_decoder and encoder_hidden_states is not None
+         if do_cross_attention:
+             cross_attention_outputs = self.layer[1](
+                 hidden_states,
+                 key_value_states=encoder_hidden_states,
+                 attention_mask=encoder_attention_mask,
+                 position_bias=encoder_decoder_position_bias,
+                 layer_head_mask=cross_attn_layer_head_mask,
+                 past_key_values=past_key_values,
+                 query_length=cache_position[-1] + 1,
+                 use_cache=use_cache,
+                 output_attentions=output_attentions,
+             )
+             hidden_states = cross_attention_outputs[0]
+
+             # clamp inf values to enable fp16 training
+             if hidden_states.dtype == torch.float16:
+                 clamp_value = torch.where(
+                     torch.isinf(hidden_states).any(),
+                     torch.finfo(hidden_states.dtype).max - 1000,
+                     torch.finfo(hidden_states.dtype).max,
+                 )
+                 hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
+
+             # Keep cross-attention outputs and relative position weights
+             attention_outputs = attention_outputs + cross_attention_outputs[1:]
+
+         # Apply Feed Forward layer
+         hidden_states = self.layer[-1](hidden_states)
+
+         # clamp inf values to enable fp16 training
+         if hidden_states.dtype == torch.float16:
+             clamp_value = torch.where(
+                 torch.isinf(hidden_states).any(),
+                 torch.finfo(hidden_states.dtype).max - 1000,
+                 torch.finfo(hidden_states.dtype).max,
+             )
+             hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
+
+         outputs = (hidden_states,)
+
+         return (
+             outputs + attention_outputs
+         )  # hidden-states, (self-attention position bias), (self-attention weights), (cross-attention position bias), (cross-attention weights)
+
+ class M5LayerSelfAttention(T5LayerSelfAttention):
+     def __init__(self, config, has_relative_attention_bias=False, layer_idx: Optional[int] = None):
+         super().__init__(config, has_relative_attention_bias, layer_idx)
+         self.SelfAttention = M5Attention(config, has_relative_attention_bias=has_relative_attention_bias, layer_idx=layer_idx)
+
+     @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
+     def forward(
+         self,
+         hidden_states,
+         attention_mask=None,
+         position_bias=None,
+         layer_head_mask=None,
+         past_key_values=None,
+         use_cache=False,
+         output_attentions=False,
+         cache_position=None,
+         relative_position=None,
+     ):
+         normed_hidden_states = self.layer_norm(hidden_states)
+         attention_output = self.SelfAttention(
+             normed_hidden_states,
+             mask=attention_mask,
+             position_bias=position_bias,
+             layer_head_mask=layer_head_mask,
+             past_key_values=past_key_values,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             cache_position=cache_position,
+             relative_position=relative_position,
+         )
+         hidden_states = hidden_states + self.dropout(attention_output[0])
+         outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output them
+         return outputs
+
+ class M5Attention(T5Attention):
+     """
+     def __init__(
+         self,
+         config: T5Config,
+         has_relative_attention_bias=False,
+         layer_idx: Optional[int] = None,
+     ):
+         super().__init__(config, has_relative_attention_bias, layer_idx)
+
+         if self.has_relative_attention_bias:
+             self.relative_attention_bias = nn.Embedding(self.relative_attention_num_buckets, self.n_heads)
+         else:
+             self.elaborate = nn.Linear()
+     """
+
+     @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
+     def forward(
+         self,
+         hidden_states,
+         mask=None,
+         key_value_states=None,
511
+ position_bias=None,
512
+ past_key_values=None,
513
+ layer_head_mask=None,
514
+ query_length=None,
515
+ use_cache=False,
516
+ output_attentions=False,
517
+ cache_position=None,
518
+ relative_position=None
519
+
520
+ ):
521
+ """
522
+ Self-attention (if key_value_states is None) or attention over source sentence (provided by key_value_states).
523
+ """
524
+ # Input is (batch_size, seq_length, dim)
525
+ # Mask is (batch_size, 1, 1, key_length) (non-causal encoder) or (batch_size, 1, seq_length, key_length) (causal decoder)
526
+ batch_size, seq_length = hidden_states.shape[:2]
527
+
528
+ # if key_value_states are provided this layer is used as a cross-attention layer for the decoder
529
+ is_cross_attention = key_value_states is not None
530
+
531
+ query_states = self.q(hidden_states)
532
+ query_states = query_states.view(batch_size, -1, self.n_heads, self.key_value_proj_dim).transpose(1, 2)
533
+
534
+ # Check if an encoder-decoder model is being used; otherwise we'll get `DynamicCache`
535
+ is_updated = False
536
+ if isinstance(past_key_values, EncoderDecoderCache):
537
+ is_updated = past_key_values.is_updated.get(self.layer_idx)
538
+ if is_cross_attention:
539
+ # after the first generated id, we can subsequently re-use all key/value_states from cache
540
+ curr_past_key_value = past_key_values.cross_attention_cache
541
+ else:
542
+ curr_past_key_value = past_key_values.self_attention_cache
543
+ else:
544
+ curr_past_key_value = past_key_values
545
+
546
+ current_states = key_value_states if is_cross_attention else hidden_states
547
+ if is_cross_attention and past_key_values is not None and is_updated:
548
+ # reuse k,v, cross_attentions
549
+ key_states = curr_past_key_value.layers[self.layer_idx].keys
550
+ value_states = curr_past_key_value.layers[self.layer_idx].values
551
+ else:
552
+ key_states = self.k(current_states)
553
+ value_states = self.v(current_states)
554
+ key_states = key_states.view(batch_size, -1, self.n_heads, self.key_value_proj_dim).transpose(1, 2)
555
+ value_states = value_states.view(batch_size, -1, self.n_heads, self.key_value_proj_dim).transpose(1, 2)
556
+
557
+ if past_key_values is not None:
558
+ # save all key/value_states to cache to be re-used for fast auto-regressive generation
559
+ cache_position = cache_position if not is_cross_attention else None
560
+ key_states, value_states = curr_past_key_value.update(
561
+ key_states, value_states, self.layer_idx, {"cache_position": cache_position}
562
+ )
563
+ # set flag that curr layer for cross-attn is already updated so we can re-use in subsequent calls
564
+ if is_cross_attention and isinstance(past_key_values, EncoderDecoderCache):
565
+ past_key_values.is_updated[self.layer_idx] = True
566
+
567
+ # compute scores, equivalent of torch.einsum("bnqd,bnkd->bnqk", query_states, key_states), compatible with onnx op>9
568
+ scores = torch.matmul(query_states, key_states.transpose(3, 2))
569
+
570
+ if position_bias is None:
571
+ key_length = key_states.shape[-2]
572
+ # cache position is 0-indexed so we add 1 to get the real length of queries (aka with past)
573
+ real_seq_length = query_length if query_length is not None else cache_position[-1] + 1
574
+ if not self.has_relative_attention_bias:
575
+ position_bias = torch.zeros(
576
+ (1, self.n_heads, seq_length, key_length), device=scores.device, dtype=scores.dtype
577
+ )
578
+ if self.gradient_checkpointing and self.training:
579
+ position_bias.requires_grad = True
580
+ else:
581
+ position_bias = self.compute_bias(
582
+ real_seq_length, key_length, device=scores.device, cache_position=cache_position, relative_position=relative_position
583
+ )
584
+ position_bias = position_bias[:, :, -seq_length:, :]
585
+
586
+ if mask is not None:
587
+ causal_mask = mask[:, :, :, : key_states.shape[-2]]
588
+ position_bias = position_bias + causal_mask
589
+
590
+ if self.pruned_heads:
591
+ mask = torch.ones(position_bias.shape[1])
592
+ mask[list(self.pruned_heads)] = 0
593
+ position_bias_masked = position_bias[:, mask.bool()]
594
+ else:
595
+ position_bias_masked = position_bias
596
+
597
+ scores += position_bias_masked
598
+
599
+ # (batch_size, n_heads, seq_length, key_length)
600
+ attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(scores)
601
+ attn_weights = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
602
+
603
+ # Mask heads if we want to
604
+ if layer_head_mask is not None:
605
+ attn_weights = attn_weights * layer_head_mask
606
+
607
+ attn_output = torch.matmul(attn_weights, value_states)
608
+
609
+ attn_output = attn_output.transpose(1, 2).contiguous()
610
+ attn_output = attn_output.view(batch_size, -1, self.inner_dim)
611
+ attn_output = self.o(attn_output)
612
+
613
+ outputs = (attn_output, position_bias)
614
+
615
+ if output_attentions:
616
+ outputs = outputs + (attn_weights,)
617
+ return outputs
618
+
619
+ @staticmethod
620
+ def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
621
+ """
622
+ Adapted from Mesh Tensorflow:
623
+ https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593
624
+
625
+ Translate relative position to a bucket number for relative attention. The relative position is defined as
626
+ memory_position - query_position, i.e. the distance in tokens from the attending position to the attended-to
627
+ position. If bidirectional=False, then positive relative positions are invalid. We use smaller buckets for
628
+ small absolute relative_position and larger buckets for larger absolute relative_positions. All relative
629
+ positions >=max_distance map to the same bucket. All relative positions <=-max_distance map to the same bucket.
630
+ This should allow for more graceful generalization to longer sequences than the model has been trained on
631
+
632
+ Args:
633
+ relative_position: an int32 Tensor
634
+ bidirectional: a boolean - whether the attention is bidirectional
635
+ num_buckets: an integer
636
+ max_distance: an integer
637
+
638
+ Returns:
639
+ a Tensor with the same shape as relative_position, containing int32 values in the range [0, num_buckets)
640
+ """
641
+ # Make all positions non-negative, effectively using the bidirectional=False path,
643
+ # but with positive distances instead of negative ones
643
+ relative_position = relative_position + 1
644
+ relative_position = torch.max(relative_position, torch.zeros_like(relative_position))
645
+
646
+ # half of the buckets are for exact increments in positions
647
+ max_exact = num_buckets // 2
648
+ is_small = relative_position < max_exact
649
+
650
+ num_log_buckets = num_buckets - max_exact - 1
651
+
652
+ # The other half of the buckets are for logarithmically bigger bins in positions up to max_distance
653
+ relative_position_if_large = max_exact + (
654
+ torch.log(relative_position.float() / max_exact)
655
+ / math.log(max_distance / max_exact)
656
+ * (num_buckets - num_log_buckets)
657
+ ).to(torch.long)
658
+
659
+ relative_position_if_large = torch.min(
660
+ relative_position_if_large, torch.full_like(relative_position_if_large, num_buckets - 2)
661
+ )
662
+
663
+ relative_buckets = torch.where(is_small, relative_position, relative_position_if_large)
664
+
665
+ # The +1 is because we added 1 at the beginning (relative_position + 1).
666
+ # This special mask is the equivalent of +inf distance and is assigned
667
+ # to the last bucket.
668
+ special_mask = (relative_position == np.iinfo(np.int16).max+1)
669
+ relative_buckets[special_mask] = num_buckets-1
670
+
671
+ return relative_buckets
672
+
673
+ def compute_bias(self, query_length, key_length, device=None, cache_position=None, relative_position=None):
674
+ """Compute binned relative position bias"""
675
+ if device is None:
676
+ device = self.relative_attention_bias.weight.device
677
+
678
+ if relative_position is None:
679
+ if cache_position is None:
680
+ context_position = torch.arange(query_length, dtype=torch.long, device=device)[:, None]
681
+ else:
682
+ context_position = cache_position[:, None].to(device)
683
+ memory_position = torch.arange(key_length, dtype=torch.long, device=device)[None, :]
684
+ relative_position = memory_position - context_position # shape (query_length, key_length)
685
+
686
+ # Skipping the relative_position calculation breaks cache_position, but that's fine since the positions are precomputed anyway
687
+ relative_position_bucket = self._relative_position_bucket(
688
+ relative_position, # shape (query_length, key_length)
689
+ bidirectional=(not self.is_decoder),
690
+ num_buckets=self.relative_attention_num_buckets,
691
+ max_distance=self.relative_attention_max_distance,
692
+ )
693
+
694
+ values = self.relative_attention_bias(relative_position_bucket) # shape (batch_size, query_length, key_length, num_heads)
695
+ values = values.permute([0, 3, 1, 2]) # shape (batch_size, num_heads, query_length, key_length)
696
+ return values
697
+
698
+ # RegressionHead for tasks from groups 0, 1, 2 and 3
699
+ # Used as regression head and classification head for pretraining
700
+ class M5RegressionHead(nn.Module):
701
+ def __init__(self, config: T5Config):
702
+ super().__init__()
703
+
704
+ self.pooler = M5Pooler(config)
705
+ self.transform = nn.Linear(config.d_model, config.d_model)
706
+ if config.is_gated_act:
707
+ self.DenseReluDense = T5DenseGatedActDense(config)
708
+ else:
709
+ self.DenseReluDense = T5DenseActDense(config)
710
+ self.out_proj = nn.Linear(config.d_model, config.num_labels)
711
+
712
+ def forward(self, input_ids: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
713
+ pooled = self.pooler(input_ids, hidden_states)
714
+
715
+ pooled = self.transform(pooled)
716
+ pooled = self.DenseReluDense(pooled)
717
+ output = self.out_proj(pooled)
718
+
719
+ return output
720
+
721
+ # TokenRegressionHead for tasks from group 4
722
+ class M5TokenRegressionHead(nn.Module):
723
+ def __init__(self, config: T5Config):
724
+ super().__init__()
725
+
726
+ # Input dimension is multiplied by 2 to account for the concatenated [CLS] embedding.
727
+ self.transform1 = nn.Linear(config.d_model*2, config.d_model)
728
+ if config.is_gated_act:
729
+ self.DenseReluDense1 = T5DenseGatedActDense(config)
730
+ else:
731
+ self.DenseReluDense1 = T5DenseActDense(config)
732
+
733
+ self.transform2 = nn.Linear(config.d_model, config.d_model)
734
+
735
+ if config.is_gated_act:
736
+ self.DenseReluDense2 = T5DenseGatedActDense(config)
737
+ else:
738
+ self.DenseReluDense2 = T5DenseActDense(config)
739
+
740
+ # The output has shape (batch_size, context_length, 1) because each token has a label
741
+
742
+ self.output = nn.Linear(config.d_model, 1)
743
+ self.config = config
744
+
745
+ def forward(self, token_hidden_states: torch.Tensor) -> torch.Tensor:
746
+ # Concatenate CLS token hidden states to each token hidden state
747
+
748
749
+ cls_hidden = token_hidden_states[:, 0, :]
750
+ token_hidden = token_hidden_states[:, 1:, :]
751
+
752
+ cls_repeated = cls_hidden.unsqueeze(1).expand(-1, token_hidden.size(1), -1)
753
+ augmented_hidden = torch.cat([token_hidden, cls_repeated], dim=-1).contiguous()
754
+
755
+ transformed = self.transform1(augmented_hidden)
756
+ transformed = self.DenseReluDense1(transformed)
757
+ transformed = self.transform2(transformed)
758
+ transformed = self.DenseReluDense2(transformed)
759
+
760
+ output = self.output(transformed)
761
+ output = output.squeeze(-1)
762
+ # (batch_size, num_labels)
763
+ # NOTE: num_labels = seq_length - 1 (one label per non-[CLS] token)
764
+ return output
765
+
766
+
767
+ class M5PreTrainedModel(PreTrainedModel):
768
+ """
769
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
770
+ models.
771
+ """
772
+
773
+ config_class = T5Config
774
+ load_tf_weights = load_tf_weights_in_t5
775
+ base_model_prefix = "transformer"
776
+ is_parallelizable = True
777
+ supports_gradient_checkpointing = True
778
+ _supports_quantized_cache = False # encoder-decoder models don't support this yet
779
+ _supports_static_cache = True
780
+ _supports_cache_class = True
781
+ _no_split_modules = ["T5Block"]
782
+ _keep_in_fp32_modules = ["wo"]
783
+
784
+ @property
785
+ def dummy_inputs(self):
786
+ input_ids = torch.tensor(DUMMY_INPUTS)
787
+ input_mask = torch.tensor(DUMMY_MASK)
788
+ dummy_inputs = {
789
+ "decoder_input_ids": input_ids,
790
+ "input_ids": input_ids,
791
+ "decoder_attention_mask": input_mask,
792
+ }
793
+ return dummy_inputs
794
+
795
+ def _init_weights(self, module):
796
+ """Initialize the weights"""
797
+ factor = self.config.initializer_factor # Used for testing weights initialization
798
+ if isinstance(module, T5LayerNorm):
799
+ module.weight.data.fill_(factor * 1.0)
800
+ elif isinstance(
801
+ module,
802
+ (T5Model, T5ForConditionalGeneration, T5EncoderModel, T5ForQuestionAnswering),
803
+ ):
804
+ # Mesh TensorFlow embeddings initialization
805
+ # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L1624
806
+ module.shared.weight.data.normal_(mean=0.0, std=factor * 1.0)
807
+ if hasattr(module, "lm_head") and not self.config.tie_word_embeddings:
808
+ module.lm_head.weight.data.normal_(mean=0.0, std=factor * 1.0)
809
+ if hasattr(module, "qa_outputs"):
810
+ module.qa_outputs.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))
811
+ module.qa_outputs.bias.data.zero_()
812
+ elif isinstance(module, T5ForTokenClassification):
813
+ if hasattr(module, "classifier"):
814
+ module.classifier.weight.data.normal_(mean=0.0, std=factor * 1.0)
815
+ module.classifier.bias.data.zero_()
816
+ elif isinstance(module, T5ClassificationHead):
817
+ module.dense.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))
818
+ if hasattr(module.dense, "bias") and module.dense.bias is not None:
819
+ module.dense.bias.data.zero_()
820
+ module.out_proj.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))
821
+ if hasattr(module.out_proj, "bias") and module.out_proj.bias is not None:
822
+ module.out_proj.bias.data.zero_()
823
+ elif isinstance(module, T5DenseActDense):
824
+ # Mesh TensorFlow FF initialization
825
+ # See https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/transformer_layers.py#L56
826
+ # and https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L89
827
+ module.wi.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))
828
+ if hasattr(module.wi, "bias") and module.wi.bias is not None:
829
+ module.wi.bias.data.zero_()
830
+ module.wo.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_ff) ** -0.5))
831
+ if hasattr(module.wo, "bias") and module.wo.bias is not None:
832
+ module.wo.bias.data.zero_()
833
+ elif isinstance(module, T5DenseGatedActDense):
834
+ module.wi_0.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))
835
+ if hasattr(module.wi_0, "bias") and module.wi_0.bias is not None:
836
+ module.wi_0.bias.data.zero_()
837
+ module.wi_1.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))
838
+ if hasattr(module.wi_1, "bias") and module.wi_1.bias is not None:
839
+ module.wi_1.bias.data.zero_()
840
+ module.wo.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_ff) ** -0.5))
841
+ if hasattr(module.wo, "bias") and module.wo.bias is not None:
842
+ module.wo.bias.data.zero_()
843
+ elif isinstance(module, M5RegressionHead):
844
+ module.transform.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))
845
+ if hasattr(module.transform, "bias") and module.transform.bias is not None:
846
+ module.transform.bias.data.zero_()
847
+ module.out_proj.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))
848
+ if hasattr(module.out_proj, "bias") and module.out_proj.bias is not None:
849
+ module.out_proj.bias.data.zero_()
850
+ elif isinstance(module, M5TokenRegressionHead):
851
+ module.transform1.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model*2) ** -0.5))
852
+ module.transform1.bias.data.zero_()
853
+ module.transform2.weight.data.normal_(mean=0.0, std=factor * ((self.config.d_model) ** -0.5))
854
+ module.transform2.bias.data.zero_()
855
+ module.output.weight.data.normal_(mean=0.0, std=factor * ((37.84) ** -0.5))
856
+ module.output.bias.data.zero_()
857
+
858
+ elif isinstance(module, T5Attention):
859
+ # Mesh TensorFlow attention initialization to avoid scaling before softmax
860
+ # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/attention.py#L136
861
+ d_model = self.config.d_model
862
+ key_value_proj_dim = self.config.d_kv
863
+ n_heads = self.config.num_heads
864
+ module.q.weight.data.normal_(mean=0.0, std=factor * ((d_model * key_value_proj_dim) ** -0.5))
865
+ module.k.weight.data.normal_(mean=0.0, std=factor * (d_model**-0.5))
866
+ module.v.weight.data.normal_(mean=0.0, std=factor * (d_model**-0.5))
867
+ module.o.weight.data.normal_(mean=0.0, std=factor * ((n_heads * key_value_proj_dim) ** -0.5))
868
+ if module.has_relative_attention_bias:
869
+ module.relative_attention_bias.weight.data.normal_(mean=0.0, std=factor * ((d_model) ** -0.5))
870
+
871
+ def _shift_right(self, input_ids):
872
+ decoder_start_token_id = self.config.decoder_start_token_id
873
+ pad_token_id = self.config.pad_token_id
874
+
875
+ if decoder_start_token_id is None:
876
+ raise ValueError(
877
+ "self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. "
878
+ "See T5 docs for more information."
879
+ )
880
+
881
+ # shift inputs to the right
882
+ if is_torch_fx_proxy(input_ids):
883
+ # Item assignment is not supported natively for proxies.
884
+ shifted_input_ids = torch.full(input_ids.shape[:-1] + (1,), decoder_start_token_id)
885
+ shifted_input_ids = torch.cat([shifted_input_ids, input_ids[..., :-1]], dim=-1)
886
+ else:
887
+ shifted_input_ids = input_ids.new_zeros(input_ids.shape)
888
+ shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
889
+ shifted_input_ids[..., 0] = decoder_start_token_id
890
+
891
+ if pad_token_id is None:
892
+ raise ValueError("self.model.config.pad_token_id has to be defined.")
893
+ # replace possible -100 values in labels by `pad_token_id`
894
+ shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)
895
+
896
+ return shifted_input_ids
897
+
898
+
899
+ class M5ModelForRegression(M5PreTrainedModel):
900
+ config_class = M5EncoderConfig
901
+ model_type = "m5_model"
902
+
903
+ def __init__(
904
+ self,
905
+ config: T5Config):
906
+
907
+ super().__init__(config)
908
+ self.encoder: M5Encoder = M5Encoder(config)
909
+ self.token_reg_head: M5TokenRegressionHead = M5TokenRegressionHead(config)
910
+ self.reg_head: M5RegressionHead = M5RegressionHead(config)
911
+
912
+ self.init_weights()
913
+
914
+ def forward(self, input_ids, attention_mask=None, relative_position=None, **kwargs):
915
+ output = self.encoder(input_ids, attention_mask, relative_position=relative_position, **kwargs)
916
+ hidden_states = output.last_hidden_state
917
+
918
+ tokreg_head = self.token_reg_head(hidden_states)
919
+ reg_head = self.reg_head(input_ids, hidden_states)
920
+
921
+ concatenated_preds = torch.cat([reg_head, tokreg_head], dim=-1)
922
+ return concatenated_preds
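
For reference, the distance-aware bucketing above (M5's variant of T5's `_relative_position_bucket`) can be sketched in plain Python. This is a hypothetical scalar re-implementation, not the model's tensorized code; `num_buckets=48` comes from the config table, while `max_distance=128` is an assumed default:

```python
import math

NUM_BUCKETS = 48        # relative_attention_num_buckets (config table)
MAX_DISTANCE = 128      # relative_attention_max_distance (assumed default)
INT16_MAX = 2**15 - 1   # np.iinfo(np.int16).max, reserved for "infinite" distance

def relative_position_bucket(distance: int) -> int:
    """Scalar sketch of M5Attention._relative_position_bucket."""
    rp = max(distance + 1, 0)           # shift by +1 and clip negatives to 0
    if rp == INT16_MAX + 1:             # the special sentinel -> last bucket
        return NUM_BUCKETS - 1
    max_exact = NUM_BUCKETS // 2        # first half: one bucket per exact distance
    if rp < max_exact:
        return rp
    num_log_buckets = NUM_BUCKETS - max_exact - 1
    # Remaining distances are binned logarithmically up to MAX_DISTANCE
    large = max_exact + int(
        math.log(rp / max_exact) / math.log(MAX_DISTANCE / max_exact)
        * (NUM_BUCKETS - num_log_buckets)
    )
    return min(large, NUM_BUCKETS - 2)
```

With these values, graph distances 0-22 map to exact buckets 1-23, larger distances share logarithmic buckets up to 46, and the int16-max sentinel lands in the final bucket 47.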
prepare_data.py ADDED
@@ -0,0 +1,147 @@
1
+ import selfies as sf
2
+ from rdkit import Chem
3
+ import ast
4
+ import numpy as np
5
+
6
+
7
+ # Get the correspondence between the original SMILES atom order and the permuted SMILES (for token_regr labels)
8
+ def __get_correspondence__(mol, epoch):
9
+ if epoch == 0:
10
+ new_smiles = Chem.MolToSmiles(mol, canonical=True)
11
+ else:
12
+ new_smiles = Chem.MolToRandomSmilesVect(mol, 1, randomSeed=epoch)[0]
13
+
14
+ output_order = mol.GetProp('_smilesAtomOutputOrder')
15
+ mapping = ast.literal_eval(output_order)
16
+
17
+ return new_smiles, mapping
18
+
19
+ # We already know the [Ring] token connects the token immediately before...
20
+
21
+ def get_ring_masks(mol, map_smiles_to_selfies, tokens):
22
+ # This is fine, atoms are given indices in the molecule based on the order they appear in the SMILES
23
+
24
+ Chem.FastFindRings(mol)
25
+
26
+ rings = mol.GetRingInfo().AtomRings()
27
+ ring_masks = []
28
+ for i, ring in enumerate(rings):
29
+ selfies_ring = map_smiles_to_selfies[list(ring)]
30
+ ring_idx = selfies_ring.max()+1
31
+ ring_masks.append((ring_idx, selfies_ring))
32
+ assert "Ring" in tokens[ring_idx]
33
+
34
+ return ring_masks
35
+
36
+
37
+ # Distances are set to 0 between each "." token and the molecule tokens to its left and right (padding tokens excluded)
38
+ def __get_attribution_mapping__(tokens):
39
+ special_token_masks = []
40
+ map_smiles_to_selfies = []
41
+ dots = []
42
+
43
+ idx = 1 # Start after [CLS]
44
+
45
+ while idx < len(tokens):
46
+ token = tokens[idx]
47
+
48
+ if token == ".":
49
+ dots.append(idx)
50
+ idx += 1
51
+ continue
52
+
53
+ branch_idx = token.find("Branch")
54
+ if branch_idx >= 0:
55
+ n = int(token[branch_idx + 6])
56
+ special_token_masks.append(np.arange(idx, idx + n + 1, dtype=np.int16))
57
+ idx += n + 1
58
+ continue
59
+ else:
60
+ ring_idx = token.find("Ring")
61
+ if ring_idx >= 0:
62
+ n = int(token[ring_idx + 4])
63
+ special_token_masks.append(np.arange(idx, idx + n + 1, dtype=np.int16))
64
+ idx += n + 1
65
+ continue
66
+
67
+ # Real (atom) token
68
+ map_smiles_to_selfies.append(idx)
69
+ idx += 1
70
+
71
+ # Existing dot_masks construction (unchanged)
72
+ dot_masks = []
73
+ last_dots = [-1]
74
+ for dot_idx in dots:
75
+ if len(last_dots) == 2:
76
+ val = last_dots.pop(0)
77
+ dot_masks.append([el for el in range(val + 1, dot_idx, 1)])
78
+ last_dots.append(dot_idx)
79
+
80
+ if len(dots) >= 1:
81
+ dot_masks.append([el for el in range(last_dots.pop(0) + 1, len(tokens), 1)])
82
+
83
+ return special_token_masks, np.array(map_smiles_to_selfies), list(zip(dots, dot_masks, strict=True))
84
+
85
+ def __get_positional_encodings__(mol, smiles_to_selfies, context_length, special_token_masks, double_masks, first_padding_token_idx):
86
+ ats = np.array(smiles_to_selfies, dtype=np.int64)
87
+ distance = Chem.GetDistanceMatrix(mol)
88
+
89
+ # Distance of encodings is capped at the int16 upper bound minus 1
90
+ # (because the int16 upper bound value is reserved for special distances)
91
+ limit = np.iinfo(np.int16).max
92
+ distance = np.minimum(distance, limit-1).astype(np.int16)
93
+
94
+ pos_encod = np.full((context_length, context_length), limit, dtype=np.int16)
95
+
96
+ # Set first row and column to 0 only for non-padding tokens (positions in ats)
97
+ pos_encod[0, :first_padding_token_idx] = 0
98
+ pos_encod[:first_padding_token_idx, 0] = 0
99
+
100
+ for m in special_token_masks:
101
+ pos_encod[m[:, None], m] = -1
102
+
103
+ for i, m in double_masks:
104
+ pos_encod[i, m] = 0
105
+ pos_encod[m, i] = 0
106
+
107
+ np.fill_diagonal(pos_encod, 0)
108
+
109
+ # Use advanced indexing for distance assignment
110
+ pos_encod[ats[:, None], ats] = distance
111
+
112
+ return pos_encod
113
+
114
+ def get_positional_encodings_and_align(smiles, token_regr, epoch):
115
+ orig_mol = Chem.MolFromSmiles(smiles, sanitize=False)
116
+
117
+ # Converts SMILES to the final SMILES so that the mapping is already correct for the token-level labels.
118
+ # Generates a predictable variation of the SMILES.
119
+ new_smiles, mapping_to_new = __get_correspondence__(orig_mol, epoch)
120
+
121
+ # Convert to SELFIES, simulate tokenization and add [CLS] token at the beginning
122
+ selfies = sf.encoder(new_smiles)
123
+ tokens = ["[CLS]"] + list(sf.split_selfies(selfies))
124
+
125
+ special_token_masks, map_smiles_to_selfies, dot_masks = __get_attribution_mapping__(tokens)
126
+
127
+ # Align token labels to SELFIES tokens
128
+ if token_regr is not None:
129
+ # Align token labels to the new SMILES
130
+ token_regr[:len(mapping_to_new)] = token_regr[mapping_to_new]
131
+
132
+ token_regr_selfies = np.full(len(tokens)-1, np.nan, dtype=token_regr.dtype)
133
+
134
+ valid = map_smiles_to_selfies < len(tokens)
135
+ token_regr_selfies[map_smiles_to_selfies[valid] - 1] = token_regr[:np.sum(valid)]
136
+ else:
137
+ token_regr_selfies = None
138
+
139
+ # Generate molecule from the new SMILES (remove sanitization to preserve the original structure)
140
+ mol = Chem.MolFromSmiles(new_smiles, sanitize=False)
141
+
142
+ ring_masks = get_ring_masks(mol, map_smiles_to_selfies, tokens)
143
+ double_masks = ring_masks + dot_masks
144
+ pos_encod = __get_positional_encodings__(mol, map_smiles_to_selfies, len(tokens), special_token_masks, double_masks, len(tokens))
145
+
146
+ return selfies, pos_encod, token_regr_selfies
147
+
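
The matrix initialisation in `__get_positional_encodings__` above relies on sentinel values: every token pair starts at the int16 maximum (reserved for "infinite" distance), the [CLS] row and column are zeroed for non-padding positions, and the diagonal is zeroed before the capped graph distances are written in. A minimal pure-Python sketch of just that initialisation (plain lists instead of NumPy; the special-token masks and the graph-distance write-back are omitted):

```python
LIMIT = 2**15 - 1  # np.iinfo(np.int16).max: sentinel for "infinite" distance

def init_pos_encod(context_length: int, n_real: int) -> list:
    """Hypothetical sketch of the sentinel initialisation only."""
    # Every token pair starts at the reserved "infinite" distance.
    m = [[LIMIT] * context_length for _ in range(context_length)]
    # [CLS] (index 0) is at distance 0 from every non-padding token.
    for j in range(n_real):
        m[0][j] = 0
        m[j][0] = 0
    # Each token is at distance 0 from itself.
    for i in range(context_length):
        m[i][i] = 0
    return m
```

Pairs that never receive a real graph distance (e.g. padding positions) keep the sentinel, which the bucket function later maps to the dedicated last bucket.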
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "cls_token": "[CLS]",
3
+ "mask_token": "[MASK]",
4
+ "pad_token": "[PAD]",
5
+ "unk_token": "[UNK]"
6
+ }
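
The tokenizer (tokenizer.json) pre-tokenizes with the regex `\[.+?\]|\.`: each bracketed SELFIES symbol becomes one token, and the fragment separator `.` is a token of its own. A minimal sketch of the same split in plain Python:

```python
import re

# Same pattern as the "Split" pre-tokenizer in tokenizer.json.
SELFIES_SPLIT = re.compile(r"\[.+?\]|\.")

def pre_tokenize(selfies: str) -> list:
    """Split a SELFIES string into tokenizer-level symbols."""
    return SELFIES_SPLIT.findall(selfies)
```

For example, `pre_tokenize("[C][=C][O].[Na+1]")` yields the five symbol tokens plus the `.` separator, each of which is then looked up in the WordLevel vocabulary.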
tokenizer.json ADDED
@@ -0,0 +1,1139 @@
1
+ {
2
+ "version": "1.0",
3
+ "truncation": null,
4
+ "padding": null,
5
+ "added_tokens": [
6
+ {
7
+ "id": 0,
8
+ "content": "[UNK]",
9
+ "single_word": false,
10
+ "lstrip": false,
11
+ "rstrip": false,
12
+ "normalized": false,
13
+ "special": true
14
+ },
15
+ {
16
+ "id": 1,
17
+ "content": "[CLS]",
18
+ "single_word": false,
19
+ "lstrip": false,
20
+ "rstrip": false,
21
+ "normalized": false,
22
+ "special": true
23
+ },
24
+ {
25
+ "id": 2,
26
+ "content": "[PAD]",
27
+ "single_word": false,
28
+ "lstrip": false,
29
+ "rstrip": false,
30
+ "normalized": false,
31
+ "special": true
32
+ },
33
+ {
34
+ "id": 3,
35
+ "content": "[MASK]",
36
+ "single_word": false,
37
+ "lstrip": false,
38
+ "rstrip": false,
39
+ "normalized": false,
40
+ "special": true
41
+ }
42
+ ],
43
+ "normalizer": {
44
+ "type": "Replace",
45
+ "pattern": {
46
+ "String": "\n"
47
+ },
48
+ "content": ""
49
+ },
50
+ "pre_tokenizer": {
51
+ "type": "Split",
52
+ "pattern": {
53
+ "Regex": "\\[.+?\\]|\\."
54
+ },
55
+ "behavior": "Isolated",
56
+ "invert": false
57
+ },
58
+ "post_processor": {
59
+ "type": "TemplateProcessing",
60
+ "single": [
61
+ {
62
+ "SpecialToken": {
63
+ "id": "[CLS]",
64
+ "type_id": 0
65
+ }
66
+ },
67
+ {
68
+ "Sequence": {
69
+ "id": "A",
70
+ "type_id": 0
71
+ }
72
+ }
73
+ ],
74
+ "pair": [
75
+ {
76
+ "Sequence": {
77
+ "id": "A",
78
+ "type_id": 0
79
+ }
80
+ },
81
+ {
82
+ "Sequence": {
83
+ "id": "B",
84
+ "type_id": 1
85
+ }
86
+ }
87
+ ],
88
+ "special_tokens": {
89
+ "[CLS]": {
90
+ "id": "[CLS]",
91
+ "ids": [
92
+ 1
93
+ ],
94
+ "tokens": [
95
+ "[CLS]"
96
+ ]
97
+ }
98
+ }
99
+ },
100
+ "decoder": null,
101
+ "model": {
102
+ "type": "WordLevel",
103
+ "vocab": {
104
+ "[UNK]": 0,
+ "[CLS]": 1,
+ "[PAD]": 2,
+ "[MASK]": 3,
+ "[C]": 4,
+ "[=C]": 5,
+ "[=Branch1]": 6,
+ "[Branch1]": 7,
+ "[Ring1]": 8,
+ "[N]": 9,
+ "[=O]": 10,
+ "[O]": 11,
+ "[Ring2]": 12,
+ "[=N]": 13,
+ "[Branch2]": 14,
+ "[F]": 15,
+ "[S]": 16,
+ "[=Branch2]": 17,
+ "[Cl]": 18,
+ "[#Branch2]": 19,
+ "[#Branch1]": 20,
+ "[C@@H1]": 21,
+ "[C@H1]": 22,
+ "[Br]": 23,
+ "[#C]": 24,
+ "[P]": 25,
+ "[/C]": 26,
+ "[O-1]": 27,
+ "[#N]": 28,
+ "[N+1]": 29,
+ ".": 30,
+ "[=S]": 31,
+ "[I]": 32,
+ "[C@@]": 33,
+ "[C@]": 34,
+ "[/N]": 35,
+ "[Si]": 36,
+ "[2H]": 37,
+ "[/O]": 38,
+ "[=N+1]": 39,
+ "[B]": 40,
+ "[/S]": 41,
+ "[=N-1]": 42,
+ "[Na+1]": 43,
+ "[Cl-1]": 44,
+ "[#C-1]": 45,
+ "[NH1+1]": 46,
+ "[BH0]": 47,
+ "[K+1]": 48,
+ "[Br-1]": 49,
+ "[S@@]": 50,
+ "[/C@H1]": 51,
+ "[S@]": 52,
+ "[P+1]": 53,
+ "[NH3+1]": 54,
+ "[/Cl]": 55,
+ "[/C@@H1]": 56,
+ "[Se]": 57,
+ "[NH2+1]": 58,
+ "[I-1]": 59,
+ "[C-1]": 60,
+ "[Li+1]": 61,
+ "[B-1]": 62,
+ "[#N+1]": 63,
+ "[3H]": 64,
+ "[/N+1]": 65,
+ "[N-1]": 66,
+ "[CH1]": 67,
+ "[H+1]": 68,
+ "[13C]": 69,
+ "[S-1]": 70,
+ "[CH2-1]": 71,
+ "[Mg+2]": 72,
+ "[P@@]": 73,
+ "[=P]": 74,
+ "[P@]": 75,
+ "[S+1]": 76,
+ "[/F]": 77,
+ "[/O-1]": 78,
+ "[As]": 79,
+ "[/Br]": 80,
+ "[SiH1]": 81,
+ "[18F]": 82,
+ "[NH4+1]": 83,
+ "[Al]": 84,
+ "[13CH2]": 85,
+ "[Ge]": 86,
+ "[Sn]": 87,
+ "[Ca+2]": 88,
+ "[13CH1]": 89,
+ "[OH1-1]": 90,
+ "[/I]": 91,
+ "[Zn+2]": 92,
+ "[/Si]": 93,
+ "[=13CH1]": 94,
+ "[=C-1]": 95,
+ "[Zn]": 96,
+ "[Na]": 97,
+ "[SiH2]": 98,
+ "[=NH1+1]": 99,
+ "[/-Ring1]": 100,
+ "[/P]": 101,
+ "[14C]": 102,
+ "[=13C]": 103,
+ "[Te]": 104,
+ "[13CH3]": 105,
+ "[H]": 106,
+ "[Li]": 107,
+ "[Mg]": 108,
+ "[CH1-1]": 109,
+ "[PH1+1]": 110,
+ "[=Se]": 111,
+ "[Zn+1]": 112,
+ "[SiH3]": 113,
+ "[/C@@]": 114,
+ "[/C@]": 115,
+ "[#P]": 116,
+ "[P-1]": 117,
+ "[15NH1]": 118,
+ "[=NH2+1]": 119,
+ "[PH3+1]": 120,
+ "[F-1]": 121,
+ "[CH0]": 122,
+ "[13C@@H1]": 123,
+ "[11CH3]": 124,
+ "[Ca]": 125,
+ "[15N]": 126,
+ "[13C@H1]": 127,
+ "[14CH1]": 128,
+ "[Cu+1]": 129,
+ "[14CH2]": 130,
+ "[15NH2]": 131,
+ "[NH1-1]": 132,
+ "[=14CH1]": 133,
+ "[125I]": 134,
+ "[=O+1]": 135,
+ "[Sb]": 136,
+ "[CH2]": 137,
+ "[SeH1]": 138,
+ "[SH2+1]": 139,
+ "[Ga]": 140,
+ "[11C]": 141,
+ "[=14C]": 142,
+ "[CH3-1]": 143,
+ "[14CH3]": 144,
+ "[=15N]": 145,
+ "[123I]": 146,
+ "[Al+1]": 147,
+ "[=Si]": 148,
+ "[=18O]": 149,
+ "[K]": 150,
+ "[Sn+2]": 151,
+ "[H-1]": 152,
+ "[OH0]": 153,
+ "[PH2+1]": 154,
+ "[OH2+1]": 155,
+ "[CH2+1]": 156,
+ "[/Se]": 157,
+ "[=CH0]": 158,
+ "[Se-1]": 159,
+ "[Al-1]": 160,
+ "[Sb-1]": 161,
+ "[O+1]": 162,
+ "[In]": 163,
+ "[C+1]": 164,
+ "[/S@]": 165,
+ "[N@+1]": 166,
+ "[Cu]": 167,
+ "[131I]": 168,
+ "[SnH1]": 169,
+ "[/S@@]": 170,
+ "[=CH1-1]": 171,
+ "[N@@+1]": 172,
+ "[1H]": 173,
+ "[18OH1]": 174,
+ "[GeH1]": 175,
+ "[=S@]": 176,
+ "[/P+1]": 177,
+ "[19F]": 178,
+ "[Al+3]": 179,
+ "[14C@H1]": 180,
+ "[As+1]": 181,
+ "[14C@@H1]": 182,
+ "[18O]": 183,
+ "[Si@@]": 184,
+ "[SnH2]": 185,
+ "[GeH3]": 186,
+ "[=S@@]": 187,
+ "[HH1]": 188,
+ "[Sn+1]": 189,
+ "[GeH2]": 190,
+ "[Si@]": 191,
+ "[#O+1]": 192,
+ "[CH1+1]": 193,
+ "[#S]": 194,
+ "[SnH3]": 195,
+ "[AsH1]": 196,
+ "[15N+1]": 197,
+ "[#NH1+1]": 198,
+ "[124I]": 199,
+ "[11CH2]": 200,
+ "[/-Ring2]": 201,
+ "[Al+2]": 202,
+ "[16OH1]": 203,
+ "[Si-1]": 204,
+ "[Ar]": 205,
+ "[/13CH1]": 206,
+ "[/2H]": 207,
+ "[13C@@]": 208,
+ "[PH1-1]": 209,
+ "[#15N]": 210,
+ "[/13C]": 211,
+ "[NH0]": 212,
+ "[13C@]": 213,
+ "[12C]": 214,
+ "[Ag]": 215,
+ "[BH3-1]": 216,
+ "[=C+1]": 217,
+ "[NH2-1]": 218,
+ "[Pd]": 219,
+ "[AsH2]": 220,
+ "[As-1]": 221,
+ "[=Te]": 222,
+ "[Ti]": 223,
+ "[Be+2]": 224,
+ "[PH4+1]": 225,
+ "[BH2-1]": 226,
+ "[#CH0]": 227,
+ "[=13CH2]": 228,
+ "[SH1+1]": 229,
+ "[32P]": 230,
+ "[/NH1+1]": 231,
+ "[=CH1]": 232,
+ "[35S]": 233,
+ "[/Te]": 234,
+ "[Be]": 235,
+ "[Ni+2]": 236,
+ "[SH1-1]": 237,
+ "[=17O]": 238,
+ "[/Ge]": 239,
+ "[11CH1]": 240,
+ "[/CH2-1]": 241,
+ "[/SiH1]": 242,
+ "[Se+1]": 243,
+ "[=OH1+1]": 244,
+ "[NH1]": 245,
+ "[OH1+1]": 246,
+ "[O-2]": 247,
+ "[=NH0]": 248,
+ "[=SiH2]": 249,
+ "[BH1-1]": 250,
+ "[TeH1]": 251,
+ "[76Br]": 252,
+ "[SH3+1]": 253,
+ "[/123I]": 254,
+ "[#13C]": 255,
+ "[=SiH1]": 256,
+ "[17OH1]": 257,
+ "[SH0]": 258,
+ "[Si@@H1]": 259,
+ "[=As]": 260,
+ "[18O-1]": 261,
+ "[/13CH2]": 262,
+ "[Si@H1]": 263,
+ "[17O]": 264,
+ "[35Cl]": 265,
+ "[3HH1]": 266,
+ "[=S+1]": 267,
+ "[Pd+2]": 268,
+ "[/B]": 269,
+ "[37Cl]": 270,
+ "[P-2]": 271,
+ "[Si+1]": 272,
+ "[/125I]": 273,
+ "[=SH1+1]": 274,
+ "[Cd]": 275,
+ "[Te+1]": 276,
+ "[Zn-2]": 277,
+ "[/Al]": 278,
+ "[=Sn]": 279,
+ "[12CH1]": 280,
+ "[=11C]": 281,
+ "[=15NH1]": 282,
+ "[16N]": 283,
+ "[Ag+1]": 284,
+ "[=Ge]": 285,
+ "[=CH1+1]": 286,
+ "[Pd+1]": 287,
+ "[12CH3]": 288,
+ "[=Zn]": 289,
+ "[/C-1]": 290,
+ "[/S-1]": 291,
+ "[Ti+4]": 292,
+ "[14NH1]": 293,
+ "[Ga+3]": 294,
+ "[GeH4]": 295,
+ "[#Si]": 296,
+ "[16O]": 297,
+ "[AlH1]": 298,
+ "[AlH2]": 299,
+ "[#C+1]": 300,
+ "[=P@@]": 301,
+ "[=Ring1]": 302,
+ "[/13CH3]": 303,
+ "[/S+1]": 304,
+ "[=P@]": 305,
+ "[PH0]": 306,
+ "[10B]": 307,
+ "[77Br]": 308,
+ "[=12CH1]": 309,
+ "[Ti+2]": 310,
+ "[/14C]": 311,
+ "[/CH1]": 312,
+ "[/SiH2]": 313,
+ "[He]": 314,
+ "[/N-1]": 315,
+ "[/NH3+1]": 316,
+ "[13NH2]": 317,
+ "[SbH2]": 318,
+ "[/As]": 319,
+ "[12CH2]": 320,
+ "[=14N]": 321,
+ "[/14CH1]": 322,
+ "[=12C]": 323,
+ "[=35S]": 324,
+ "[=P+1]": 325,
+ "[=16O]": 326,
+ "[=Ti]": 327,
+ "[In+1]": 328,
+ "[Pt+2]": 329,
+ "[#13CH1]": 330,
+ "[14NH2]": 331,
+ "[2HH1]": 332,
+ "[=Ti+2]": 333,
+ "[S-2]": 334,
+ "[14N]": 335,
+ "[33P]": 336,
+ "[Pt]": 337,
+ "[=11CH1]": 338,
+ "[AlH3]": 339,
+ "[BH4-1]": 340,
+ "[Ni]": 341,
+ "[/SiH3]": 342,
+ "[Cd+2]": 343,
+ "[Cr]": 344,
+ "[PH2-1]": 345,
+ "[Pb]": 346,
+ "[Sn+3]": 347,
+ "[Sn+4]": 348,
+ "[/Sn]": 349,
+ "[15NH3+1]": 350,
+ "[75Se]": 351,
+ "[=GeH2]": 352,
+ "[In-1]": 353,
+ "[Sn-1]": 354,
+ "[13C-1]": 355,
+ "[16NH1]": 356,
+ "[=14CH2]": 357,
+ "[Hg]": 358,
+ "[In+3]": 359,
+ "[Rh]": 360,
+ "[Ru]": 361,
+ "[Sc+3]": 362,
+ "[/131I]": 363,
+ "[/NH2+1]": 364,
+ "[34S]": 365,
+ "[35SH1]": 366,
+ "[P@H1]": 367,
+ "[PH1]": 368,
+ "[#14N]": 369,
+ "[14C@]": 370,
+ "[=S-1]": 371,
+ "[Cu+2]": 372,
+ "[I+1]": 373,
+ "[Sb+1]": 374,
+ "[/SeH1]": 375,
+ "[=P-1]": 376,
+ "[Sb+3]": 377,
+ "[127I]": 378,
+ "[14C@@]": 379,
+ "[15NH3]": 380,
+ "[16NH2]": 381,
+ "[75Br]": 382,
+ "[=Al]": 383,
+ "[=Mg]": 384,
+ "[Hg+2]": 385,
+ "[P@@H1]": 386,
+ "[Ti-2]": 387,
+ "[#13C-1]": 388,
+ "[#As]": 389,
+ "[/11C]": 390,
+ "[/P@]": 391,
+ "[14C-1]": 392,
+ "[18F-1]": 393,
+ "[6Li+1]": 394,
+ "[82Br]": 395,
+ "[=15N+1]": 396,
+ "[=34S]": 397,
+ "[Au]": 398,
+ "[Ga-1]": 399,
+ "[Kr]": 400,
+ "[Li-1]": 401,
+ "[Rh+3]": 402,
+ "[V]": 403,
+ "[15NH4+1]": 404,
+ "[7Li+1]": 405,
+ "[=BH0]": 406,
+ "[Cs+1]": 407,
+ "[Fe]": 408,
+ "[Ru+2]": 409,
+ "[#15N+1]": 410,
+ "[/18F]": 411,
+ "[/CH0]": 412,
+ "[/P@@]": 413,
+ "[80Br]": 414,
+ "[AlH2-1]": 415,
+ "[Bi]": 416,
+ "[GaH2]": 417,
+ "[PH2]": 418,
+ "[/14CH2]": 419,
+ "[/15NH1]": 420,
+ "[/76Br]": 421,
+ "[/CH1-1]": 422,
+ "[/PH1+1]": 423,
+ "[36Cl]": 424,
+ "[4H]": 425,
+ "[79Br]": 426,
+ "[=Ca]": 427,
+ "[=GeH1]": 428,
+ "[=Pd]": 429,
+ "[Au+1]": 430,
+ "[GaH3]": 431,
+ "[Gd]": 432,
+ "[Hf]": 433,
+ "[Pd-2]": 434,
+ "[SbH1]": 435,
+ "[Ti+3]": 436,
+ "[Y]": 437,
+ "[Zr]": 438,
+ "[#Sb]": 439,
+ "[#Si+1]": 440,
+ "[#SiH1]": 441,
+ "[121I]": 442,
+ "[13N]": 443,
+ "[14NH3]": 444,
+ "[15O]": 445,
+ "[17F]": 446,
+ "[28Si]": 447,
+ "[2H-1]": 448,
+ "[3H-1]": 449,
+ "[6H]": 450,
+ "[8CH1]": 451,
+ "[=15O]": 452,
+ "[=SiH1-1]": 453,
+ "[AlH1+2]": 454,
+ "[AlH1-1]": 455,
+ "[Ba+2]": 456,
+ "[Ba]": 457,
+ "[CH3+1]": 458,
+ "[Mg+1]": 459,
+ "[Ne]": 460,
+ "[OH3+1]": 461,
+ "[Si-2]": 462,
+ "[SiH4]": 463,
+ "[SnH4]": 464,
+ "[#Si-1]": 465,
+ "[/13C@@H1]": 466,
+ "[/13C@H1]": 467,
+ "[11C@H1]": 468,
+ "[11CH4]": 469,
+ "[122I]": 470,
+ "[125I-1]": 471,
+ "[14N+1]": 472,
+ "[15OH1]": 473,
+ "[17O-1]": 474,
+ "[18FH1]": 475,
+ "[5H]": 476,
+ "[77Se]": 477,
+ "[=33S]": 478,
+ "[=SH2]": 479,
+ "[AsH3]": 480,
+ "[BH1]": 481,
+ "[InH2]": 482,
+ "[Lu+3]": 483,
+ "[Mo]": 484,
+ "[Ti+1]": 485,
+ "[Y+3]": 486,
+ "[Zr+2]": 487,
+ "[#13N]": 488,
+ "[#Ge]": 489,
+ "[#Nb]": 490,
+ "[#P+1]": 491,
+ "[/Ga]": 492,
+ "[11C-1]": 493,
+ "[11C@@H1]": 494,
+ "[12C@@H1]": 495,
+ "[12C@@]": 496,
+ "[12C@]": 497,
+ "[131I-1]": 498,
+ "[13CH4]": 499,
+ "[15C]": 500,
+ "[2H+1]": 501,
+ "[8CH2]": 502,
+ "[=13O]": 503,
+ "[=14C-1]": 504,
+ "[=15N-1]": 505,
+ "[=AsH3]": 506,
+ "[=Pt]": 507,
+ "[Al-2]": 508,
+ "[AlH4-1]": 509,
+ "[Au-1]": 510,
+ "[Hg+1]": 511,
+ "[Ru+3]": 512,
+ "[SiH3-1]": 513,
+ "[Ta]": 514,
+ "[#Al]": 515,
+ "[#S+1]": 516,
+ "[/11CH3]": 517,
+ "[/15NH2]": 518,
+ "[/15N]": 519,
+ "[/Al-1]": 520,
+ "[/GeH1]": 521,
+ "[/NH1-1]": 522,
+ "[11B]": 523,
+ "[11CH3-1]": 524,
+ "[121Sb]": 525,
+ "[123I-1]": 526,
+ "[125IH1]": 527,
+ "[12C-1]": 528,
+ "[12C@H1]": 529,
+ "[13CH3-1]": 530,
+ "[13NH1]": 531,
+ "[14CH3-1]": 532,
+ "[14O]": 533,
+ "[15NH1+1]": 534,
+ "[1H+1]": 535,
+ "[32Cl]": 536,
+ "[33S]": 537,
+ "[68Ga+3]": 538,
+ "[74As]": 539,
+ "[75Ge]": 540,
+ "[82Se]": 541,
+ "[9CH2]": 542,
+ "[=15NH2+1]": 543,
+ "[=AsH1]": 544,
+ "[=Cr]": 545,
+ "[=Cu]": 546,
+ "[=Ga]": 547,
+ "[=Ni]": 548,
+ "[=Os]": 549,
+ "[=Sb]": 550,
+ "[=SeH1]": 551,
+ "[=SnH2]": 552,
+ "[=TeH2]": 553,
+ "[=Zr]": 554,
+ "[Al-3]": 555,
+ "[Co+3]": 556,
+ "[Fe+2]": 557,
+ "[GaH1]": 558,
+ "[Ge-2]": 559,
+ "[InH1]": 560,
+ "[Os]": 561,
+ "[Rb+1]": 562,
+ "[Sc]": 563,
+ "[SiH1-1]": 564,
+ "[Sr+2]": 565,
+ "[TeH2]": 566,
+ "[Zr+4]": 567,
+ "[#12CH1]": 568,
+ "[#14C]": 569,
+ "[#17O+1]": 570,
+ "[#18O+1]": 571,
+ "[#AsH1]": 572,
+ "[#Ga]": 573,
+ "[#In]": 574,
+ "[#Lu]": 575,
+ "[#Sc]": 576,
+ "[#Ta]": 577,
+ "[/124I]": 578,
+ "[/35Cl]": 579,
+ "[/37Cl]": 580,
+ "[/As+1]": 581,
+ "[/BH0]": 582,
+ "[/In]": 583,
+ "[/O+1]": 584,
+ "[/Sb]": 585,
+ "[120I]": 586,
+ "[124I-1]": 587,
+ "[129I]": 588,
+ "[14CH4]": 589,
+ "[15N-1]": 590,
+ "[29Si]": 591,
+ "[32PH2]": 592,
+ "[32S]": 593,
+ "[34SH1]": 594,
+ "[35Cl-1]": 595,
+ "[45Ca+2]": 596,
+ "[47Ca+2]": 597,
+ "[70Zn]": 598,
+ "[72Zn]": 599,
+ "[73Ge]": 600,
+ "[74Se]": 601,
+ "[76Br-1]": 602,
+ "[79BrH1]": 603,
+ "[7Be]": 604,
+ "[81BrH1]": 605,
+ "[81Br]": 606,
+ "[8CH4]": 607,
+ "[9CH1]": 608,
+ "[=11CH2]": 609,
+ "[=12CH2]": 610,
+ "[=13N]": 611,
+ "[=18CH2]": 612,
+ "[=32S]": 613,
+ "[=Ag]": 614,
+ "[=AlH1]": 615,
+ "[=Mo]": 616,
+ "[=PH2+1]": 617,
+ "[=SH0]": 618,
+ "[=SeH2]": 619,
+ "[=Ta]": 620,
+ "[=V]": 621,
+ "[=W]": 622,
+ "[Cr+2]": 623,
+ "[Ir]": 624,
+ "[Nb]": 625,
+ "[Ni-2]": 626,
+ "[OH1]": 627,
+ "[PbH3]": 628,
+ "[Rb]": 629,
+ "[Rh+2]": 630,
+ "[SbH1+1]": 631,
+ "[Si+4]": 632,
+ "[Tl+1]": 633,
+ "[Tl+3]": 634,
+ "[#11CH1]": 635,
+ "[#11C]": 636,
+ "[#14C-1]": 637,
+ "[#14CH1]": 638,
+ "[#15O+1]": 639,
+ "[#16O+1]": 640,
+ "[#17CH1]": 641,
+ "[#18CH1]": 642,
+ "[#Cr]": 643,
+ "[#GeH1]": 644,
+ "[#Mo+1]": 645,
+ "[#Mo]": 646,
+ "[#PH2]": 647,
+ "[#SH1-1]": 648,
+ "[#Se]": 649,
+ "[#Sn]": 650,
+ "[#Ti+1]": 651,
+ "[#V]": 652,
+ "[#Y]": 653,
+ "[/127I]": 654,
+ "[/14CH3]": 655,
+ "[/15N+1]": 656,
+ "[/18OH1]": 657,
+ "[/18O]": 658,
+ "[/32P]": 659,
+ "[/80Br]": 660,
+ "[/Al+1]": 661,
+ "[/CH2]": 662,
+ "[/GeH3]": 663,
+ "[/N@+1]": 664,
+ "[/N@@+1]": 665,
+ "[/NH0]": 666,
+ "[/OH0]": 667,
+ "[/PH3+1]": 668,
+ "[/Te+1]": 669,
+ "[/TeH1]": 670,
+ "[100Mo]": 671,
+ "[100Pd]": 672,
+ "[101Mo]": 673,
+ "[101Pd]": 674,
+ "[104Pd]": 675,
+ "[105Pd]": 676,
+ "[108Pd]": 677,
+ "[10B-1]": 678,
+ "[10BH3]": 679,
+ "[10Be]": 680,
+ "[10CH4]": 681,
+ "[10C]": 682,
+ "[111I-1]": 683,
+ "[111IH1]": 684,
+ "[111In+3]": 685,
+ "[111In]": 686,
+ "[112Pd]": 687,
+ "[117SnH2]": 688,
+ "[119Sn]": 689,
+ "[11NH3]": 690,
+ "[120I-1]": 691,
+ "[120IH1]": 692,
+ "[121I-1]": 693,
+ "[121IH1]": 694,
+ "[121SnH2]": 695,
+ "[122IH1]": 696,
+ "[123IH1]": 697,
+ "[123Te]": 698,
+ "[124IH1]": 699,
+ "[124Xe]": 700,
+ "[125Te]": 701,
+ "[126IH1]": 702,
+ "[126Xe]": 703,
+ "[127I-1]": 704,
+ "[127IH1]": 705,
+ "[127Xe]": 706,
+ "[128I-1]": 707,
+ "[128IH1]": 708,
+ "[128I]": 709,
+ "[129I-1]": 710,
+ "[129IH1]": 711,
+ "[12B]": 712,
+ "[12CH4]": 713,
+ "[12Li+1]": 714,
+ "[12OH1]": 715,
+ "[130I-1]": 716,
+ "[130IH1]": 717,
+ "[131IH1]": 718,
+ "[131Xe]": 719,
+ "[132I-1]": 720,
+ "[132IH1]": 721,
+ "[132Xe]": 722,
+ "[133I-1]": 723,
+ "[133IH1]": 724,
+ "[134I-1]": 725,
+ "[134IH1]": 726,
+ "[134Xe]": 727,
+ "[135I-1]": 728,
+ "[135IH1]": 729,
+ "[135I]": 730,
+ "[13CH1+1]": 731,
+ "[13CH2-1]": 732,
+ "[13NH3]": 733,
+ "[13OH2]": 734,
+ "[13O]": 735,
+ "[145Gd]": 736,
+ "[146Gd]": 737,
+ "[147Gd]": 738,
+ "[148Gd]": 739,
+ "[149Gd]": 740,
+ "[14CH2-1]": 741,
+ "[14NH4+1]": 742,
+ "[151Gd]": 743,
+ "[152Gd]": 744,
+ "[153Gd]": 745,
+ "[154Gd]": 746,
+ "[155Gd]": 747,
+ "[156Gd]": 748,
+ "[157Gd]": 749,
+ "[158Gd]": 750,
+ "[159Gd]": 751,
+ "[15CH3]": 752,
+ "[15CH4]": 753,
+ "[15NH2+1]": 754,
+ "[15OH2]": 755,
+ "[160Gd]": 756,
+ "[161Gd]": 757,
+ "[16CH1]": 758,
+ "[16CH3]": 759,
+ "[16C]": 760,
+ "[16F]": 761,
+ "[16NH3]": 762,
+ "[16O-1]": 763,
+ "[16OH1-1]": 764,
+ "[16OH2]": 765,
+ "[177Lu+3]": 766,
+ "[17CH1]": 767,
+ "[17CH2]": 768,
+ "[17FH1]": 769,
+ "[17NH3]": 770,
+ "[17OH1-1]": 771,
+ "[17OH2]": 772,
+ "[18CH1]": 773,
+ "[18CH2]": 774,
+ "[18OH1-1]": 775,
+ "[18OH2]": 776,
+ "[19B]": 777,
+ "[19FH1]": 778,
+ "[19Ne]": 779,
+ "[19OH2]": 780,
+ "[19O]": 781,
+ "[1H-1]": 782,
+ "[1HH1]": 783,
+ "[20CH1]": 784,
+ "[20Ne]": 785,
+ "[20OH1]": 786,
+ "[21CH4]": 787,
+ "[21NH3]": 788,
+ "[21Ne]": 789,
+ "[22CH4]": 790,
+ "[22Na+1]": 791,
+ "[22Ne]": 792,
+ "[24FH1]": 793,
+ "[24Mg]": 794,
+ "[24NH3]": 795,
+ "[24Na+1]": 796,
+ "[25FH1]": 797,
+ "[25Mg]": 798,
+ "[25OH1]": 799,
+ "[26FH1]": 800,
+ "[27Mg]": 801,
+ "[28F]": 802,
+ "[28Mg]": 803,
+ "[28SiH3]": 804,
+ "[30Si]": 805,
+ "[31PH3]": 806,
+ "[31P]": 807,
+ "[31Si]": 808,
+ "[32ClH1]": 809,
+ "[32PH3]": 810,
+ "[32SH2]": 811,
+ "[32Si]": 812,
+ "[33ClH1]": 813,
+ "[33PH3]": 814,
+ "[33SH2]": 815,
+ "[34ClH1]": 816,
+ "[34SH2]": 817,
+ "[35ClH1]": 818,
+ "[35P]": 819,
+ "[35S-1]": 820,
+ "[35SH2]": 821,
+ "[36Ar]": 822,
+ "[36Cl-1]": 823,
+ "[36ClH1]": 824,
+ "[36SH2]": 825,
+ "[37Ar]": 826,
+ "[37Cl-1]": 827,
+ "[37ClH1]": 828,
+ "[37SH2]": 829,
+ "[38Ar]": 830,
+ "[38Cl-1]": 831,
+ "[38ClH1]": 832,
+ "[38PH3]": 833,
+ "[38SH2]": 834,
+ "[39Ar]": 835,
+ "[39ClH1]": 836,
+ "[3He]": 837,
+ "[40Ar]": 838,
+ "[40Ca]": 839,
+ "[40PH3]": 840,
+ "[41Ar]": 841,
+ "[41Ca+2]": 842,
+ "[41Ca]": 843,
+ "[42Ca]": 844,
+ "[42K+1]": 845,
+ "[43Ca+2]": 846,
+ "[43Ca]": 847,
+ "[43K+1]": 848,
+ "[44Ca+2]": 849,
+ "[44Ca]": 850,
+ "[45Ca]": 851,
+ "[46Ca]": 852,
+ "[47Ca]": 853,
+ "[48Ca]": 854,
+ "[49Ca]": 855,
+ "[4HH1]": 856,
+ "[4He]": 857,
+ "[61Cu+1]": 858,
+ "[62Cu+1]": 859,
+ "[62Zn]": 860,
+ "[63Zn]": 861,
+ "[64Cu+1]": 862,
+ "[64Cu]": 863,
+ "[64Zn+2]": 864,
+ "[64Zn]": 865,
+ "[65Zn+2]": 866,
+ "[65Zn]": 867,
+ "[66Ge]": 868,
+ "[66Zn]": 869,
+ "[67Ga+3]": 870,
+ "[67Ge]": 871,
+ "[67Zn]": 872,
+ "[68Ga]": 873,
+ "[68Ge]": 874,
+ "[68Zn]": 875,
+ "[69Ge]": 876,
+ "[69Zn]": 877,
+ "[6He]": 878,
+ "[70As]": 879,
+ "[70Se]": 880,
+ "[71As]": 881,
+ "[71Ge]": 882,
+ "[71Se]": 883,
+ "[71Zn]": 884,
+ "[72As]": 885,
+ "[72BrH1]": 886,
+ "[72Ge]": 887,
+ "[72Se]": 888,
+ "[73Se]": 889,
+ "[74Br-1]": 890,
+ "[74BrH1]": 891,
+ "[74Ge]": 892,
+ "[74Kr]": 893,
+ "[75Br-1]": 894,
+ "[75BrH1]": 895,
+ "[76As]": 896,
+ "[76BrH1]": 897,
+ "[76Kr]": 898,
+ "[76Se]": 899,
+ "[77As]": 900,
+ "[77Br-1]": 901,
+ "[77BrH1]": 902,
+ "[77Ge]": 903,
+ "[77Kr]": 904,
+ "[78BrH1]": 905,
+ "[78Ge]": 906,
+ "[78Kr]": 907,
+ "[78Se]": 908,
+ "[79Kr]": 909,
+ "[79Se]": 910,
+ "[80Br-1]": 911,
+ "[80BrH1]": 912,
+ "[80Kr]": 913,
+ "[80Se]": 914,
+ "[80Sr]": 915,
+ "[81Kr]": 916,
+ "[81Se]": 917,
+ "[82Br-1]": 918,
+ "[82BrH1]": 919,
+ "[82Kr]": 920,
+ "[82Rb+1]": 921,
+ "[83Br-1]": 922,
+ "[83BrH1]": 923,
+ "[83Kr]": 924,
+ "[83Se]": 925,
+ "[84BrH1]": 926,
+ "[84Kr]": 927,
+ "[85Br]": 928,
+ "[85Kr]": 929,
+ "[86Kr]": 930,
+ "[86Rb+1]": 931,
+ "[86Zr]": 932,
+ "[87Kr]": 933,
+ "[87Sr]": 934,
+ "[88Kr]": 935,
+ "[88Zr]": 936,
+ "[89Kr]": 937,
+ "[89Zr]": 938,
+ "[8B]": 939,
+ "[8Be]": 940,
+ "[8He]": 941,
+ "[90Mo]": 942,
+ "[90Y+3]": 943,
+ "[90Zr]": 944,
+ "[91Y+3]": 945,
+ "[92Mo]": 946,
+ "[92Sr]": 947,
+ "[93Mo]": 948,
+ "[93Zr]": 949,
+ "[94Zr]": 950,
+ "[95Mo]": 951,
+ "[95Zr]": 952,
+ "[96Mo]": 953,
+ "[97Mo]": 954,
+ "[97Zr]": 955,
+ "[98Mo]": 956,
+ "[99Mo]": 957,
+ "[99Ru+2]": 958,
+ "[9B]": 959,
+ "[9Be]": 960,
+ "[=11NH1]": 961,
+ "[=12O]": 962,
+ "[=13C-1]": 963,
+ "[=14NH1]": 964,
+ "[=16N]": 965,
+ "[=19O]": 966,
+ "[=25O]": 967,
+ "[=77Se]": 968,
+ "[=8CH1]": 969,
+ "[=Al-1]": 970,
+ "[=AsH2]": 971,
+ "[=Ba]": 972,
+ "[=Be]": 973,
+ "[=Cd]": 974,
+ "[=Fe]": 975,
+ "[=Hg]": 976,
+ "[=In]": 977,
+ "[=Mo+4]": 978,
+ "[=Rh]": 979,
+ "[=SH1-1]": 980,
+ "[=Si+1]": 981,
+ "[=Si-1]": 982,
+ "[=SiH1+1]": 983,
+ "[=TeH1]": 984,
+ "[=Ti+1]": 985,
+ "[AlH6-3]": 986,
+ "[As+3]": 987,
+ "[AsH1+1]": 988,
+ "[AsH5]": 989,
+ "[Au+3]": 990,
+ "[Bi+2]": 991,
+ "[Bi+3]": 992,
+ "[Branch3]": 993,
+ "[CH3]": 994,
+ "[Cr+4]": 995,
+ "[CuH1]": 996,
+ "[Fe+4]": 997,
+ "[Gd+2]": 998,
+ "[Ge+4]": 999,
+ "[Ge-1]": 1000,
+ "[Ge@@H1]": 1001,
+ "[Ge@@]": 1002,
+ "[Ge@]": 1003,
+ "[InH3]": 1004,
+ "[Ir+3]": 1005,
+ "[Mn]": 1006,
+ "[Mo+2]": 1007,
+ "[Nb+3]": 1008,
+ "[Pt+4]": 1009,
+ "[Re]": 1010,
+ "[Rh-3]": 1011,
+ "[RhH1+2]": 1012,
+ "[Ru+4]": 1013,
+ "[Ru-2]": 1014,
+ "[RuH1+3]": 1015,
+ "[RuH4]": 1016,
+ "[S@@H1]": 1017,
+ "[SbH3]": 1018,
+ "[SbH5]": 1019,
+ "[SeH2]": 1020,
+ "[Si+2]": 1021,
+ "[SiH2-1]": 1022,
+ "[SiH4-1]": 1023,
+ "[Sr]": 1024,
+ "[TeH3]": 1025,
+ "[TeH4]": 1026,
+ "[TlH2]": 1027,
+ "[Tl]": 1028,
+ "[W]": 1029,
+ "[Xe]": 1030,
+ "[ZnH1+1]": 1031
1136
+ },
+ "unk_token": "[UNK]"
+ }
+ }
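The normalizer, pre-tokenizer, and post-processor above define the encoding pipeline: newlines are removed, the string is split on the regex `\[.+?\]|\.`, and tokens are looked up in the WordLevel vocabulary. A minimal sketch of the split step, using only Python's `re` module (the `pre_tokenize` helper name is illustrative, not part of the repo):

```python
import re

# Pre-tokenization rule from tokenizer.json: split a SELFIES string into
# bracketed symbols and standalone "." separators ("Isolated" behavior).
# The non-greedy ".+?" ensures "[C][=C]" splits into two tokens.
SELFIES_TOKEN = re.compile(r"\[.+?\]|\.")

def pre_tokenize(selfies: str) -> list[str]:
    """Split a SELFIES string the same way the Split pre-tokenizer does."""
    return SELFIES_TOKEN.findall(selfies)

print(pre_tokenize("[C][=C][Branch1][Ring1][O].[Na+1]"))
# → ['[C]', '[=C]', '[Branch1]', '[Ring1]', '[O]', '.', '[Na+1]']
```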
tokenizer_config.json ADDED
@@ -0,0 +1,44 @@
+ {
+ "added_tokens_decoder": {
+ "0": {
+ "content": "[UNK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "1": {
+ "content": "[CLS]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "2": {
+ "content": "[PAD]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "3": {
+ "content": "[MASK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "clean_up_tokenization_spaces": false,
+ "cls_token": "[CLS]",
+ "extra_special_tokens": {},
+ "mask_token": "[MASK]",
+ "model_max_length": 1000000000000000019884624838656,
+ "pad_token": "[PAD]",
+ "tokenizer_class": "PreTrainedTokenizerFast",
+ "unk_token": "[UNK]"
+ }