IlPakoZ commited on
Commit
44f00d0
·
1 Parent(s): 5cd2f4f

Fixed missing casting of positional encodings to int32

Browse files
Files changed (2) hide show
  1. README.md +4 -2
  2. modeling_m5_encoder.py +3 -4
README.md CHANGED
@@ -56,6 +56,7 @@ selfies, pos_encod, _ = model.get_positional_encodings_and_align(smiles, seed=0)
56
  encoding = tokenizer(selfies, return_tensors="pt")
57
  input_ids = encoding["input_ids"]
58
  attn_mask = encoding["attention_mask"]
 
59
  rel_pos = torch.tensor(pos_encod).unsqueeze(0) # (1, seq_len, seq_len)
60
 
61
  outputs = model(input_ids=input_ids, attention_mask=attn_mask, relative_position=rel_pos)
@@ -155,6 +156,7 @@ The HDF5 files are available for download below. These are intended to be proces
155
 
156
  ## Limitations
157
 
158
- - **Token length:** The built-in `prepare_data` helper encodes pairwise molecular-graph distances in an `int16` matrix. Consequently, molecules whose SELFIES tokenization exceeds **32,767 tokens** (`numpy.iinfo(numpy.int16).max`) are not supported. In practice, no molecule in the training dataset approaches this limit.
 
159
  - **Conformer handling:** Duplicate SMILES representing different conformers are kept in the dataset. The model therefore predicts an implicit average over conformers rather than a geometry-specific value, which may reduce accuracy for conformation-sensitive properties.
160
- - **Scope:** The model is pretrained on organic molecules present in PubChemQC. Performance on inorganic compounds, organometallics, or very large macromolecules outside the training distribution has not been evaluated.
 
56
  encoding = tokenizer(selfies, return_tensors="pt")
57
  input_ids = encoding["input_ids"]
58
  attn_mask = encoding["attention_mask"]
59
+
60
  rel_pos = torch.tensor(pos_encod).unsqueeze(0) # (1, seq_len, seq_len)
61
 
62
  outputs = model(input_ids=input_ids, attention_mask=attn_mask, relative_position=rel_pos)
 
156
 
157
  ## Limitations
158
 
159
+ - **Token length:** The built-in `prepare_data` helper encodes pairwise molecular-graph distances in an `int16` matrix.
160
+ This was done to decrease the size of pairwise-distance matrices in case one intends to pre-compute them before training. Due to the `prepare_data` limitations, molecules whose SELFIES tokenization exceeds **32,766 tokens** (`numpy.iinfo(numpy.int16).max - 1`) are not supported. In practice, most molecules will be well below this limit.
161
  - **Conformer handling:** Duplicate SMILES representing different conformers are kept in the dataset. The model therefore predicts an implicit average over conformers rather than a geometry-specific value, which may reduce accuracy for conformation-sensitive properties.
162
+ - **Scope:** The model is pretrained on molecules present in PubChemQC. Performance on certain compound types and large macromolecules outside the training distribution has not been evaluated. Therefore, the model performs best on molecules with MW <= 1000 or a number of heavy atoms <= 79.
modeling_m5_encoder.py CHANGED
@@ -33,8 +33,8 @@ class M5EncoderConfig(T5Config):
33
  dropout_rate = 0,
34
  feed_forward_proj = "gated-gelu",
35
  classifier_dropout=0,
36
- relative_attention_max_distance=128,
37
- relative_attention_num_buckets=48,
38
  vocab_size=1032,
39
  num_decoder_layers=0,
40
  **kwargs,
@@ -263,12 +263,11 @@ class M5EncoderModel(T5EncoderModel):
263
  input_ids=input_ids,
264
  attention_mask=attention_mask,
265
  inputs_embeds=inputs_embeds,
266
-
267
  head_mask=head_mask,
268
  output_attentions=output_attentions,
269
  output_hidden_states=output_hidden_states,
270
  return_dict=return_dict,
271
- relative_position=relative_position
272
  )
273
 
274
  return encoder_outputs
 
33
  dropout_rate = 0,
34
  feed_forward_proj = "gated-gelu",
35
  classifier_dropout=0,
36
+ relative_attention_max_distance=96,
37
+ relative_attention_num_buckets=32,
38
  vocab_size=1032,
39
  num_decoder_layers=0,
40
  **kwargs,
 
263
  input_ids=input_ids,
264
  attention_mask=attention_mask,
265
  inputs_embeds=inputs_embeds,
 
266
  head_mask=head_mask,
267
  output_attentions=output_attentions,
268
  output_hidden_states=output_hidden_states,
269
  return_dict=return_dict,
270
+ relative_position=relative_position.to(dtype=torch.int32) if relative_position is not None else None
271
  )
272
 
273
  return encoder_outputs