---
library_name: transformers
tags:
  - chemistry
  - molecular-property-prediction
  - selfies
  - encoder
license: apache-2.0
---

# M5 Encoder

A SELFIES-based molecular encoder built on a T5 backbone with custom distance-aware relative position encodings. Two classes are available:

| Class | Description |
|---|---|
| `M5Encoder` | Bare encoder; outputs `last_hidden_state` |
| `M5ModelForRegression` | Encoder plus sequence-level and token-level regression heads |

The model is pretrained on multi-task regression tasks, including quantum chemistry (QC) tasks from the PubChemQC B3LYP/PM6 dataset.

## Requirements

This model was implemented and tested with Transformers 4.51.3; other versions may cause compatibility issues.

## Usage

```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
model = AutoModel.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
```

To load `M5ModelForRegression` explicitly:

```python
from transformers import AutoModelForSequenceClassification

regression_model = AutoModelForSequenceClassification.from_pretrained(
    "IlPakoZ/m5-encoder", trust_remote_code=True
)
```

### Preparing inputs

Inputs require SELFIES tokenization and a precomputed distance matrix (`relative_position`). Use the helper bundled in the repo:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)

smiles = "CCO"

# seed=0 produces the canonical SELFIES; other values generate reproducible random variations
selfies, pos_encod, _ = model.get_positional_encodings_and_align(smiles, seed=0)

encoding = tokenizer(selfies, return_tensors="pt")
input_ids = encoding["input_ids"]
attn_mask = encoding["attention_mask"]

rel_pos = torch.tensor(pos_encod).unsqueeze(0)   # (1, seq_len, seq_len)

outputs = model(input_ids=input_ids, attention_mask=attn_mask, relative_position=rel_pos)
hidden = outputs.last_hidden_state   # (1, seq_len, 512)
```
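If a single molecule-level embedding is needed from `last_hidden_state`, one common approach is attention-mask-weighted mean pooling. The sketch below (illustrated with NumPy arrays; not necessarily what `M5ModelForRegression` does internally) averages token embeddings while ignoring padding:

```python
import numpy as np

def mean_pool(hidden, attention_mask):
    """Average token embeddings, ignoring padded positions.

    hidden:         (batch, seq_len, d_model) float array
    attention_mask: (batch, seq_len) array of 0/1
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                   # (batch, d_model)
    counts = mask.sum(axis=1)                              # (batch, 1)
    return summed / counts

# Toy example: batch of 1, seq_len 3 with one padded position, d_model 2
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(hidden, mask))  # [[2. 3.]]
```

The same operation translates directly to torch tensors for use with the model's outputs.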

A method `model.collate_for_dataset` is also available to perform collation for use with PyTorch's `DataLoader`. It takes a list of tuples, each composed of:

- a dictionary with keys `"input_ids"` (`np.ndarray`, shape `(L,)`) and `"attention_mask"` (`np.ndarray`, shape `(L,)`), as produced by the tokenizer;
- the positional encoding matrix;
- (optional) token regression labels. These are kept mostly for reproducibility of the paper's results and can be left as `None` in most circumstances.
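Collation of such tuples can be sketched as follows. This is a hypothetical stand-in for `collate_for_dataset` (the name `pad_batch` and the padding value 0 are assumptions, not the model's actual implementation):

```python
import numpy as np

def pad_batch(examples, pad_id=0):
    """Hypothetical collate sketch: pad variable-length examples to a common length.

    Each example is (features_dict, distance_matrix), mirroring the tuples
    described above. Padding with 0 is an assumption.
    """
    max_len = max(len(feat["input_ids"]) for feat, _ in examples)
    input_ids, attn, rel_pos = [], [], []
    for feat, dist in examples:
        n = len(feat["input_ids"])
        input_ids.append(np.pad(feat["input_ids"], (0, max_len - n), constant_values=pad_id))
        attn.append(np.pad(feat["attention_mask"], (0, max_len - n)))
        rel_pos.append(np.pad(dist, ((0, max_len - n), (0, max_len - n))))
    return {
        "input_ids": np.stack(input_ids),
        "attention_mask": np.stack(attn),
        "relative_position": np.stack(rel_pos),
    }

batch = pad_batch([
    ({"input_ids": np.array([5, 6]), "attention_mask": np.array([1, 1])}, np.zeros((2, 2), np.int16)),
    ({"input_ids": np.array([7, 8, 9]), "attention_mask": np.array([1, 1, 1])}, np.zeros((3, 3), np.int16)),
])
print(batch["input_ids"].shape)  # (2, 3)
```

Passing a function like this as `collate_fn` to `DataLoader` yields uniformly shaped batches.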

## Architecture

| Hyper-parameter | Value |
|---|---|
| `d_model` | 512 |
| `d_ff` | 2048 |
| `d_kv` | 64 |
| `num_layers` | 24 |
| `num_heads` | 12 |
| `vocab_size` | 1032 |
| `feed_forward_proj` | gated-gelu |
| `relative_attention_num_buckets` | 32 |
| `relative_attention_max_distance` | 96 |

Position biases are replaced by molecular-graph distances computed with RDKit and binned with a modified version of T5's logarithmic binning algorithm, giving the model awareness of molecular topology without being overly sensitive to precise distances.
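For illustration, the standard (unmodified) T5 binning scheme applied to nonnegative graph distances looks roughly like this, using the `num_buckets=32` and `max_distance=96` values from the table above; the model's actual modified variant may differ:

```python
import math

def t5_distance_bucket(distance, num_buckets=32, max_distance=96):
    """Standard T5-style binning for a nonnegative integer distance.

    Small distances get one bucket each; larger distances share
    logarithmically spaced buckets, capped at num_buckets - 1.
    """
    max_exact = num_buckets // 2
    if distance < max_exact:
        return distance  # exact bucket for short-range distances
    bucket = max_exact + int(
        math.log(distance / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return min(bucket, num_buckets - 1)

print(t5_distance_bucket(3))    # 3
print(t5_distance_bucket(40))   # 24
print(t5_distance_bucket(500))  # 31
```

Distances at or beyond `max_distance` all fall into the last bucket, which is what makes the encoding robust to very long-range separations.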

## Tasks

Pretraining consists of up to 1085 tasks across five regression heads. Tasks are grouped by source and prediction target:

### Group 0 — General molecular descriptors (RDKit)

| Task | Description |
|---|---|
| `MW` | Molecular weight |
| `TDM` | Total dipole moment |

### Group 1 — Physicochemical properties (RDKit)

| Task | Description |
|---|---|
| `MolLogP` | Wildman–Crippen LogP estimate |
| `MolMR` | Wildman–Crippen molar refractivity |
| `TPSA` | Topological polar surface area |
| `FractionCSP3` | Fraction of sp³ carbons |

### Group 2 — Frontier orbital energies (PubChemQC B3LYP/PM6)

Alpha and beta spin-orbital energies from DFT calculations:

| Task | Description |
|---|---|
| `energy_alpha_homo` | Alpha HOMO energy |
| `energy_alpha_gap` | Alpha HOMO–LUMO gap |
| `energy_alpha_lumo` | Alpha LUMO energy |
| `energy_beta_homo` | Beta HOMO energy |
| `energy_beta_gap` | Beta HOMO–LUMO gap |
| `energy_beta_lumo` | Beta LUMO energy |

### Group 3 — Orbital energies (PubChemQC B3LYP/PM6)

50 linearly sampled energies (`orbital_0`–`orbital_49`) spanning each molecule's full orbital spectrum, predicted at the sequence level.

### Group 4 — Atom Löwdin charges (PubChemQC B3LYP/PM6)

Up to 1023 partial charges (`lowdin_0`–`lowdin_1022`), one per atom, each predicted from the corresponding atom's output token embedding. This head extends well beyond the maximum number of atoms observed in the dataset; in practice, the training set only reaches `lowdin_149`.

## Dataset

The model is pretrained on a processed version of the PubChemQC B3LYP/PM6 dataset. The raw database exposes a `b3lyp_pm6` table (columns: `cid`, `state`, `data` as JSON). The data was extracted, invalid SMILES were removed, relevant features were selected, and the result was saved in compressed HDF5 format. Duplicate SMILES were intentionally retained so the model encounters molecules with multiple conformers and learns a soft compromise across them; this trades auxiliary-task accuracy for richer structural representations. Molecules incompatible with strict SELFIES encoding were discarded.

The processed dataset contains 82,686,706 SMILES sequences, each paired with a full set of labels across all tasks. It is split by scaffold:

| Split | Sequences | Tokens (approx.) |
|---|---|---|
| Train | 66,149,364 | ~2.5 B (×2 with augmentation → ~5 B) |
| Validation | 8,268,673 | tbd |
| Test | 8,268,669 | ~0.82 B (×2 with augmentation → ~1.64 B) |

Training uses augmentation: SELFIES are generated from randomly traversed versions of the original SMILES by the bundled `get_positional_encodings_and_align` method. Labels are normalized before training.
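The exact normalization scheme is not specified here; as a sketch, per-task z-score normalization (an assumption, not necessarily what was used) would look like:

```python
import numpy as np

def zscore_normalize(labels):
    """Per-task z-score normalization: zero mean, unit variance per column.

    labels: (n_samples, n_tasks) array. Returns normalized labels plus the
    (mean, std) needed to invert the transform at inference time.
    """
    mean = labels.mean(axis=0)
    std = labels.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant tasks
    return (labels - mean) / std, mean, std

labels = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
normed, mean, std = zscore_normalize(labels)
print(normed.mean(axis=0))  # ~[0. 0.]
```

Keeping `mean` and `std` is essential: predictions made in normalized space must be mapped back (`pred * std + mean`) to recover physical units.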

The HDF5 files containing the training data will be available for download below (coming soon). These files are used to train our model, but they are first converted to `.lmdb` format with the `data_processing` library in our GitHub repository (coming soon) to ensure fast access and avoid CPU bottlenecks. Because input pre-computation (relative position encodings, input IDs, attention masks, and regression labels with augmentation) is performed during conversion, the resulting LMDB files are currently too large to distribute directly.

| Split | Download |
|---|---|
| Train | train.h5 |
| Validation | validation.h5 |
| Test | test.h5 |

## Limitations

- **Token length**: The built-in `prepare_data` helper encodes pairwise molecular-graph distances in an `int16` matrix to reduce the memory footprint when distance matrices are pre-computed before training. As a result, molecules whose SELFIES tokenization exceeds 32,766 tokens (`numpy.iinfo(numpy.int16).max - 1`) are not supported. In practice, most molecules lie well below this limit.
- **Conformer handling**: Duplicate SMILES representing different conformers are kept in the dataset. The model therefore predicts an implicit average over conformers rather than a geometry-specific value, which may reduce accuracy for conformation-sensitive properties.
- **Scope**: The model is pretrained on molecules present in PubChemQC. Performance on compound classes and large macromolecules outside the training distribution has not been evaluated; the model is therefore expected to be stronger on molecules with MW ≤ 1000 or at most 79 heavy atoms.
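The token-length constraint can be checked up front before building the distance matrix; a minimal sketch (the constant and function names are illustrative, not part of the model's API):

```python
import numpy as np

# Largest sequence length representable in the int16 distance matrix
MAX_SUPPORTED_LEN = np.iinfo(np.int16).max - 1  # 32766

def check_length(num_tokens):
    """Return True if a SELFIES tokenization of this length is supported."""
    return num_tokens <= MAX_SUPPORTED_LEN

print(MAX_SUPPORTED_LEN)     # 32766
print(check_length(120))     # True
print(check_length(40_000))  # False
```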