---
library_name: transformers
tags:
  - chemistry
  - molecular-property-prediction
  - selfies
  - encoder
license: apache-2.0
---

# M5 Encoder

A SELFIES-based molecular encoder built on a T5 backbone with custom distance-aware relative position encodings. Two classes are available:

| Class | Description |
|---|---|
| `M5Encoder` | Bare encoder; outputs `last_hidden_state` |
| `M5ModelForRegression` | Encoder plus sequence-level and token-level regression heads |

The model is pretrained on multi-task regression tasks, including quantum chemistry (QC) tasks from the PubChemQC B3LYP/PM6 dataset.

## Requirements

This model was implemented and tested with Transformers 4.51.3; other versions may cause compatibility issues.

## Usage

```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
model = AutoModel.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
```

To load `M5ModelForRegression` explicitly:

```python
from transformers import AutoModelForSequenceClassification

regression_model = AutoModelForSequenceClassification.from_pretrained(
    "IlPakoZ/m5-encoder", trust_remote_code=True
)
```

### Preparing inputs

Inputs require SELFIES tokenization and a precomputed distance matrix (`relative_position`). Use the helper bundled in the repo:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)

smiles = "CCO"

# seed=0 produces the canonical SELFIES; other values generate reproducible random variations
selfies, pos_encod, _ = model.get_positional_encodings_and_align(smiles, seed=0)

encoding = tokenizer(selfies, return_tensors="pt")
input_ids = encoding["input_ids"]
attn_mask = encoding["attention_mask"]

rel_pos = torch.tensor(pos_encod).unsqueeze(0)   # (1, seq_len, seq_len)

outputs = model(input_ids=input_ids, attention_mask=attn_mask, relative_position=rel_pos)
hidden = outputs.last_hidden_state   # (1, seq_len, 512)
```
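If a single molecule-level embedding is needed from `last_hidden_state`, one common approach is attention-mask-weighted mean pooling. The sketch below (illustrated with NumPy arrays; not necessarily what `M5ModelForRegression` does internally) averages token embeddings while ignoring padding:

```python
import numpy as np

def mean_pool(hidden, attention_mask):
    """Average token embeddings, ignoring padded positions.

    hidden:         (batch, seq_len, d_model) float array
    attention_mask: (batch, seq_len) array of 0/1
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                   # (batch, d_model)
    counts = mask.sum(axis=1)                              # (batch, 1)
    return summed / counts

# Toy example: batch of 1, seq_len 3 with one padded position, d_model 2
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(hidden, mask))  # [[2. 3.]]
```

The same operation translates directly to torch tensors for use with the model's outputs.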

A method `model.collate_for_dataset` is also available to perform collation for use with PyTorch's `DataLoader`. It takes a list of tuples, each composed of:

- a dictionary with keys `"input_ids"` (`np.ndarray`, shape `(L,)`) and `"attention_mask"` (`np.ndarray`, shape `(L,)`), as produced by the tokenizer;
- the positional encoding matrix;
- (optional) token regression labels. These are kept mostly for reproducibility of the paper's results and can be left as `None` in most circumstances.
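Collation of such tuples can be sketched as follows. This is a hypothetical stand-in for `collate_for_dataset` (the name `pad_batch` and the padding value 0 are assumptions, not the model's actual implementation):

```python
import numpy as np

def pad_batch(examples, pad_id=0):
    """Hypothetical collate sketch: pad variable-length examples to a common length.

    Each example is (features_dict, distance_matrix), mirroring the tuples
    described above. Padding with 0 is an assumption.
    """
    max_len = max(len(feat["input_ids"]) for feat, _ in examples)
    input_ids, attn, rel_pos = [], [], []
    for feat, dist in examples:
        n = len(feat["input_ids"])
        input_ids.append(np.pad(feat["input_ids"], (0, max_len - n), constant_values=pad_id))
        attn.append(np.pad(feat["attention_mask"], (0, max_len - n)))
        rel_pos.append(np.pad(dist, ((0, max_len - n), (0, max_len - n))))
    return {
        "input_ids": np.stack(input_ids),
        "attention_mask": np.stack(attn),
        "relative_position": np.stack(rel_pos),
    }

batch = pad_batch([
    ({"input_ids": np.array([5, 6]), "attention_mask": np.array([1, 1])}, np.zeros((2, 2), np.int16)),
    ({"input_ids": np.array([7, 8, 9]), "attention_mask": np.array([1, 1, 1])}, np.zeros((3, 3), np.int16)),
])
print(batch["input_ids"].shape)  # (2, 3)
```

Passing a function like this as `collate_fn` to `DataLoader` yields uniformly shaped batches.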

## Architecture

| Hyper-parameter | Value |
|---|---|
| `d_model` | 512 |
| `d_ff` | 2048 |
| `d_kv` | 64 |
| `num_layers` | 24 |
| `num_heads` | 12 |
| `vocab_size` | 1032 |
| `feed_forward_proj` | gated-gelu |
| `relative_attention_num_buckets` | 32 |
| `relative_attention_max_distance` | 96 |

Position biases are replaced by molecular-graph distances computed with RDKit and binned with a modified version of T5's logarithmic binning algorithm, giving the model awareness of molecular topology without being overly sensitive to precise distances.
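For illustration, the standard (unmodified) T5 binning scheme applied to nonnegative graph distances looks roughly like this, using the `num_buckets=32` and `max_distance=96` values from the table above; the model's actual modified variant may differ:

```python
import math

def t5_distance_bucket(distance, num_buckets=32, max_distance=96):
    """Standard T5-style binning for a nonnegative integer distance.

    Small distances get one bucket each; larger distances share
    logarithmically spaced buckets, capped at num_buckets - 1.
    """
    max_exact = num_buckets // 2
    if distance < max_exact:
        return distance  # exact bucket for short-range distances
    bucket = max_exact + int(
        math.log(distance / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return min(bucket, num_buckets - 1)

print(t5_distance_bucket(3))    # 3
print(t5_distance_bucket(40))   # 24
print(t5_distance_bucket(500))  # 31
```

Distances at or beyond `max_distance` all fall into the last bucket, which is what makes the encoding robust to very long-range separations.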

## Tasks

Pretraining consists of up to 1085 tasks across five regression heads. Tasks are grouped by source and prediction target:

### Group 0 — General molecular descriptors (RDKit)

| Task | Description |
|---|---|
| `MW` | Molecular weight |
| `TDM` | Total dipole moment |

### Group 1 — Physicochemical properties (RDKit)

| Task | Description |
|---|---|
| `MolLogP` | Wildman–Crippen LogP estimate |
| `MolMR` | Wildman–Crippen molar refractivity |
| `TPSA` | Topological polar surface area |
| `FractionCSP3` | Fraction of sp³ carbons |

### Group 2 — Frontier orbital energies (PubChemQC B3LYP/PM6)

Alpha and beta spin-orbital energies from DFT calculations:

| Task | Description |
|---|---|
| `energy_alpha_homo` | Alpha HOMO energy |
| `energy_alpha_gap` | Alpha HOMO–LUMO gap |
| `energy_alpha_lumo` | Alpha LUMO energy |
| `energy_beta_homo` | Beta HOMO energy |
| `energy_beta_gap` | Beta HOMO–LUMO gap |
| `energy_beta_lumo` | Beta LUMO energy |

### Group 3 — Orbital energies (PubChemQC B3LYP/PM6)

50 linearly sampled energies (`orbital_0`–`orbital_49`) spanning each molecule's full orbital spectrum, predicted at the sequence level.

### Group 4 — Atom Löwdin charges (PubChemQC B3LYP/PM6)

Up to 1023 partial charges (`lowdin_0`–`lowdin_1022`), one per atom, each predicted from the corresponding atom's output token embedding. This head extends well beyond the maximum number of atoms observed in the dataset; in practice, the training set only reaches `lowdin_149`.

## Dataset

The model is pretrained on a processed version of the PubChemQC B3LYP/PM6 dataset. The raw database exposes a `b3lyp_pm6` table (columns: `cid`, `state`, `data` as JSON). The data was extracted, invalid SMILES were removed, relevant features were selected, and the result was saved in compressed HDF5 format. Duplicate SMILES were intentionally retained so the model encounters molecules with multiple conformers and learns a soft compromise across them; this trades auxiliary-task accuracy for richer structural representations. Molecules incompatible with strict SELFIES encoding were discarded.

The processed dataset contains 82,686,706 SMILES sequences, each paired with a full set of labels across all tasks. It is split by scaffold:

| Split | Sequences | Tokens (approx.) |
|---|---|---|
| Train | 66,149,364 | ~2.5 B (×2 with augmentation → ~5 B) |
| Validation | 8,268,673 | tbd |
| Test | 8,268,669 | ~0.82 B (×2 with augmentation → ~1.64 B) |

Training uses augmentation: SELFIES are generated from randomly traversed versions of the original SMILES by the bundled `get_positional_encodings_and_align` method. Labels are normalized before training.
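The exact normalization scheme is not specified here; as a sketch, per-task z-score normalization (an assumption, not necessarily what was used) would look like:

```python
import numpy as np

def zscore_normalize(labels):
    """Per-task z-score normalization: zero mean, unit variance per column.

    labels: (n_samples, n_tasks) array. Returns normalized labels plus the
    (mean, std) needed to invert the transform at inference time.
    """
    mean = labels.mean(axis=0)
    std = labels.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant tasks
    return (labels - mean) / std, mean, std

labels = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
normed, mean, std = zscore_normalize(labels)
print(normed.mean(axis=0))  # ~[0. 0.]
```

Keeping `mean` and `std` is essential: predictions made in normalized space must be mapped back (`pred * std + mean`) to recover physical units.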

The HDF5 files containing the training data will be available for download below (coming soon). These files are used to train our model, but they are first converted to `.lmdb` format with the `data_processing` library in our GitHub repository (coming soon) to ensure fast access and avoid CPU bottlenecks. Because input pre-computation (relative position encodings, input IDs, attention masks, and regression labels with augmentation) is performed during conversion, the resulting LMDB files are currently too large to distribute directly.

| Split | Download |
|---|---|
| Train | train.h5 |
| Validation | validation.h5 |
| Test | test.h5 |

## Limitations

- **Token length**: The built-in `prepare_data` helper encodes pairwise molecular-graph distances in an `int16` matrix to reduce the memory footprint when distance matrices are pre-computed before training. As a result, molecules whose SELFIES tokenization exceeds 32,766 tokens (`numpy.iinfo(numpy.int16).max - 1`) are not supported. In practice, most molecules lie well below this limit.
- **Conformer handling**: Duplicate SMILES representing different conformers are kept in the dataset. The model therefore predicts an implicit average over conformers rather than a geometry-specific value, which may reduce accuracy for conformation-sensitive properties.
- **Scope**: The model is pretrained on molecules present in PubChemQC. Performance on compound classes and large macromolecules outside the training distribution has not been evaluated; the model is therefore expected to be stronger on molecules with MW ≤ 1000 or at most 79 heavy atoms.
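The token-length constraint can be checked up front before building the distance matrix; a minimal sketch (the constant and function names are illustrative, not part of the model's API):

```python
import numpy as np

# Largest sequence length representable in the int16 distance matrix
MAX_SUPPORTED_LEN = np.iinfo(np.int16).max - 1  # 32766

def check_length(num_tokens):
    """Return True if a SELFIES tokenization of this length is supported."""
    return num_tokens <= MAX_SUPPORTED_LEN

print(MAX_SUPPORTED_LEN)     # 32766
print(check_length(120))     # True
print(check_length(40_000))  # False
```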