---
library_name: transformers
tags:
- chemistry
- molecular-property-prediction
- selfies
- encoder
license: apache-2.0
---
# M5 Encoder

A SELFIES-based molecular encoder built on a T5 backbone with custom distance-aware relative position encodings. Two classes are available:

| Class | Description |
|---|---|
| `M5Encoder` | Bare encoder; outputs `last_hidden_state` |
| `M5ModelForRegression` | Encoder plus sequence-level and token-level regression heads |

The model is pretrained with multi-task regression, including quantum chemistry (QC) tasks from the PubChemQC B3LYP/PM6 dataset.
## Requirements

This model was implemented and tested with Transformers version 4.51.3; other versions may cause issues.
## Usage

```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
model = AutoModel.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
```

To load `M5ModelForRegression` explicitly:

```python
from transformers import AutoModelForSequenceClassification

regression_model = AutoModelForSequenceClassification.from_pretrained(
    "IlPakoZ/m5-encoder", trust_remote_code=True
)
```
### Preparing inputs

Inputs require SELFIES tokenization and a precomputed distance matrix
(`relative_position`). Use the helper bundled in the repo:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)

smiles = "CCO"
# seed=0 produces the canonical SELFIES; other values generate reproducible random variations
selfies, pos_encod, _ = model.get_positional_encodings_and_align(smiles, seed=0)

encoding = tokenizer(selfies, return_tensors="pt")
input_ids = encoding["input_ids"]
attn_mask = encoding["attention_mask"]
rel_pos = torch.tensor(pos_encod).unsqueeze(0)  # (1, seq_len, seq_len)

outputs = model(input_ids=input_ids, attention_mask=attn_mask, relative_position=rel_pos)
hidden = outputs.last_hidden_state  # (1, seq_len, 512)
```
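To reduce the per-token hidden states to a single molecule-level embedding, one common approach is masked mean pooling over non-padding positions. This is only a generic sketch (written with NumPy for clarity), not necessarily what the model's own regression heads do:

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token embeddings over non-padding positions.

    hidden: (batch, seq_len, d_model); attention_mask: (batch, seq_len).
    A generic pooling strategy, not necessarily the model's own heads.
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (B, L, 1)
    summed = (hidden * mask).sum(axis=1)                   # zero out padding, sum
    counts = np.clip(mask.sum(axis=1), 1e-9, None)         # avoid division by zero
    return summed / counts
```

With torch tensors, the same arithmetic applies after converting with `.numpy()` or rewriting the three lines with torch ops.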
A `model.collate_for_dataset` function is also available to perform collation for use with PyTorch's `DataLoader`. It takes a list of tuples, each composed of:

- a dictionary with keys `"input_ids"` (`np.ndarray`, shape `(L,)`) and `"attention_mask"` (`np.ndarray`, shape `(L,)`), as produced by a tokenizer;
- the positional embedding matrix;
- (optional) token regression labels. These are maintained mostly for reproducibility of our paper's results and can be left as `None` in most circumstances.
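The core job of such a collate function is padding: variable-length token sequences must be padded to a common length, and the square relative-position matrices padded accordingly. The sketch below illustrates that padding logic for the first two tuple elements; it is not the bundled `collate_for_dataset` implementation, and the pad token id is an assumption:

```python
import numpy as np

def collate_sketch(batch, pad_token_id=0):
    """Pad a batch of (encoding_dict, rel_pos_matrix) samples to a common length.

    Illustrative only; the bundled `collate_for_dataset` may differ in details
    (e.g. label handling, tensor types, padding value).
    """
    max_len = max(len(enc["input_ids"]) for enc, _ in batch)
    ids = np.full((len(batch), max_len), pad_token_id, dtype=np.int64)
    mask = np.zeros((len(batch), max_len), dtype=np.int64)
    rel = np.zeros((len(batch), max_len, max_len), dtype=np.int16)
    for i, (enc, pos) in enumerate(batch):
        length = len(enc["input_ids"])
        ids[i, :length] = enc["input_ids"]
        mask[i, :length] = enc["attention_mask"]
        rel[i, :length, :length] = pos  # pad the square distance matrix
    return {"input_ids": ids, "attention_mask": mask, "relative_position": rel}
```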
## Architecture

| Hyper-parameter | Value |
|---|---|
| `d_model` | 512 |
| `d_ff` | 2048 |
| `d_kv` | 64 |
| `num_layers` | 24 |
| `num_heads` | 12 |
| `vocab_size` | 1032 |
| `feed_forward_proj` | `gated-gelu` |
| `relative_attention_num_buckets` | 32 |
| `relative_attention_max_distance` | 96 |
Position biases are replaced by molecular-graph distances computed with RDKit and binned with a modified version of T5's logarithmic binning algorithm, making the model aware of molecular topology without enforcing strict sensitivity to exact distances.
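For intuition, T5's standard logarithmic binning assigns one bucket per exact small distance and shares log-spaced buckets for larger ones. Below is a sketch of that standard scheme instantiated with this model's hyperparameters (32 buckets, max distance 96); the model's actual "modified" variant may differ:

```python
import math

def distance_bucket(distance, num_buckets=32, max_distance=96):
    """Map a non-negative molecular-graph distance to a relative-attention bucket.

    Sketch of T5's standard logarithmic binning, NOT the model's exact variant.
    """
    max_exact = num_buckets // 2  # first half of buckets: one per exact distance
    if distance < max_exact:
        return distance
    # remaining buckets are logarithmically spaced up to max_distance
    bucket = max_exact + int(
        math.log(distance / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return min(bucket, num_buckets - 1)  # clamp distances beyond max_distance
```

Distances 0–15 map to buckets 0–15 exactly, while distances 16–96 share the remaining 16 buckets on a log scale, which is why the model stays topology-aware without resolving every long-range distance precisely.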
## Tasks

Pretraining consists of up to 1085 tasks across five regression heads. Tasks are grouped by source and prediction target:

### Group 0 — General molecular descriptors (RDKit)

| Task | Description |
|---|---|
| `MW` | Molecular weight |
| `TDM` | Total dipole moment |
### Group 1 — Physicochemical properties (RDKit)

| Task | Description |
|---|---|
| `MolLogP` | Wildman–Crippen LogP estimate |
| `MolMR` | Wildman–Crippen molar refractivity |
| `TPSA` | Topological polar surface area |
| `FractionCSP3` | Fraction of sp³ carbons |
### Group 2 — Frontier orbital energies (PubChemQC B3LYP/PM6)

Alpha and beta spin-orbital energies from DFT calculations:

| Task | Description |
|---|---|
| `energy_alpha_homo` | Alpha HOMO energy |
| `energy_alpha_gap` | Alpha HOMO–LUMO gap |
| `energy_alpha_lumo` | Alpha LUMO energy |
| `energy_beta_homo` | Beta HOMO energy |
| `energy_beta_gap` | Beta HOMO–LUMO gap |
| `energy_beta_lumo` | Beta LUMO energy |
### Group 3 — Orbital energies (PubChemQC B3LYP/PM6)

50 linearly sampled energies (`orbital_0` … `orbital_49`) spanning each molecule's full orbital spectrum, predicted at the sequence level.
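One plausible way to draw 50 linearly spaced indices from a molecule's full orbital spectrum is shown below; the exact sampling scheme used to build the labels is not documented here, so treat this as an assumption:

```python
import numpy as np

def sample_orbital_indices(n_orbitals, n_samples=50):
    """Linearly sample `n_samples` orbital indices spanning the full spectrum.

    Assumed scheme for illustration; the actual label construction may differ.
    """
    return np.linspace(0, n_orbitals - 1, n_samples).round().astype(int)
```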
### Group 4 — Atom Löwdin charges (PubChemQC B3LYP/PM6)

Up to 1023 partial charges (`lowdin_0` … `lowdin_1022`), one per atom, predicted using each atom's corresponding output token embedding. This head covers well beyond the maximum number of atoms observed in the dataset; in practice, our training set covers up to `lowdin_149`.
## Dataset

The model is pretrained on a processed version of the
The model is pretrained on a processed version of the
PubChemQC B3LYP/PM6 dataset.
The raw database exposes a b3lyp_pm6 table (columns: cid, state, data as JSON). Data was extracted,
invalid SMILES removed, relevant features selected, and saved in compressed HDF5 format. Duplicate
SMILES were intentionally retained to allow the model to encounter molecules with multiple conformers
and learn a soft compromise across them. This trades auxiliary-task accuracy for richer structural
representations. Molecules incompatible with strict SELFIES encoding were discarded.
The processed dataset contains 82,686,706 SMILES sequences, each paired with a full set of labels across all tasks. It is split by scaffold:
| Split | Sequences | Tokens (approx.) |
|---|---|---|
| Train | 66,149,364 | ~2.5 B (×2 with augmentation → ~5 B) |
| Validation | 8,268,673 | tbd |
| Test | 8,268,669 | ~ 0.82 B (×2 with augmentation → ~1.64 B) |
Training uses augmentation: SELFIES are generated from randomly traversed versions of the original SMILES by the bundled `get_positional_encodings_and_align` method. Labels are normalized before training.
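The card does not specify the normalization scheme; a common, assumed choice for multi-task regression is per-task z-score standardization using training-split statistics, sketched below:

```python
import numpy as np

def zscore_normalize(labels, eps=1e-8):
    """Per-task standardization using training-set statistics.

    The card only says labels are normalized; z-scoring per task is an
    assumed, common choice, not necessarily the exact scheme used.
    labels: (n_samples, n_tasks) array of raw regression targets.
    """
    mean = labels.mean(axis=0)
    std = labels.std(axis=0)
    normalized = (labels - mean) / (std + eps)
    return normalized, mean, std  # keep mean/std to de-normalize predictions
```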
The HDF5 files containing the training data will be available for download below (coming soon). These files are used for training our model, but are first converted to `.lmdb` format through the `data_processing` library in our GitHub repository (coming soon) to ensure fast access and avoid CPU bottlenecks. The resulting LMDB files are currently too large to distribute directly, as input pre-computation (relative position encodings, input IDs, attention masks, and regression labels with augmentation) is performed ahead of time.
| Split | Download |
|---|---|
| Train | train.h5 |
| Validation | validation.h5 |
| Test | test.h5 |
## Limitations

- **Token length:** The built-in `prepare_data` helper encodes pairwise molecular-graph distances in an `int16` matrix to reduce the memory footprint when distance matrices are pre-computed before training. Because of this, molecules whose SELFIES tokenization exceeds 32,766 tokens (`numpy.iinfo(numpy.int16).max - 1`) are not supported. In practice, most molecules lie well below this limit.
- **Conformer handling:** Duplicate SMILES representing different conformers are kept in the dataset. The model therefore predicts an implicit average over conformers rather than a geometry-specific value, which may reduce accuracy for conformation-sensitive properties.
- **Scope:** The model is pretrained on molecules present in PubChemQC. Performance on compound types and large macromolecules outside the training distribution has not been evaluated; the model is therefore expected to be stronger on molecules with MW ≤ 1000 or at most 79 heavy atoms.
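The token-length limit and memory trade-off above can be made concrete: an `int16` entry takes 2 bytes versus 8 for a default `int64`, a 4× saving per pre-computed distance matrix. The helper below is illustrative, not part of the released API:

```python
import numpy as np

# Maximum supported SELFIES token length, per the int16 encoding in `prepare_data`.
MAX_TOKENS = np.iinfo(np.int16).max - 1  # 32766

def distance_matrix_nbytes(n_tokens, dtype=np.int16):
    """Memory footprint in bytes of one pairwise-distance matrix.

    Hypothetical helper for illustration; raises if the sequence exceeds
    the int16-imposed token limit described above.
    """
    if n_tokens > MAX_TOKENS:
        raise ValueError(f"{n_tokens} tokens exceeds the int16 limit of {MAX_TOKENS}")
    return n_tokens * n_tokens * np.dtype(dtype).itemsize
```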