---
library_name: transformers
tags:
- chemistry
- molecular-property-prediction
- selfies
- encoder
license: apache-2.0
---
# M5 Encoder
A SELFIES-based molecular encoder built on a T5 backbone with custom
distance-aware relative position encodings. Two classes are available:
| Class | Description |
|---|---|
| `M5Encoder` | Bare encoder, outputs `last_hidden_state` |
| `M5ModelForRegression` | Encoder + sequence-level and token-level regression heads |
The model is pretrained on multi-task regression tasks, including quantum chemistry (QC) tasks
from the [PubChemQC B3LYP/PM6 dataset](https://nakatamaho.riken.jp/pubchemqc.riken.jp/b3lyp_pm6_datasets.html).
## Requirements
**This model was implemented and tested with Transformers version 4.51.3; other versions may cause issues.**
## Usage
```python
from transformers import AutoConfig, AutoModel
config = AutoConfig.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
model = AutoModel.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
```
To load `M5ModelForRegression` explicitly:
```python
from transformers import AutoModelForSequenceClassification
regression_model = AutoModelForSequenceClassification.from_pretrained(
"IlPakoZ/m5-encoder", trust_remote_code=True
)
```
### Preparing inputs
Inputs require SELFIES tokenization **and** a precomputed distance matrix
(`relative_position`). Use the helper bundled in the repo:
```python
import torch
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
smiles = "CCO"
# seed=0 yields the canonical SELFIES; other seeds give reproducible random variants
selfies, pos_encod, _ = model.get_positional_encodings_and_align(smiles, seed=0)  # `model` loaded in the snippet above
encoding = tokenizer(selfies, return_tensors="pt")
input_ids = encoding["input_ids"]
attn_mask = encoding["attention_mask"]
rel_pos = torch.tensor(pos_encod).unsqueeze(0) # (1, seq_len, seq_len)
outputs = model(input_ids=input_ids, attention_mask=attn_mask, relative_position=rel_pos)
hidden = outputs.last_hidden_state # (1, seq_len, 512)
```
A helper ``model.collate_for_dataset`` is also available to collate batches for use in PyTorch's ``DataLoader``. It takes a list of tuples, each composed of:
- a dictionary with keys ``"input_ids"`` (``np.ndarray``, shape ``(L,)``) and ``"attention_mask"`` (``np.ndarray``, shape ``(L,)``), as produced by the tokenizer;
- the positional embedding matrix;
- (optional) token regression labels. These are kept mainly for reproducibility of our paper's results and can be left as ``None`` in most circumstances.
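For reference, the padding such a collate step performs can be sketched in plain NumPy. This is an illustrative stand-in, not the bundled `collate_for_dataset` implementation; the function name and `pad_id` default are hypothetical:

```python
import numpy as np

def pad_batch(batch, pad_id=0):
    """Pad variable-length samples to the longest sequence in the batch.

    Each sample is (features_dict, relative_position_matrix), mirroring the
    tuple layout described above (token labels omitted for brevity).
    """
    max_len = max(len(feats["input_ids"]) for feats, _ in batch)
    input_ids, attn_masks, rel_pos = [], [], []
    for feats, pos in batch:
        length = len(feats["input_ids"])
        input_ids.append(np.pad(feats["input_ids"], (0, max_len - length),
                                constant_values=pad_id))
        attn_masks.append(np.pad(feats["attention_mask"], (0, max_len - length)))
        # Pad the (L, L) distance matrix on both axes.
        rel_pos.append(np.pad(pos, ((0, max_len - length), (0, max_len - length))))
    return np.stack(input_ids), np.stack(attn_masks), np.stack(rel_pos)
```

The padded arrays can then be converted to tensors before being passed to the model.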
## Architecture
| Hyper-parameter | Value |
|---|---|
| `d_model` | 512 |
| `d_ff` | 2048 |
| `d_kv` | 64 |
| `num_layers` | 24 |
| `num_heads` | 12 |
| `vocab_size` | 1,032 |
| `feed_forward_proj` | gated-gelu |
| `relative_attention_num_buckets` | 32 |
| `relative_attention_max_distance` | 96 |
Position biases are replaced by molecular-graph distances computed
with RDKit and binned with a modified T5 logarithmic binning scheme, giving the model awareness of molecular topology without being overly sensitive to exact distances.
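The binning idea can be illustrated with a T5-style scheme. This is a sketch of the general approach using the hyperparameters in the table above, not the model's exact implementation: small distances each get their own bucket, while larger distances are mapped logarithmically up to `relative_attention_max_distance`:

```python
import numpy as np

def bucket_distance(d, num_buckets=32, max_distance=96):
    """Map a non-negative graph distance to a bucket, T5-style.

    Half the buckets hold exact small distances; the rest cover larger
    distances on a logarithmic scale, capped at max_distance.
    """
    d = np.asarray(d)
    max_exact = num_buckets // 2
    is_small = d < max_exact
    # Logarithmic mapping for distances in [max_exact, max_distance).
    log_bucket = max_exact + (
        np.log(np.maximum(d, 1) / max_exact)
        / np.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).astype(np.int64)
    log_bucket = np.minimum(log_bucket, num_buckets - 1)
    return np.where(is_small, d, log_bucket)
```

Distances beyond `max_distance` all land in the final bucket, which keeps the bias table small while preserving fine resolution for nearby atoms.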
## Tasks
Pretraining consists of up to 1085 tasks across five regression heads. Tasks are grouped by source and prediction target:
### Group 0 — General molecular descriptors (RDKit)
| Task | Description |
|---|---|
| `MW` | Molecular weight |
| `TDM` | Total dipole moment |
### Group 1 — Physicochemical properties (RDKit)
| Task | Description |
|---|---|
| `MolLogP` | Wildman-Crippen LogP estimate |
| `MolMR` | Wildman-Crippen molar refractivity |
| `TPSA` | Topological polar surface area |
| `FractionCSP3` | Fraction of sp³ carbons |
### Group 2 — Frontier orbital energies (PubChemQC B3LYP/PM6)
Alpha and beta spin-orbital energies from DFT calculations:
| Task | Description |
|---|---|
| `energy_alpha_homo` | Alpha HOMO energy |
| `energy_alpha_gap` | Alpha HOMO–LUMO gap |
| `energy_alpha_lumo` | Alpha LUMO energy |
| `energy_beta_homo` | Beta HOMO energy |
| `energy_beta_gap` | Beta HOMO–LUMO gap |
| `energy_beta_lumo` | Beta LUMO energy |
### Group 3 — Orbital energies (PubChemQC B3LYP/PM6)
50 linearly sampled energies (`orbital_0` … `orbital_49`) spanning each molecule's full orbital spectrum, predicted at the sequence level.
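The linear sampling can be pictured as choosing 50 evenly spaced indices over a molecule's sorted orbital-energy list. This is an illustrative sketch; the exact selection code is not part of this card:

```python
import numpy as np

def sample_orbitals(energies, n=50):
    """Pick n evenly spaced orbital energies spanning the full spectrum."""
    energies = np.asarray(energies)
    idx = np.linspace(0, len(energies) - 1, n).round().astype(int)
    return energies[idx]
```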
### Group 4 — Atom Löwdin charges (PubChemQC B3LYP/PM6)
Up to 1023 partial charges (`lowdin_0` … `lowdin_1022`), one per atom, predicted from each atom's corresponding output token embedding. This head covers far more atoms than the maximum observed in the dataset; in practice, our training set only populates labels up to `lowdin_149`.
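Conceptually, the token-level head maps each atom token's final hidden state to a scalar charge. The following is a simplified sketch with a random linear head and made-up sequence length, not the actual head in `M5ModelForRegression`:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 12, 512            # d_model matches the architecture table
hidden = rng.normal(size=(seq_len, d_model))        # encoder output for one molecule
w = rng.normal(size=(d_model,)) / np.sqrt(d_model)  # linear head weights (illustrative)
b = 0.0
charges = hidden @ w + b  # one predicted charge per token; atom tokens are the ones used
assert charges.shape == (seq_len,)
```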
## Dataset
The model is pretrained on a processed version of the
[PubChemQC B3LYP/PM6 dataset](https://nakatamaho.riken.jp/pubchemqc.riken.jp/b3lyp_pm6_datasets.html).
The raw database exposes a `b3lyp_pm6` table (columns: `cid`, `state`, `data` as JSON). Data was extracted,
invalid SMILES removed, relevant features selected, and saved in compressed HDF5 format. Duplicate
SMILES were intentionally retained to allow the model to encounter molecules with multiple conformers
and learn a soft compromise across them. This trades auxiliary-task accuracy for richer structural
representations. Molecules incompatible with strict SELFIES encoding were discarded.
The processed dataset contains **82,686,706 SMILES sequences**, each paired with a full set of labels across all tasks. It is split by scaffold:
| Split | Sequences | Tokens (approx.) |
|---|---|---|
| Train | 66,149,364 | ~2.5 B (×2 with augmentation → ~5 B) |
| Validation | 8,268,673 | tbd |
| Test | 8,268,669 | ~0.82 B (×2 with augmentation → ~1.64 B) |
Training is performed with augmentation: SELFIES are generated from randomly traversed versions of the original SMILES by the bundled `get_positional_encodings_and_align` method. Labels are normalized before training.
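The label normalization can be read as a standard per-task z-score over the training split. This is an assumption for illustration; the card does not specify the exact scheme:

```python
import numpy as np

def normalize_labels(y, eps=1e-8):
    """Per-task z-score normalization over the training split.

    y: (num_samples, num_tasks) label matrix. Returns normalized labels
    plus the statistics needed to invert the transform at inference time.
    """
    mean = y.mean(axis=0)
    std = y.std(axis=0)
    return (y - mean) / (std + eps), mean, std
```

At inference time, predictions are mapped back with `pred * std + mean`.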
The HDF5 files containing the training data will be available for download below (**coming soon**). For training, these files are first converted to .lmdb format with the `data_processing` library in our GitHub repository (**coming soon**) to ensure fast access and avoid CPU bottlenecks. Because input pre-computation (relative position encodings, input ids, attention masks, and regression labels with augmentation) is performed during conversion, the resulting LMDB files are currently too large to distribute directly.
| Split | Download |
|---|---|
| Train | [train.h5](#) |
| Validation | [validation.h5](#) |
| Test | [test.h5](#) |
## Limitations
- **Token length:** The built-in `prepare_data` helper encodes pairwise molecular-graph distances in an `int16` matrix to reduce the memory footprint when distance matrices are pre-computed before training. As a result, molecules whose SELFIES tokenization exceeds **32,766 tokens** (`numpy.iinfo(numpy.int16).max - 1`) are not supported. In practice, most molecules lie well below this limit.
- **Conformer handling:** Duplicate SMILES representing different conformers are kept in the dataset. The model therefore predicts an implicit average over conformers rather than a geometry-specific value, which may reduce accuracy for conformation-sensitive properties.
- **Scope:** The model is pretrained on molecules present in PubChemQC. Performance on compound types and large macromolecules outside the training distribution has not been evaluated; the model is therefore expected to be stronger on molecules with **MW <= 1000** or **number of heavy atoms <= 79**.
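A quick guard for the token-length limit described above could look like the following. The helper name is hypothetical; only the `numpy.iinfo` bound comes from the card:

```python
import numpy as np

INT16_LIMIT = np.iinfo(np.int16).max - 1  # 32,766, as stated above

def check_length(selfies_tokens):
    """Raise if a tokenized SELFIES sequence exceeds the int16 distance-matrix limit."""
    n = len(selfies_tokens)
    if n > INT16_LIMIT:
        raise ValueError(
            f"Sequence of {n} tokens exceeds the {INT16_LIMIT}-token "
            "limit imposed by the int16 pairwise-distance matrix."
        )
```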