---
library_name: transformers
tags:
- chemistry
- molecular-property-prediction
- selfies
- encoder
license: apache-2.0
---
# M5 Encoder
A SELFIES-based molecular encoder built on a T5 backbone with custom
distance-aware relative position encodings. Two classes are available:
| Class | Description |
|---|---|
| `M5Encoder` | Bare encoder, outputs `last_hidden_state` |
| `M5ModelForRegression` | Encoder + sequence-level and token-level regression heads |
The model is pretrained on multi-task regression tasks, including quantum chemistry (QC) tasks
from the [PubChemQC B3LYP/PM6 dataset](https://nakatamaho.riken.jp/pubchemqc.riken.jp/b3lyp_pm6_datasets.html).
## Requirements
**This model was implemented and tested with Transformers 4.51.3; issues may appear with other versions.**
## Usage
```python
from transformers import AutoConfig, AutoModel
config = AutoConfig.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
model = AutoModel.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
```
To load `M5ModelForRegression` explicitly:
```python
from transformers import AutoModelForSequenceClassification
regression_model = AutoModelForSequenceClassification.from_pretrained(
"IlPakoZ/m5-encoder", trust_remote_code=True
)
```
### Preparing inputs
Inputs require SELFIES tokenization **and** a precomputed distance matrix
(`relative_position`). Use the helper bundled in the repo:
```python
import torch
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("IlPakoZ/m5-encoder", trust_remote_code=True)
smiles = "CCO"
# seed = 0 produces the canonical SELFIES, other values generate random reproducible variations
selfies, pos_encod, _ = model.get_positional_encodings_and_align(smiles, seed=0)
encoding = tokenizer(selfies, return_tensors="pt")
input_ids = encoding["input_ids"]
attn_mask = encoding["attention_mask"]
rel_pos = torch.tensor(pos_encod).unsqueeze(0) # (1, seq_len, seq_len)
outputs = model(input_ids=input_ids, attention_mask=attn_mask, relative_position=rel_pos)
hidden = outputs.last_hidden_state # (1, seq_len, 512)
```
A helper, ``model.collate_for_dataset``, is also available to perform collation for use with PyTorch's `DataLoader`. It takes a list of tuples, each composed of:
- a dictionary with keys ``"input_ids"`` (``np.ndarray``, shape ``(L,)``) and ``"attention_mask"`` (``np.ndarray``, shape ``(L,)``), as produced by the tokenizer;
- the positional encoding matrix for the sequence;
- (optional) token-level regression labels. These are kept mainly to reproduce our paper's results and can be left as ``None`` in most circumstances.
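To illustrate the batch format this collation produces, here is a minimal manual-padding sketch. It is an assumption for illustration only: the bundled `model.collate_for_dataset` should be preferred, and the padding value `0` and output key names are guesses, not the model's documented behavior.

```python
import numpy as np

def pad_batch(batch):
    # batch: list of (encoding_dict, pos_matrix, labels_or_None) tuples,
    # matching the input format described above. Pads everything to the
    # longest sequence in the batch (padding value 0 is an assumption).
    max_len = max(len(enc["input_ids"]) for enc, _, _ in batch)
    ids = np.zeros((len(batch), max_len), dtype=np.int64)
    mask = np.zeros((len(batch), max_len), dtype=np.int64)
    rel = np.zeros((len(batch), max_len, max_len), dtype=np.int64)
    for i, (enc, pos, _labels) in enumerate(batch):
        n = len(enc["input_ids"])
        ids[i, :n] = enc["input_ids"]
        mask[i, :n] = enc["attention_mask"]
        rel[i, :n, :n] = pos  # (seq_len, seq_len) distance matrix
    return {"input_ids": ids, "attention_mask": mask, "relative_position": rel}
```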
## Architecture
| Hyper-parameter | Value |
|---|---|
| `d_model` | 512 |
| `d_ff` | 2048 |
| `d_kv` | 64 |
| `num_layers` | 24 |
| `num_heads` | 12 |
| `vocab_size` | 1032 |
| `feed_forward_proj` | gated-gelu |
| `relative_attention_num_buckets` | 32 |
| `relative_attention_max_distance` | 96 |
Position biases are replaced by molecular-graph distances computed
with RDKit and binned with a modified version of T5's logarithmic binning algorithm, giving the model awareness of molecular topology without being overly sensitive to exact distances.
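To make the binning concrete, here is a sketch of standard T5-style logarithmic bucketing applied to non-negative graph distances, using the `num_buckets=32` and `relative_attention_max_distance=96` values from the table above. The model's modified scheme may differ in detail; this only shows the general idea (exact buckets for small distances, logarithmic buckets for large ones).

```python
import math

def distance_bucket(distance, num_buckets=32, max_distance=96):
    # Small distances get their own bucket (exact resolution);
    # larger distances share logarithmically spaced buckets.
    max_exact = num_buckets // 2
    if distance < max_exact:
        return distance
    bucket = max_exact + int(
        math.log(distance / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return min(bucket, num_buckets - 1)
```

Distances below 16 map to themselves, while anything at or beyond `max_distance` is clipped into the last bucket, so topology is preserved coarsely at long range.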
## Tasks
Pretraining consists of up to 1085 tasks across five regression heads. Tasks are grouped by source and prediction target:
### Group 0 — General molecular descriptors (RDKit)
| Task | Description |
|---|---|
| `MW` | Molecular weight |
| `TDM` | Total dipole moment |
### Group 1 — Physicochemical properties (RDKit)
| Task | Description |
|---|---|
| `MolLogP` | Wildman-Crippen LogP estimate |
| `MolMR` | Wildman-Crippen molar refractivity |
| `TPSA` | Topological polar surface area |
| `FractionCSP3` | Fraction of sp³ carbons |
### Group 2 — Frontier orbital energies (PubChemQC B3LYP/PM6)
Alpha and beta spin-orbital energies from DFT calculations:
| Task | Description |
|---|---|
| `energy_alpha_homo` | Alpha HOMO energy |
| `energy_alpha_gap` | Alpha HOMO–LUMO gap |
| `energy_alpha_lumo` | Alpha LUMO energy |
| `energy_beta_homo` | Beta HOMO energy |
| `energy_beta_gap` | Beta HOMO–LUMO gap |
| `energy_beta_lumo` | Beta LUMO energy |
### Group 3 — Orbital energies (PubChemQC B3LYP/PM6)
50 linearly sampled energies (`orbital_0` through `orbital_49`) spanning each molecule's full orbital spectrum, predicted at the sequence level.
### Group 4 — Atom Löwdin charges (PubChemQC B3LYP/PM6)
Up to 1023 partial charges (`lowdin_0` through `lowdin_1022`), one per atom, predicted using each atom's corresponding output token embedding. This head covers well beyond the maximum number of atoms observed in the dataset. In practice, our training set covers up to `lowdin_149`.
## Dataset
The model is pretrained on a processed version of the
[PubChemQC B3LYP/PM6 dataset](https://nakatamaho.riken.jp/pubchemqc.riken.jp/b3lyp_pm6_datasets.html).
The raw database exposes a `b3lyp_pm6` table (columns: `cid`, `state`, `data` as JSON). Data was extracted,
invalid SMILES removed, relevant features selected, and saved in compressed HDF5 format. Duplicate
SMILES were intentionally retained to allow the model to encounter molecules with multiple conformers
and learn a soft compromise across them. This trades auxiliary-task accuracy for richer structural
representations. Molecules incompatible with strict SELFIES encoding were discarded.
The processed dataset contains **82,686,706 SMILES sequences**, each paired with a full set of labels across all tasks. It is split by scaffold:
| Split | Sequences | Tokens (approx.) |
|---|---|---|
| Train | 66,149,364 | ~2.5 B (×2 with augmentation → ~5 B) |
| Validation | 8,268,673 | tbd |
| Test | 8,268,669 | ~ 0.82 B (×2 with augmentation → ~1.64 B) |
Training is performed with augmentation through SELFIES generated from randomly traversed versions of the original SMILES. This process is done by the method `get_positional_encodings_and_align` bundled in the model. Labels are normalized before training.
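The card only states that labels are normalized; a common choice, shown here purely as an assumption, is per-task standardization, with statistics computed on the training split and reused for validation and test:

```python
import numpy as np

def standardize_labels(y, eps=1e-8):
    # Per-task (per-column) z-score normalization. This is an assumed
    # scheme for illustration; the model's actual normalization may differ.
    mean = np.nanmean(y, axis=0)
    std = np.nanstd(y, axis=0)
    return (y - mean) / (std + eps), mean, std
```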
The HDF5 files containing the training data will be available for download below (**coming soon**). For training, these files are first converted to `.lmdb` format with the `data_processing` library in our GitHub repository (**coming soon**) to ensure fast access and avoid CPU bottlenecks. The resulting LMDB files are currently too large to distribute directly, since input pre-computation (relative position encodings, input ids, attention masks, and regression labels with augmentation) is performed ahead of time.
| Split | Download |
|---|---|
| Train | [train.h5](#) |
| Validation | [validation.h5](#) |
| Test | [test.h5](#) |
## Limitations
- **Token length:** The built-in `prepare_data` helper encodes pairwise molecular-graph distances in an `int16` matrix.
This was done to reduce the memory footprint of pairwise-distance matrices when pre-computing them before training. Because of this limitation, molecules whose SELFIES tokenization exceeds **32,766 tokens** (`numpy.iinfo(numpy.int16).max - 1`) are not supported. In practice, most molecules lie well below this limit.
- **Conformer handling:** Duplicate SMILES representing different conformers are kept in the dataset. The model therefore predicts an implicit average over conformers rather than a geometry-specific value, which may reduce accuracy for conformation-sensitive properties.
- **Scope:** The model is pretrained on molecules present in PubChemQC. Performance on certain compound types and on large macromolecules outside the training distribution has not been evaluated. The model is therefore expected to be strongest on molecules with **MW <= 1000** or **<= 79 heavy atoms**.
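The token-length limit above can be checked before building distance matrices. The guard below is a sketch (the `check_length` helper is hypothetical, not part of the repo); only the `int16` bound itself comes from the card:

```python
import numpy as np

# 32766, the documented prepare_data limit
MAX_TOKENS = np.iinfo(np.int16).max - 1

def check_length(num_tokens):
    # Raise early rather than silently overflowing an int16 distance matrix.
    if num_tokens > MAX_TOKENS:
        raise ValueError(
            f"{num_tokens} tokens exceeds the int16-safe limit of {MAX_TOKENS}"
        )
    return num_tokens
```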