Mol2Pro-tokenizer

Paper: `Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data`

Tokenizer description

This repository provides the paired tokenizers used by Mol2Pro models:

smiles/: tokenizer for molecule inputs (SMILES) used on the encoder side.
aa/: tokenizer for protein sequence outputs used on the decoder side.

The two tokenizers are designed to be used together with the Mol2Pro sequence-to-sequence checkpoints (see the model card: AI4PD/Mol2Pro-base).

Offset vocabulary

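Mol2Pro uses an offset token-id scheme so that SMILES tokens and amino-acid tokens do not collide in id space. This avoids sharing embeddings between tokens whose strings are identical in both vocabularies (for example, `C` is carbon in SMILES and cysteine in a protein sequence).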

The AA tokenizer uses its natural token id space.
The SMILES tokenizer vocabulary ids are offset above the AA vocabulary ids.
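
The offset scheme described above can be illustrated with a small sketch. The vocabularies below are hypothetical toy examples, not the actual Mol2Pro vocabularies, and the offset is assumed to equal the AA vocabulary size; the real tokenizer files define the actual values.

```python
# Toy AA vocabulary in its natural id space (hypothetical, for illustration only).
aa_vocab = {"<pad>": 0, "<eos>": 1, "A": 2, "C": 3, "N": 4, "S": 5}

# Hypothetical SMILES tokens, listed in their local order before offsetting.
smiles_tokens = ["C", "N", "S", "O", "=", "("]

# Assumption: the offset equals the size of the AA vocabulary,
# so every SMILES id lands above every AA id.
offset = len(aa_vocab)
smiles_vocab = {tok: offset + i for i, tok in enumerate(smiles_tokens)}

# Token strings that appear in both vocabularies.
shared_strings = sorted(set(aa_vocab) & set(smiles_vocab))

# Ids that appear in both vocabularies (expected to be empty).
colliding_ids = set(aa_vocab.values()) & set(smiles_vocab.values())

print("Shared token strings:", shared_strings)
print("AA id for 'C':", aa_vocab["C"], "| SMILES id for 'C':", smiles_vocab["C"])
print("Colliding ids:", colliding_ids)
```

In this toy setup the string `C` exists in both vocabularies but maps to two different ids, so a single embedding table indexed by id would hold separate rows for the amino acid and the atom.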

How to use

from transformers import AutoTokenizer

tokenizer_id = "AI4PD/Mol2Pro-tokenizer"

# Load tokenizers
tokenizer_mol = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="smiles")
tokenizer_aa  = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="aa")

# Example:
smiles = "CCO"
enc = tokenizer_mol(smiles, return_tensors="pt")
print("Encoder token ids:", enc.input_ids[0].tolist())
print("Encoder tokens:", tokenizer_mol.convert_ids_to_tokens(enc.input_ids[0]))

# Decode AA token ids back into a protein sequence (example ids)
aa_text = tokenizer_aa.decode([0, 1, 2], skip_special_tokens=True)
print("Decoded protein sequence:", aa_text)
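
To check that the two loaded tokenizers occupy separate id ranges, one option is to compare the ids returned by `get_vocab()`. The following is a minimal sketch: `id_range`, `ranges_overlap`, and `StubTokenizer` are hypothetical helpers introduced here, and the stub stands in for the real tokenizers so the example runs without downloading anything. Whether special tokens are shared between the two tokenizers is not stated above, so an overlap on the real tokenizers would need to be inspected rather than treated as an error.

```python
def id_range(tokenizer):
    """Return (min_id, max_id) over the tokenizer's vocabulary."""
    ids = tokenizer.get_vocab().values()
    return min(ids), max(ids)


def ranges_overlap(tok_a, tok_b):
    """Return True if the two tokenizers' id ranges intersect."""
    lo_a, hi_a = id_range(tok_a)
    lo_b, hi_b = id_range(tok_b)
    return lo_a <= hi_b and lo_b <= hi_a


class StubTokenizer:
    """Hypothetical stand-in exposing only get_vocab(), as Hugging Face tokenizers do."""

    def __init__(self, vocab):
        self._vocab = dict(vocab)

    def get_vocab(self):
        return dict(self._vocab)


# Toy vocabularies (not the real Mol2Pro ones).
stub_aa = StubTokenizer({"<pad>": 0, "A": 1, "C": 2})
stub_mol = StubTokenizer({"C": 3, "O": 4, "=": 5})

print("AA id range:", id_range(stub_aa))
print("SMILES id range:", id_range(stub_mol))
print("Ranges overlap:", ranges_overlap(stub_aa, stub_mol))
```

With the real tokenizers, the same functions could be called as `ranges_overlap(tokenizer_aa, tokenizer_mol)`.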

Citation

If you find this work useful, please cite:

@article{VicenteSola2026Generalise,
  title   = {Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data},
  author  = {Vicente-Sola, Alex and Dornfeld, Lars and Coines, Joan and Ferruz, Noelia},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.02.06.704305},
}
