AI4PD/Mol2Pro-tokenizer


VicenteAlex committed 2 days ago

Commit db0d949 (verified) · 1 parent: f2973dd

Init tokenizer card

Files changed (1)

README.md +66 -3


@@ -1,3 +1,66 @@
----
-license: apache-2.0
----

+---
+language: en
+library_name: transformers
+tags:
+ - tokenizer
+ - smiles
+ - protein
+ - molecule-to-protein
+license: apache-2.0
+---
+# Mol2Pro-tokenizer
+#### Paper: [`Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data`](https://doi.org/10.64898/2026.02.06.704305)
+## Tokenizer description
+This repository provides the **paired tokenizers** used by Mol2Pro models:
+- **`smiles/`**: tokenizer for molecule inputs (SMILES) used on the **encoder** side.
+- **`aa/`**: tokenizer for protein sequence outputs used on the **decoder** side.
+
+The two tokenizers are designed to be used together with the Mol2Pro sequence-to-sequence checkpoints (see the model card: [`AI4PD/Mol2Pro-base`](https://huggingface.co/AI4PD/Mol2Pro-base)).
+## Offset vocabulary
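+Mol2Pro uses an offset token-id scheme so that SMILES tokens and amino-acid tokens do not collide in id space. This avoids sharing embeddings between identical token strings: a string such as `C`, for example, denotes a carbon atom in SMILES but cysteine in a protein sequence.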
+- The **AA** tokenizer uses its natural token id space.
+- The **SMILES** tokenizer vocabulary ids are offset above the AA vocabulary ids.
+## How to use
+```python
+from transformers import AutoTokenizer
+tokenizer_id = "AI4PD/Mol2Pro-tokenizer"
+# Load tokenizers
+tokenizer_mol = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="smiles")
+tokenizer_aa  = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="aa")
+# Example: tokenize a SMILES string (ethanol) for the encoder
+smiles = "CCO"
+enc = tokenizer_mol(smiles, return_tensors="pt")
+print("Encoder token ids:", enc.input_ids[0].tolist())
+print("Encoder tokens:", tokenizer_mol.convert_ids_to_tokens(enc.input_ids[0]))
+# Decode amino-acid token ids back to text (the ids here are arbitrary examples)
+decoded = tokenizer_aa.decode([0, 1, 2], skip_special_tokens=True)
+print("Decoded protein sequence:", decoded)
+```
+## Citation
+If you find this work useful, please cite:
+```bibtex
+@article{VicenteSola2026Generalise,
+  title   = {Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data},
+  author  = {Vicente-Sola, Alex and Dornfeld, Lars and Coines, Joan and Ferruz, Noelia},
+  journal = {bioRxiv},
+  year    = {2026},
+  doi     = {10.64898/2026.02.06.704305},
+}
+```
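
The offset token-id scheme described under "Offset vocabulary" in the README can be illustrated with a minimal, self-contained sketch. The two toy vocabularies and the helper `offset_vocab` below are hypothetical and chosen only for illustration. The card does not list the real Mol2Pro vocabularies or the actual offset value, so this sketch simply assumes the SMILES ids are placed directly above the largest AA id.

```python
# Hypothetical toy vocabularies; the real Mol2Pro vocabularies are not reproduced here.
aa_vocab = {"<pad>": 0, "<eos>": 1, "A": 2, "C": 3, "N": 4}
smiles_local = {"<pad>": 0, "C": 1, "N": 2, "O": 3, "=": 4}


def offset_vocab(vocab, offset):
    """Return a copy of `vocab` with every token id shifted up by `offset`."""
    return {token: idx + offset for token, idx in vocab.items()}


# Assumption: SMILES ids start immediately above the largest AA id.
offset = max(aa_vocab.values()) + 1
smiles_vocab = offset_vocab(smiles_local, offset)

# The two id ranges are disjoint, so "C" (carbon) and "C" (cysteine)
# map to different ids and therefore to different embedding rows.
assert set(aa_vocab.values()).isdisjoint(smiles_vocab.values())
print("AA id for 'C':", aa_vocab["C"])
print("SMILES id for 'C':", smiles_vocab["C"])
```

With disjoint ranges, a single embedding table can serve both sides of the model without the identical string `C` sharing one row across the two alphabets.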