VicenteAlex committed on
Commit db0d949 · verified · 1 Parent(s): f2973dd

Init tokenizer card

  1. README.md +66 -3
README.md CHANGED
@@ -1,3 +1,66 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ language: en
3
+ library_name: transformers
4
+ tags:
5
+ - tokenizer
6
+ - smiles
7
+ - protein
8
+ - molecule-to-protein
9
+ license: apache-2.0
10
+ ---
11
+
12
+ # Mol2Pro-tokenizer
13
+ #### Paper: [`Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data`](https://doi.org/10.64898/2026.02.06.704305)
14
+
15
+ ## Tokenizer description
16
+
17
+ This repository provides the **paired tokenizers** used by Mol2Pro models:
18
+
19
+
20
+ - **`smiles/`**: tokenizer for molecule inputs (SMILES) used on the **encoder** side.
21
+ - **`aa/`**: tokenizer for protein sequence outputs used on the **decoder** side.
22
+
23
+ The two tokenizers are designed to be used together with the Mol2Pro sequence-to-sequence checkpoints (see the model card: [`AI4PD/Mol2Pro-base`](https://huggingface.co/AI4PD/Mol2Pro-base)).
24
+
25
+ ## Offset vocabulary
26
+
27
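+ Mol2Pro uses an offset token-id scheme so that SMILES tokens and amino-acid tokens do not collide in id space. This prevents token strings that appear in both vocabularies from sharing an embedding (for example, `C` denotes carbon in SMILES and cysteine in a protein sequence).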
28
+
29
+ - The **AA** tokenizer keeps its native, unshifted token ids.
30
+ - The **SMILES** tokenizer's ids are shifted so that they all lie above the AA vocabulary ids.
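
The offset scheme described above can be illustrated with a minimal sketch. The toy vocabularies and the helper `build_offset_vocab` are hypothetical and are not taken from the Mol2Pro code; the sketch also assumes the offset is one more than the largest AA id, whereas the actual offset is whatever the tokenizer files in this repository define.

```python
# Toy vocabularies (hypothetical; the real Mol2Pro vocabularies differ).
aa_vocab = {"<pad>": 0, "<eos>": 1, "A": 2, "C": 3, "N": 4}
smiles_local = {"<pad>": 0, "C": 1, "N": 2, "O": 3, "=": 4}


def build_offset_vocab(local_vocab, offset):
    """Return a copy of local_vocab with every id shifted up by offset."""
    return {token: idx + offset for token, idx in local_vocab.items()}


# Assumption: SMILES ids start right after the largest AA id.
offset = max(aa_vocab.values()) + 1
smiles_vocab = build_offset_vocab(smiles_local, offset)

# "C" exists in both vocabularies, but after the shift it maps to two
# different ids, so it would index two different embedding rows.
print("AA id for C:", aa_vocab["C"])
print("SMILES id for C:", smiles_vocab["C"])
```

With this layout a single embedding table can serve both sides while still keeping separate rows for identical strings.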
31
+
32
+ ## How to use
33
+
34
+ ```python
35
+ from transformers import AutoTokenizer
36
+
37
+ tokenizer_id = "AI4PD/Mol2Pro-tokenizer"
38
+
39
+ # Load tokenizers
40
+ tokenizer_mol = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="smiles")
41
+ tokenizer_aa = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="aa")
42
+
43
+ # Example: tokenize a SMILES string (ethanol)
44
+ smiles = "CCO"
45
+ enc = tokenizer_mol(smiles, return_tensors="pt")
46
+ print("Encoder token ids:", enc.input_ids[0].tolist())
47
+ print("Encoder tokens:", tokenizer_mol.convert_ids_to_tokens(enc.input_ids[0]))
48
+
49
+ aa_text = tokenizer_aa.decode([0, 1, 2], skip_special_tokens=True)
50
+ print("Decoded protein sequence:", aa_text)
51
+ ```
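
One way to sanity-check the offset scheme is to compare the two token-to-id mappings. The helper `check_offset_scheme` below is a hypothetical sketch, shown on toy dictionaries so that it runs without downloading anything. With the tokenizers loaded as above, the same function could be given `tokenizer_aa.get_vocab()` and `tokenizer_mol.get_vocab()`, assuming the saved SMILES vocabulary already carries the offset ids.

```python
def check_offset_scheme(aa_vocab, smiles_vocab):
    """Compare two token->id mappings.

    Returns a tuple (disjoint, shared_tokens):
      disjoint       True if no id is used by both vocabularies
      shared_tokens  sorted token strings present in both vocabularies
    """
    aa_ids = set(aa_vocab.values())
    smiles_ids = set(smiles_vocab.values())
    disjoint = aa_ids.isdisjoint(smiles_ids)
    shared_tokens = sorted(set(aa_vocab) & set(smiles_vocab))
    return disjoint, shared_tokens


# Toy stand-ins for the real vocabularies (hypothetical values).
toy_aa = {"A": 0, "C": 1, "N": 2}
toy_smiles = {"C": 3, "N": 4, "O": 5}

disjoint, shared_tokens = check_offset_scheme(toy_aa, toy_smiles)
print("Id spaces disjoint:", disjoint)
print("Token strings present in both:", shared_tokens)
```

If the id spaces are disjoint, every shared token string is guaranteed to map to a different id on each side.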
52
+
53
+ ## Citation
54
+
55
+ If you find this work useful, please cite:
56
+
57
+ ```bibtex
58
+ @article{VicenteSola2026Generalise,
59
+ title = {Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data},
60
+ author = {Vicente-Sola, Alex and Dornfeld, Lars and Coines, Joan and Ferruz, Noelia},
61
+ journal = {bioRxiv},
62
+ year = {2026},
63
+ doi = {10.64898/2026.02.06.704305},
64
+ }
65
+ ```
66
+