language: en
library_name: transformers
tags:
  - tokenizer
  - smiles
  - protein
  - molecule-to-protein
license: apache-2.0

Mol2Pro-tokenizer

Paper: `Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data`

Tokenizer description

This repository provides the paired tokenizers used by Mol2Pro models:

smiles/: tokenizer for molecule inputs (SMILES) used on the encoder side.
aa/: tokenizer for protein sequence outputs used on the decoder side.

The two tokenizers are designed to be used together with the Mol2Pro sequence-to-sequence checkpoints.

Offset vocabulary

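Mol2Pro uses an offset token-id scheme so that SMILES tokens and amino-acid tokens do not collide in id space. This also avoids sharing embeddings between identical token strings that can occur in both vocabularies, such as `C`, which denotes carbon in SMILES and cysteine in a protein sequence.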

The AA tokenizer uses its natural token id space.
The SMILES tokenizer vocabulary ids are offset above the AA vocabulary ids.
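
The offset scheme can be illustrated with a minimal sketch. The toy vocabularies and the `offset_vocab` helper below are hypothetical and are not the actual Mol2Pro vocabularies; they only show how shifting one vocabulary above the other keeps ids disjoint even when token strings coincide.

```python
# Toy vocabularies (hypothetical; the real Mol2Pro vocabularies differ).
aa_vocab = {"<pad>": 0, "A": 1, "C": 2, "D": 3}
smiles_vocab = {"<pad>": 0, "C": 1, "O": 2, "N": 3}


def offset_vocab(vocab, offset):
    """Return a copy of `vocab` with every id shifted up by `offset`."""
    return {token: idx + offset for token, idx in vocab.items()}


# Shift the SMILES ids so they start right after the largest AA id.
offset = max(aa_vocab.values()) + 1
smiles_offset_vocab = offset_vocab(smiles_vocab, offset)

# "C" exists in both vocabularies but now maps to two different ids,
# so it would index two different embedding rows.
print(aa_vocab["C"], smiles_offset_vocab["C"])
```

In this sketch the AA ids keep their natural range and only the SMILES ids move, which mirrors the description above.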

How to use

from transformers import AutoTokenizer

tokenizer_id = "contributor-anonymous/Mol2Pro-tokenizer"

# Load tokenizers
tokenizer_mol = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="smiles")
tokenizer_aa  = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="aa")

# Example:
smiles = "CCO"
enc = tokenizer_mol(smiles, return_tensors="pt")
print("Encoder token ids:", enc.input_ids[0].tolist())
print("Encoder tokens:", tokenizer_mol.convert_ids_to_tokens(enc.input_ids[0]))

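# Decode decoder-side token ids back to a protein sequence.
# The ids below are placeholders; in practice they come from the model's output.
decoded = tokenizer_aa.decode([0, 1, 2], skip_special_tokens=True)
print("Decoded protein sequence:", decoded)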
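
To check the offset property on the loaded tokenizers, one could compare the id sets returned by `get_vocab()`. The helper `id_ranges_disjoint` below is a hypothetical name, not part of this repository, and the check assumes that each tokenizer's `get_vocab()` reports the offset ids; shared special tokens, if any, may be an exception.

```python
def id_ranges_disjoint(vocab_a, vocab_b):
    """Return True if two token->id mappings share no ids."""
    return set(vocab_a.values()).isdisjoint(vocab_b.values())


# With the tokenizers loaded as above (requires network access), one might run:
#   id_ranges_disjoint(tokenizer_aa.get_vocab(), tokenizer_mol.get_vocab())

# Stand-in dictionaries for illustration only:
print(id_ranges_disjoint({"A": 1, "C": 2}, {"C": 3, "O": 4}))
print(id_ranges_disjoint({"A": 1, "C": 2}, {"C": 2, "O": 3}))
```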
