# Mol2Pro-tokenizer

Paper: *Generalise or Memorise? Benchmarking Ligand-Conditioned Protein Generation from Sequence-Only Data*

## Tokenizer description
This repository provides the paired tokenizers used by Mol2Pro models:

- `smiles/`: tokenizer for molecule inputs (SMILES), used on the encoder side.
- `aa/`: tokenizer for protein sequence outputs, used on the decoder side.

The two tokenizers are designed to be used together with the Mol2Pro sequence-to-sequence checkpoints.
## Offset vocabulary

Mol2Pro uses an offset token-id scheme so that SMILES tokens and amino-acid tokens do not collide in id space. This avoids sharing embeddings between token strings that happen to be identical in the two vocabularies.
- The AA tokenizer uses its natural token id space.
- The SMILES tokenizer vocabulary ids are offset above the AA vocabulary ids.
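You can sanity-check the scheme by comparing the id ranges of the two vocabularies. This is a minimal sketch, assuming the offset applies to every SMILES token (including specials) and that both tokenizers load as shown in the usage section below:

```python
from transformers import AutoTokenizer

tokenizer_id = "contributor-anonymous/Mol2Pro-tokenizer"
tokenizer_mol = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="smiles")
tokenizer_aa = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="aa")

aa_ids = set(tokenizer_aa.get_vocab().values())
smiles_ids = set(tokenizer_mol.get_vocab().values())

# Under the offset scheme described above, SMILES ids should sit entirely
# above the AA ids, so the two id sets are expected to be disjoint.
assert aa_ids.isdisjoint(smiles_ids), "vocabularies overlap in id space"
print(f"AA id range: 0..{max(aa_ids)}; SMILES ids start at {min(smiles_ids)}")
```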
## How to use
```python
from transformers import AutoTokenizer

tokenizer_id = "contributor-anonymous/Mol2Pro-tokenizer"

# Load the paired tokenizers from their subfolders
tokenizer_mol = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="smiles")
tokenizer_aa = AutoTokenizer.from_pretrained(tokenizer_id, subfolder="aa")

# Example: tokenize a small molecule (ethanol) for the encoder
smiles = "CCO"
enc = tokenizer_mol(smiles, return_tensors="pt")
print("Encoder token ids:", enc.input_ids[0].tolist())
print("Encoder tokens:", tokenizer_mol.convert_ids_to_tokens(enc.input_ids[0]))

# Example: decode amino-acid token ids back into a protein sequence
aa_text = tokenizer_aa.decode([0, 1, 2], skip_special_tokens=True)
print("Decoded protein sequence:", aa_text)
```