ApeTokenizer-SMILES

ApeTokenizer-SMILES is an Atom Pair Encoding (APE) tokenizer for SMILES strings, trained on ~2M unique canonical SMILES from ChEMBL 36. It is the SMILES counterpart to ApeTokenizer-SELFIES, released alongside ModernMolBERT.

APE is a byte-pair-style merging scheme applied to SMILES symbol pieces (atoms, bonds, ring and branch symbols). Merges are frequency-driven, so a single token may span structural boundaries, for example `(=O)N(`. This matches the APE behaviour described in Leon et al.: token boundaries are not guaranteed to be balanced substructures.
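The merging idea can be illustrated with a toy sketch. This is not the released training code: the symbol regex, the function names (`split_symbols`, `learn_merges`, `apply_merge`) and the reading of `min_freq` and `max_pieces` as a frequency floor and a cap on symbol pieces per token are all assumptions made for illustration.

```python
import re
from collections import Counter

# Simplified symbol splitter (an assumption, not the tokenizer's real pre-tokenizer):
# bracket atoms, two-letter halogens, %nn ring closures, then any single character.
SYMBOL_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def split_symbols(smiles):
    """Split a SMILES string into symbol pieces."""
    return SYMBOL_RE.findall(smiles)


def apply_merge(seq, a, b):
    """Replace each adjacent occurrence of tokens (a, b) with the merged token a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(corpus, min_freq=2, max_pieces=6, max_merges=50):
    """Toy frequency-driven pair merging; each token is a tuple of symbol pieces."""
    seqs = [[(s,) for s in split_symbols(smi)] for smi in corpus]
    merges = []
    while len(merges) < max_merges:
        counts = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                # Hypothetical cap: skip pairs whose merge would exceed max_pieces.
                if len(a) + len(b) <= max_pieces:
                    counts[(a, b)] += 1
        if not counts:
            break
        (a, b), freq = counts.most_common(1)[0]
        if freq < min_freq:
            break
        merges.append((a, b))
        seqs = [apply_merge(seq, a, b) for seq in seqs]
    return merges, seqs


corpus = ["CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Nc1ccc(O)cc1", "CC(=O)O"]
merges, seqs = learn_merges(corpus, min_freq=2, max_pieces=6)
for seq in seqs:
    # Tokens are chosen by frequency alone, so parentheses inside a token may be unbalanced.
    print(["".join(tok) for tok in seq])
```

Because pairs are picked purely by count, nothing in the loop checks that a merged token is a chemically meaningful fragment, which is why unbalanced tokens can appear.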

Tokenizer Details

Developed by: Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
Input representation: SMILES (canonical)
Algorithm: Atom Pair Encoding (APE) — pair merging over SMILES symbol pieces
Vocabulary size: 1386
Max merge pieces: 6
Min merge frequency: 3000
Training corpus size: ~2M unique canonical SMILES (ChEMBL 36)
License: MIT
Repository: https://github.com/HauserGroup/ModernMolBERT

| Special token | ID |
|---|---|
| `<s>` (BOS) | 0 |
| `<pad>` | 1 |
| `</s>` (EOS) | 2 |
| `<unk>` | 3 |
| `<mask>` | 4 |
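To make the ID layout concrete, the following plain-Python sketch wraps ID sequences in BOS/EOS and right-pads a batch using the IDs listed above. The helper `wrap_and_pad` is hypothetical, the non-special IDs are placeholders, and whether the released tokenizer adds BOS/EOS automatically is not stated here, so treat this as an illustration of the ID assignments only.

```python
# Special-token IDs as listed in the table above.
SPECIAL_IDS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "<mask>": 4}


def wrap_and_pad(batch):
    """Add BOS/EOS around each ID sequence and right-pad to the longest one.

    Hypothetical helper for illustration; it does not truncate.
    """
    bos, eos, pad = SPECIAL_IDS["<s>"], SPECIAL_IDS["</s>"], SPECIAL_IDS["<pad>"]
    wrapped = [[bos, *ids, eos] for ids in batch]
    width = max(len(seq) for seq in wrapped)
    input_ids = [seq + [pad] * (width - len(seq)) for seq in wrapped]
    # 1 marks real tokens, 0 marks padding positions.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in wrapped]
    return input_ids, attention_mask


# Placeholder token IDs (10, 11, 12, 20) stand in for real vocabulary entries.
input_ids, attention_mask = wrap_and_pad([[10, 11, 12], [20]])
print(input_ids)
print(attention_mask)
```

Note that `<pad>` has ID 1 rather than 0, so code that assumes a zero padding ID would need adjusting.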

How to Get Started

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "HauserGroup/ApeTokenizer-SMILES",
    trust_remote_code=True,
    use_fast=False,
)

# A canonical SMILES — here aspirin.
smiles = "CC(=O)Oc1ccccc1C(=O)O"

tokens = tokenizer.tokenize(smiles)
print(tokens)
# ['CC(=O)', 'Oc1cc', 'ccc1', 'C(=O)O']

inputs = tokenizer(smiles, return_tensors="pt")
print(inputs["input_ids"])

Feed SMILES directly — no SELFIES conversion is needed. This is a standalone alternative tokenizer and is not the tokenizer used by the released SELFIES-based ModernMolBERT checkpoints.

Citation

@article{madsen_modernmolbert,
  title  = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
  author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
  year   = {2026}
}

The APE algorithm follows Leon et al., Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling, Sci. Rep. 14, 25016 (2024).
