ApeTokenizer-SMILES

ApeTokenizer-SMILES is an Atom Pair Encoding (APE) tokenizer for SMILES strings, trained on ~2M unique canonical SMILES from ChEMBL 36. It is the SMILES counterpart to ApeTokenizer-SELFIES, released alongside ModernMolBERT.

APE is a byte-pair-style merging scheme applied to SMILES symbol pieces (atoms, bonds, ring and branch symbols). Merges are frequency-driven, so a single token may span structural boundaries, for example (=O)N(. This matches the APE behaviour described in Leon et al.: token boundaries are not guaranteed to be balanced substructures.
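The merging idea can be illustrated with a minimal sketch of a single merge step. This is not the code from the repository: the splitting regex is a commonly used atom-level SMILES pattern assumed here for illustration, the three-molecule corpus is a toy, and the helper names (split_pieces, count_pairs, merge_pair) are hypothetical.

```python
import re
from collections import Counter

# Assumed atom-level pattern (bracket atoms, two-letter halogens, ring closures,
# single-letter atoms, digits, bond and branch symbols). Not taken from the repository.
SMILES_PIECE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/():.~@?>*$])"
)


def split_pieces(smiles):
    """Split a SMILES string into atom, bond, ring and branch symbols."""
    pieces = SMILES_PIECE.findall(smiles)
    if "".join(pieces) != smiles:
        raise ValueError(f"unrecognised character in SMILES: {smiles!r}")
    return pieces


def count_pairs(corpus):
    """Count how often each adjacent pair of pieces occurs across the corpus."""
    counts = Counter()
    for pieces in corpus:
        counts.update(zip(pieces, pieces[1:]))
    return counts


def merge_pair(pieces, pair):
    """Replace every adjacent occurrence of `pair` with one concatenated token."""
    merged, i = [], 0
    while i < len(pieces):
        if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == pair:
            merged.append(pieces[i] + pieces[i + 1])
            i += 2
        else:
            merged.append(pieces[i])
            i += 1
    return merged


# Toy corpus, for illustration only.
smiles_list = ["CC(=O)O", "CC(=O)N", "NC(=O)C"]
corpus = [split_pieces(s) for s in smiles_list]

# One merge step: pick the most frequent adjacent pair and fuse it everywhere.
best_pair, freq = count_pairs(corpus).most_common(1)[0]
corpus = [merge_pair(pieces, best_pair) for pieces in corpus]
print(best_pair, freq)
print(corpus)
```

Because the pair is chosen purely by frequency, nothing stops a merge such as "C" + "(" from producing a token with an unclosed branch, which is the behaviour described above.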

Tokenizer Details

  • Developed by: Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
  • Input representation: SMILES (canonical)
  • Algorithm: Atom Pair Encoding (APE) — pair merging over SMILES symbol pieces
  • Vocabulary size: 1386
  • Max merge pieces: 6
  • Min merge frequency: 3000
  • Training corpus size: ~2M unique canonical SMILES (ChEMBL 36)
  • License: MIT
  • Repository: https://github.com/HauserGroup/ModernMolBERT
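One plausible reading of the "Max merge pieces" and "Min merge frequency" settings is sketched below. It assumes that a merged token may contain at most max_pieces atomic symbols and that merging stops once the best remaining pair occurs fewer than min_freq times. Both interpretations, and the function names train_merges and apply_merge, are assumptions made for illustration and are not confirmed by the repository.

```python
from collections import Counter


def apply_merge(seq, left, right):
    """Replace each adjacent (left, right) occurrence in seq with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)  # tuple concatenation keeps the atomic pieces
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_merges(corpus, min_freq, max_pieces):
    """Learn merges from a corpus given as lists of atomic SMILES symbols.

    Assumed semantics: a pair is eligible only if the merged token would hold
    at most `max_pieces` atomic symbols, and training stops when the most
    frequent eligible pair occurs fewer than `min_freq` times.
    """
    # Each token is a tuple of atomic pieces, so len(token) is its piece count.
    seqs = [[(piece,) for piece in pieces] for pieces in corpus]
    merges = []
    while True:
        counts = Counter()
        for seq in seqs:
            for left, right in zip(seq, seq[1:]):
                if len(left) + len(right) <= max_pieces:
                    counts[(left, right)] += 1
        if not counts:
            break
        (left, right), freq = counts.most_common(1)[0]
        if freq < min_freq:
            break
        merges.append("".join(left + right))
        seqs = [apply_merge(seq, left, right) for seq in seqs]
    tokenized = [["".join(token) for token in seq] for seq in seqs]
    return merges, tokenized


# Toy run with small thresholds; the card reports 6 and 3000 on ~2M molecules.
toy_corpus = [list("CC(=O)O")] * 3
merges, tokenized = train_merges(toy_corpus, min_freq=3, max_pieces=3)
print(merges)
print(tokenized[0])
```

Under this reading, a high frequency threshold keeps rare fragments out of the vocabulary, while the piece cap bounds how long any single token can grow.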

Special token   ID
<s> (BOS)       0
<pad>           1
</s> (EOS)      2
<unk>           3
<mask>          4
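To show how these IDs are typically used, the sketch below wraps already-encoded sequences in BOS and EOS and pads them to a common length. It assumes a RoBERTa-style layout (<s> ... </s>), which the card does not state explicitly, and the token IDs in the example batch are made-up placeholders rather than real vocabulary entries. The helper name wrap_and_pad is hypothetical.

```python
# Special token IDs as listed in the table above.
BOS_ID, PAD_ID, EOS_ID, UNK_ID, MASK_ID = 0, 1, 2, 3, 4


def wrap_and_pad(batch_ids):
    """Add BOS/EOS to each sequence and right-pad the batch with PAD_ID.

    Assumes a <s> ... </s> layout; returns (input_ids, attention_mask).
    """
    wrapped = [[BOS_ID, *ids, EOS_ID] for ids in batch_ids]
    width = max(len(seq) for seq in wrapped)
    input_ids = [seq + [PAD_ID] * (width - len(seq)) for seq in wrapped]
    # 1 marks real tokens (including BOS/EOS), 0 marks padding.
    attention_mask = [
        [1] * len(seq) + [0] * (width - len(seq)) for seq in wrapped
    ]
    return input_ids, attention_mask


# Placeholder token IDs, for illustration only.
input_ids, attention_mask = wrap_and_pad([[10, 11, 12], [20]])
print(input_ids)
print(attention_mask)
```

In practice the tokenizer's own call with padding enabled would normally handle this; the sketch only makes the role of each ID explicit.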

How to Get Started

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "HauserGroup/ApeTokenizer-SMILES",
    trust_remote_code=True,
    use_fast=False,
)

# A canonical SMILES — here aspirin.
smiles = "CC(=O)Oc1ccccc1C(=O)O"

tokens = tokenizer.tokenize(smiles)
print(tokens)
# ['CC(=O)', 'Oc1cc', 'ccc1', 'C(=O)O']

inputs = tokenizer(smiles, return_tensors="pt")
print(inputs["input_ids"])

Feed SMILES directly; no SELFIES conversion is needed. This tokenizer is a standalone alternative and is not the one used by the released SELFIES-based ModernMolBERT checkpoints.

Citation

@article{madsen_modernmolbert,
  title  = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
  author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
  year   = {2026}
}

The APE algorithm follows Leon et al., Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling, Sci. Rep. 14, 25016 (2024).
