ApeTokenizer-SMILES
ApeTokenizer-SMILES is an Atom Pair Encoding (APE) tokenizer for SMILES strings, trained on ~2M unique canonical SMILES from ChEMBL 36. It is the SMILES counterpart to ApeTokenizer-SELFIES, released alongside ModernMolBERT.
APE is a byte-pair-style merging scheme applied to SMILES symbol pieces (atoms,
bonds, ring and branch symbols). Merges are frequency-driven, so a single token
may span structural boundaries (for example `(=O)N(`). This matches the APE
behaviour described in Leon et al.: token boundaries are not guaranteed to be
balanced sub-structures.
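To make the first step of such a scheme concrete, the sketch below splits a SMILES string into symbol pieces and counts adjacent pairs, which is the statistic a frequency-driven merge would rank. The regular expression `SYMBOL_RE` and the helpers `split_pieces` and `count_pairs` are illustrative assumptions, not the pre-tokenization used to train ApeTokenizer-SMILES.

```python
import re
from collections import Counter

# Assumed, simplified pattern for SMILES symbol pieces: bracket atoms,
# two-letter halogens, two-digit ring closures, then single characters.
SYMBOL_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/:~@.()*$]")


def split_pieces(smiles):
    """Split a SMILES string into symbol pieces (hypothetical helper)."""
    pieces = SYMBOL_RE.findall(smiles)
    # Refuse input that the simplified pattern cannot cover completely.
    if "".join(pieces) != smiles:
        raise ValueError(f"unrecognised characters in {smiles!r}")
    return pieces


def count_pairs(corpus):
    """Count adjacent symbol-piece pairs over an iterable of SMILES strings."""
    counts = Counter()
    for smiles in corpus:
        pieces = split_pieces(smiles)
        counts.update(zip(pieces, pieces[1:]))
    return counts


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
pieces = split_pieces(aspirin)
pair_counts = count_pairs([aspirin])
print(pieces)
print(pair_counts.most_common(3))
```

On a real corpus the most frequent pair would be merged into a new token and the counts recomputed, so merged tokens can start or end in the middle of a branch or ring.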
Tokenizer Details
- Developed by: Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
- Input representation: SMILES (canonical)
- Algorithm: Atom Pair Encoding (APE) — pair merging over SMILES symbol pieces
- Vocabulary size: 1386
- Max merge pieces: 6
- Min merge frequency: 3000
- Training corpus size: ~2M unique canonical SMILES (ChEMBL 36)
- License: MIT
- Repository: https://github.com/HauserGroup/ModernMolBERT
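The sketch below shows one way the two merge hyperparameters listed above could act during training. It assumes that "Max merge pieces" caps how many symbol pieces a single token may span and that "Min merge frequency" is the threshold below which merging stops; the model card does not spell out either meaning. The functions `learn_merges` and `merge_pair` and the toy corpus are hypothetical, and this is not the authors' training code.

```python
from collections import Counter


def merge_pair(seq, a, b):
    """Replace every adjacent occurrence of tokens a, b in seq with a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(corpus, min_freq, max_pieces):
    """Toy frequency-driven merge loop over pre-split symbol pieces.

    Each token is stored as a tuple of the symbol pieces it spans, so its
    length in pieces is simply len(token).
    """
    seqs = [[(piece,) for piece in mol] for mol in corpus]
    merges = []
    while True:
        counts = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                # Assumed reading of "max merge pieces": skip oversized merges.
                if len(a) + len(b) <= max_pieces:
                    counts[(a, b)] += 1
        if not counts:
            break
        # Most frequent pair; ties are broken lexicographically for determinism.
        a, b = min(counts, key=lambda pair: (-counts[pair], pair))
        # Assumed reading of "min merge frequency": stop below the threshold.
        if counts[(a, b)] < min_freq:
            break
        merges.append("".join(a + b))
        seqs = [merge_pair(seq, a, b) for seq in seqs]
    return merges, seqs


# Toy corpus, already split into single-character symbol pieces.
corpus = [list("CC(=O)O"), list("CC(=O)N"), list("CCO")]
merges, seqs = learn_merges(corpus, min_freq=2, max_pieces=3)
print(merges)  # ['CC', '(=', '(=O']
print(["".join(tok) for tok in seqs[0]])  # ['CC', '(=O', ')', 'O']
```

With the values reported for this tokenizer (6 and 3000), the same loop would stop once no pair spanning at most six pieces occurs at least 3000 times.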
| special token | id |
|---|---|
| `<s>` (BOS) | 0 |
| `<pad>` | 1 |
| `</s>` (EOS) | 2 |
| `<unk>` | 3 |
| `<mask>` | 4 |
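As an illustration of how these ids are typically used, the sketch below wraps token-id sequences in BOS/EOS and right-pads a batch with the `<pad>` id. The helpers `wrap` and `pad_batch` and the content ids (17, 42, 8, 99) are hypothetical; whether the tokenizer adds BOS/EOS itself, and which side it pads on, is not stated in this card.

```python
# Special-token ids taken from the table above.
BOS_ID, PAD_ID, EOS_ID, UNK_ID, MASK_ID = 0, 1, 2, 3, 4


def wrap(ids):
    """Surround a sequence of content ids with BOS and EOS."""
    return [BOS_ID] + list(ids) + [EOS_ID]


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad sequences to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Made-up content ids, used only to show the layout.
batch = [wrap([17, 42, 8]), wrap([99])]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[0, 17, 42, 8, 2], [0, 99, 2, 1, 1]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```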
How to Get Started
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "HauserGroup/ApeTokenizer-SMILES",
    trust_remote_code=True,
    use_fast=False,
)

# A canonical SMILES, here aspirin.
smiles = "CC(=O)Oc1ccccc1C(=O)O"

tokens = tokenizer.tokenize(smiles)
print(tokens)
# ['CC(=O)', 'Oc1cc', 'ccc1', 'C(=O)O']

inputs = tokenizer(smiles, return_tensors="pt")
print(inputs["input_ids"])
```
Feed SMILES directly — no SELFIES conversion is needed. This is a standalone alternative tokenizer and is not the tokenizer used by the released SELFIES-based ModernMolBERT checkpoints.
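The aspirin tokens shown above can be inspected without loading the tokenizer. The sketch below checks that they concatenate back to the input and reports, per token, whether parentheses are balanced and which ring-closure digits are left unpaired. The helpers `is_balanced` and `open_ring_digits` are hypothetical and deliberately simple: they ignore digits inside bracket atoms and `%nn` ring closures, which do not occur in this example.

```python
from collections import Counter

smiles = "CC(=O)Oc1ccccc1C(=O)O"
# Token list copied from the documented tokenizer output for aspirin.
tokens = ["CC(=O)", "Oc1cc", "ccc1", "C(=O)O"]


def is_balanced(token):
    """True if every '(' in the token is closed inside the same token."""
    depth = 0
    for ch in token:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def open_ring_digits(token):
    """Ring-closure digits that appear an odd number of times in the token."""
    counts = Counter(ch for ch in token if ch.isdigit())
    return sorted(digit for digit, n in counts.items() if n % 2)


# The tokens partition the input string exactly.
assert "".join(tokens) == smiles

for token in tokens:
    print(token, is_balanced(token), open_ring_digits(token))
```

In this example every token has balanced parentheses, but the aromatic ring is split across `Oc1cc` and `ccc1`, so neither token is a closed sub-structure on its own.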
Citation
```bibtex
@article{madsen_modernmolbert,
  title  = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
  author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
  year   = {2026}
}
```
The APE algorithm follows Leon et al., Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling, Sci. Rep. 14, 25016 (2024).