ApeTokenizer-SELFIES
ApeTokenizer-SELFIES is the Atom Pair Encoding (APE) tokenizer used by ModernMolBERT — a family of compact encoder-only transformer models for small-molecule representation learning pre-trained on SELFIES strings from ChEMBL 36.
APE is a byte-pair-style merging scheme applied directly to SELFIES bracket tokens, so every token boundary aligns with a chemically valid SELFIES primitive. The vocabulary is derived from ~2M unique SELFIES strings from ChEMBL 36.
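As a rough illustration of what pair merging over bracket tokens means, the sketch below splits a SELFIES string into its bracket primitives and then merges adjacent pairs that appear in a merge table. The `TOY_MERGES` table, the helper names, and the greedy left-to-right matching are assumptions made for this example; the actual vocabulary and matching rules are defined by the tokenizer in the repository.

```python
import re

# Hypothetical merge table for illustration only; the real vocabulary has 631 entries.
TOY_MERGES = {
    ("[C]", "[C]"),
    ("[=Branch1]", "[C]"),
    ("[=O]", "[O]"),
}


def split_selfies(selfies):
    """Split a SELFIES string into its bracket primitives."""
    return re.findall(r"\[[^\]]*\]", selfies)


def greedy_pair_merge(primitives, merges):
    """Walk left to right, merging two adjacent primitives when the pair is in `merges`."""
    tokens, i = [], 0
    while i < len(primitives):
        if i + 1 < len(primitives) and (primitives[i], primitives[i + 1]) in merges:
            tokens.append(primitives[i] + primitives[i + 1])
            i += 2
        else:
            tokens.append(primitives[i])
            i += 1
    return tokens


selfies_string = "[C][C][=Branch1][C][=O][O][C]"
primitives = split_selfies(selfies_string)
tokens = greedy_pair_merge(primitives, TOY_MERGES)
print(tokens)
# ['[C][C]', '[=Branch1][C]', '[=O][O]', '[C]']
```

Because merges only ever join whole primitives, concatenating the tokens reproduces the input string exactly, and no token can cut through a bracket.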
Tokenizer Details
- Developed by: Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
- Input representation: SELFIES (convert SMILES first; see below)
- Algorithm: Atom Pair Encoding (APE) — pair merging over SELFIES bracket tokens
- Vocabulary size: 631
- Max merge pieces: 2
- Min merge frequency: 3000
- Training corpus size: 2M unique SELFIES (ChEMBL 36)
- License: MIT
- Repository: https://github.com/HauserGroup/ModernMolBERT
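One possible reading of the merge settings listed above is sketched here: count adjacent pairs of primitives across the corpus, merge the most frequent pair, and stop once no pair reaches the minimum frequency. This is not the authors' training code. The function name `learn_pair_merges`, the `max_merges` cap, and the interpretation of "max merge pieces: 2" as "a merged token is never merged again" are all assumptions for illustration.

```python
from collections import Counter


def learn_pair_merges(corpus, min_frequency, max_merges):
    """Learn pair merges over sequences of SELFIES primitives (illustrative only).

    Assumption: a merged token consists of two pieces and is frozen afterwards.
    """
    # Track each entry as (token, is_merged) so merged tokens are skipped later.
    sequences = [[(p, False) for p in seq] for seq in corpus]
    merges = []
    while len(merges) < max_merges:
        counts = Counter()
        for seq in sequences:
            for (a, a_merged), (b, b_merged) in zip(seq, seq[1:]):
                if not a_merged and not b_merged:
                    counts[(a, b)] += 1
        if not counts:
            break
        (left, right), freq = counts.most_common(1)[0]
        if freq < min_frequency:
            break
        merges.append((left, right))
        # Rewrite every sequence, replacing the chosen pair with one merged token.
        rewritten = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if (i + 1 < len(seq) and seq[i] == (left, False)
                        and seq[i + 1] == (right, False)):
                    out.append((left + right, True))
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            rewritten.append(out)
        sequences = rewritten
    return merges


# Toy corpus of pre-split SELFIES strings; the real corpus has about 2M entries.
corpus = [
    ["[C]", "[C]", "[O]"],
    ["[C]", "[C]", "[N]"],
    ["[C]", "[C]", "[O]"],
    ["[N]", "[C]", "[=O]"],
    ["[O]", "[C]", "[=O]"],
]
merges = learn_pair_merges(corpus, min_frequency=2, max_merges=10)
print(merges)
# [('[C]', '[C]'), ('[C]', '[=O]')]
```

Raising `min_frequency` to 3 in this toy corpus keeps only the `[C][C]` merge, which is the role the threshold of 3000 would play on the full corpus under this reading.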
| special token | id |
|---|---|
| `<s>` (BOS) | 0 |
| `<pad>` | 1 |
| `</s>` (EOS) | 2 |
| `<unk>` | 3 |
| `<mask>` | 4 |
How to Get Started
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"HauserGroup/ApeTokenizer-SELFIES",
trust_remote_code=True,
use_fast=False,
)
# A SELFIES string — here aspirin.
selfies = "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]"
tokens = tokenizer.tokenize(selfies)
print(tokens)
# ['[C][C]', '[=Branch1][C]', '[=O][O]', '[C][=C]', '[C][=C]', '[C][=C]', '[Ring1][=Branch1]', '[C][=Branch1]', '[C][=O]', '[O]']
inputs = tokenizer(selfies, return_tensors="pt")
print(inputs["input_ids"])
# tensor([[ 0, 334, 335, 370, 333, 333, 333, 338, 377, 511, 6, 2]])
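The first and last ids in that output are the BOS (`0`) and EOS (`2`) tokens from the special-token table. To show how the `<pad>` id fits in when sequences of different lengths are batched, here is a minimal plain-Python sketch. The `pad_batch` helper, the right-padding choice, and the shorter second sequence are assumptions for illustration, not part of the tokenizer.

```python
# Special-token ids taken from the table in this card.
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad id sequences to a common length and build attention masks."""
    width = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch]
    return input_ids, attention_mask


# Ids printed for the aspirin example above.
aspirin = [0, 334, 335, 370, 333, 333, 333, 338, 377, 511, 6, 2]
# Hypothetical shorter sequence, built only from ids that appear above.
short = [BOS_ID, 334, 6, EOS_ID]

input_ids, attention_mask = pad_batch([aspirin, short])
print(input_ids[1])
# [0, 334, 6, 2, 1, 1, 1, 1, 1, 1, 1, 1]
print(attention_mask[1])
# [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

In practice, passing a list of SELFIES strings to the tokenizer with `padding=True` is the usual Transformers way to get padded batches; the sketch only makes the role of the ids explicit.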
If you start from SMILES, convert first:
import selfies
smi = "CC(=O)Oc1ccccc1C(=O)O"
sf = selfies.encoder(smi) # '[C][C][=Branch1][C][=O][O][C]...'
inputs = tokenizer(sf, return_tensors="pt")
Using with ModernMolBERT models
This tokenizer is shared by all four ModernMolBERT checkpoints. Load it from
the model repo with subfolder="ape_tokenizer"; without that argument,
AutoTokenizer would be routed to the built-in fast ModernBERT tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"HauserGroup/ModernMolBERT-small",
subfolder="ape_tokenizer",
trust_remote_code=True,
use_fast=False,
)
Or load this standalone repo directly as shown above — both produce identical tokenizations.
Citation
@article{madsen_modernmolbert,
title = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
year = {2026}
}
The APE algorithm follows Leon et al., Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling, Sci. Rep. 14, 25016 (2024).