
	---
	license: mit
	library_name: transformers
	tags:
	- chemistry
	- molecules
	- selfies
	- ape-tokenizer
	- tokenizer
	---

	# ApeTokenizer-SELFIES

	ApeTokenizer-SELFIES is the Atom Pair Encoding (APE) tokenizer used by
	[ModernMolBERT](https://github.com/HauserGroup/ModernMolBERT) — a family of
	compact encoder-only transformer models for small-molecule representation
	learning pre-trained on SELFIES strings from ChEMBL 36.

	APE is a byte-pair-style merging scheme applied to whole SELFIES bracket
	tokens (symbols such as `[C]` or `[=Branch1]`) rather than to characters, so
	every token boundary falls between two bracket tokens and none is ever split.
	The vocabulary is learned from ~2M unique SELFIES strings from ChEMBL 36.
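
The boundary property can be illustrated with a minimal sketch. It is not the tokenizer's actual code: `split_selfies`, `greedy_tokenize`, and `TOY_VOCAB` are hypothetical names, the greedy left-to-right longest-match strategy is an assumption, and the vocabulary fragment contains only the merged pairs visible in the aspirin example under "How to Get Started".

```python
import re

# A SELFIES symbol is one bracketed unit such as "[C]" or "[=Branch1]".
SYMBOL_RE = re.compile(r"\[[^\[\]]+\]")


def split_selfies(text):
    """Split a SELFIES string into its bracketed symbols."""
    symbols = SYMBOL_RE.findall(text)
    if "".join(symbols) != text:
        raise ValueError(f"not a plain SELFIES string: {text!r}")
    return symbols


def greedy_tokenize(text, vocab, max_pieces=2):
    """Greedy left-to-right longest match over whole symbols (assumed strategy)."""
    symbols = split_selfies(text)
    tokens, i = [], 0
    while i < len(symbols):
        # Try the longest candidate first, then fall back to shorter ones.
        for n in range(min(max_pieces, len(symbols) - i), 0, -1):
            candidate = "".join(symbols[i:i + n])
            # Single symbols are always kept here; a real tokenizer would
            # map unknown ones to <unk> instead.
            if n == 1 or candidate in vocab:
                tokens.append(candidate)
                i += n
                break
    return tokens


# Hypothetical vocabulary fragment, not the real 631-entry vocabulary.
TOY_VOCAB = {
    "[C][C]", "[=Branch1][C]", "[=O][O]", "[C][=C]",
    "[Ring1][=Branch1]", "[C][=Branch1]", "[C][=O]",
}

aspirin = (
    "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C]"
    "[Ring1][=Branch1][C][=Branch1][C][=O][O]"
)
tokens = greedy_tokenize(aspirin, TOY_VOCAB)
print(tokens)
```

Because tokens are built only by concatenating whole symbols, joining them back always reproduces the input string exactly.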

	## Tokenizer Details

	- Developed by: Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
	- Input representation: SELFIES (convert SMILES first; see below)
	- Algorithm: Atom Pair Encoding (APE) — pair merging over SELFIES bracket tokens
	- Vocabulary size: 631
	- Max merge pieces: 2
	- Min merge frequency: 3000
	- Training corpus size: ~2M unique SELFIES strings (ChEMBL 36)
	- License: MIT
	- Repository: https://github.com/HauserGroup/ModernMolBERT
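
To show how the two merge settings listed above interact, here is a toy pair-merging loop. It is a sketch under stated assumptions, not the training code from the repository: `learn_merges` and `SYMBOL_RE` are hypothetical names, the stopping rule and tie-breaking are guesses, and the three-pair corpus is made up so that a threshold of 3 stands in for the real value of 3000.

```python
import re
from collections import Counter

SYMBOL_RE = re.compile(r"\[[^\[\]]+\]")


def learn_merges(corpus, max_pieces=2, min_freq=3000):
    """Toy pair merging over SELFIES symbols; returns the learned vocabulary."""
    # Each sequence is a list of tokens; each token is a tuple of symbols.
    seqs = [[(s,) for s in SYMBOL_RE.findall(text)] for text in corpus]
    vocab = {tok for seq in seqs for tok in seq}
    while True:
        # Count adjacent pairs whose merge would stay within max_pieces.
        pairs = Counter()
        for seq in seqs:
            for left, right in zip(seq, seq[1:]):
                if len(left) + len(right) <= max_pieces:
                    pairs[(left, right)] += 1
        if not pairs:
            break
        (left, right), freq = pairs.most_common(1)[0]
        if freq < min_freq:
            break  # assumed stopping rule: no pair is frequent enough
        merged = left + right
        vocab.add(merged)
        # Apply the merge left to right in every sequence.
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return {"".join(tok) for tok in vocab}


corpus = ["[C][=C][C][=C]", "[C][=C][O]"]
vocab = learn_merges(corpus, max_pieces=2, min_freq=3)
print(sorted(vocab))
# ['[=C]', '[C]', '[C][=C]', '[O]']
```

With `max_pieces=2`, a merged pair can never be merged again, which matches the example tokenization in this card, where every token spans at most two symbols.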

	| Special token | ID |
	|---------------|----|
	| `<s>` (BOS)   | 0  |
	| `<pad>`       | 1  |
	| `</s>` (EOS)  | 2  |
	| `<unk>`       | 3  |
	| `<mask>`      | 4  |
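
As a rough illustration of how these ids frame a sequence, the sketch below wraps token ids in BOS/EOS and pads them. `frame` and `SPECIAL_IDS` are hypothetical names, right-side padding is an assumption rather than something this card states, and the token ids are the ones printed for aspirin under "How to Get Started".

```python
# Special-token ids copied from the table above.
SPECIAL_IDS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "<mask>": 4}


def frame(ids, max_length=None):
    """Wrap ids in BOS/EOS, then right-pad to max_length (assumed padding side)."""
    framed = [SPECIAL_IDS["<s>"], *ids, SPECIAL_IDS["</s>"]]
    attention_mask = [1] * len(framed)
    if max_length is not None:
        if len(framed) > max_length:
            raise ValueError("sequence longer than max_length")
        pad = max_length - len(framed)
        framed += [SPECIAL_IDS["<pad>"]] * pad
        attention_mask += [0] * pad  # padding positions are masked out
    return framed, attention_mask


# Ids of the ten aspirin tokens, without BOS/EOS.
aspirin_ids = [334, 335, 370, 333, 333, 333, 338, 377, 511, 6]
input_ids, mask = frame(aspirin_ids, max_length=16)
print(input_ids)
# [0, 334, 335, 370, 333, 333, 333, 338, 377, 511, 6, 2, 1, 1, 1, 1]
print(mask)
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

In practice the tokenizer call handles this framing itself; the sketch only makes the role of each id explicit.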

	## How to Get Started

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained(
	    "HauserGroup/ApeTokenizer-SELFIES",
	    trust_remote_code=True,
	    use_fast=False,
	)

	# A SELFIES string — here aspirin.
	selfies = "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]"

	tokens = tokenizer.tokenize(selfies)
	print(tokens)
	# ['[C][C]', '[=Branch1][C]', '[=O][O]', '[C][=C]', '[C][=C]', '[C][=C]', '[Ring1][=Branch1]', '[C][=Branch1]', '[C][=O]', '[O]']

	inputs = tokenizer(selfies, return_tensors="pt")
	print(inputs["input_ids"])
	# tensor([[ 0, 334, 335, 370, 333, 333, 333, 338, 377, 511, 6, 2]])
	```

	If you start from SMILES, convert first:

	```python
	import selfies  # pip install selfies

	# Reuses `tokenizer` from the snippet above.
	smi = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
	sf = selfies.encoder(smi)  # '[C][C][=Branch1][C][=O][O][C]...'
	inputs = tokenizer(sf, return_tensors="pt")
	```

	### Using with ModernMolBERT models

	This tokenizer is shared by all four ModernMolBERT checkpoints. When loading
	it from a model repo, pass `subfolder="ape_tokenizer"` so that
	`AutoTokenizer` is not routed to the built-in fast ModernBERT tokenizer:

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained(
	    "HauserGroup/ModernMolBERT-small",
	    subfolder="ape_tokenizer",
	    trust_remote_code=True,
	    use_fast=False,
	)
	```

	Alternatively, load this standalone repo directly as shown above. Both routes
	produce identical tokenizations.

	## Citation

	```bibtex
	@article{madsen_modernmolbert,
	  title  = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
	  author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
	  year   = {2026}
	}
	```

	The APE algorithm follows Leon et al., *Comparing SMILES and SELFIES
	tokenization for enhanced chemical language modeling*, Sci. Rep. 14, 25016 (2024).