	---
	license: mit
	library_name: transformers
	tags:
	- chemistry
	- molecules
	- smiles
	- ape-tokenizer
	- tokenizer
	---

	# ApeTokenizer-SMILES

	ApeTokenizer-SMILES is an Atom Pair Encoding (APE) tokenizer for SMILES
	strings, trained on ~2M unique canonical SMILES from ChEMBL 36. It is
	the SMILES counterpart to
	[ApeTokenizer-SELFIES](https://huggingface.co/HauserGroup/ApeTokenizer-SELFIES),
	released alongside
	[ModernMolBERT](https://github.com/HauserGroup/ModernMolBERT).

	APE is a byte-pair-style merging scheme applied to SMILES symbol pieces (atoms,
	bonds, ring and branch symbols). Merges are frequency-driven, so a single token
	may span structural boundaries (for example `(=O)N(`). This matches the APE
	behaviour described in Leon et al.: token boundaries are not guaranteed to
	delimit balanced substructures.
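
The merging idea can be illustrated with a toy sketch. This is not the released tokenizer's implementation: the splitting regex, the function names `split_pieces` and `learn_merges`, the tie-breaking order, and the tiny corpus are all assumptions made for illustration. Only the general scheme (frequency-driven pair merges over symbol pieces, capped by a piece budget and a minimum frequency) comes from the description above.

```python
import re
from collections import Counter

# Simplified splitter for SMILES symbol pieces: bracket atoms, two-letter
# halogens, two-digit ring closures, single atoms, digits, bonds, branches.
# This pattern is an assumption; characters it does not match are dropped.
PIECE_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/()@.:~*$]")


def split_pieces(smiles):
    """Split a SMILES string into atom-level symbol pieces."""
    return PIECE_RE.findall(smiles)


def learn_merges(corpus, min_freq=2, max_pieces=6, max_merges=50):
    """Greedy, frequency-driven pair merging (toy version).

    Each token is a (text, n_pieces) tuple so the piece budget can be enforced.
    """
    words = [[(p, 1) for p in split_pieces(s)] for s in corpus]
    merges = []
    for _ in range(max_merges):
        # Count adjacent pairs whose merged token stays within the budget.
        counts = Counter()
        for word in words:
            for left, right in zip(word, word[1:]):
                if left[1] + right[1] <= max_pieces:
                    counts[(left, right)] += 1
        if not counts:
            break
        (left, right), freq = counts.most_common(1)[0]
        if freq < min_freq:
            break
        merged = (left[0] + right[0], left[1] + right[1])
        merges.append((left[0], right[0]))
        # Apply the chosen merge left to right in every word.
        for idx, word in enumerate(words):
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == left and word[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            words[idx] = out
    return merges, [[text for text, _ in word] for word in words]


# Tiny illustrative corpus; the real tokenizer was trained on ~2M SMILES.
corpus = ["CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Nc1ccc(O)cc1", "CC(=O)O", "c1ccccc1"]
merges, tokenized = learn_merges(corpus, min_freq=2, max_pieces=6)
for smiles, tokens in zip(corpus, tokenized):
    print(smiles, tokens)
```

Because the merge choice depends only on pair counts, nothing in this sketch prevents a token from containing an unmatched `(` or a lone ring-closure digit, which is the boundary-spanning behaviour noted above.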

	## Tokenizer Details

	- Developed by: Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
	- Input representation: SMILES (canonical)
	- Algorithm: Atom Pair Encoding (APE), i.e. pair merging over SMILES symbol pieces
	- Vocabulary size: 1386
	- Max merge pieces: 6
	- Min merge frequency: 3000
	- Training corpus size: ~2M unique canonical SMILES (ChEMBL 36)
	- License: MIT
	- Repository: https://github.com/HauserGroup/ModernMolBERT

	| Special token | ID |
	|---------------|----|
	| `<s>` (BOS) | 0 |
	| `<pad>` | 1 |
	| `</s>` (EOS) | 2 |
	| `<unk>` | 3 |
	| `<mask>` | 4 |
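
As a rough illustration of how these ids typically end up in a model input, the sketch below wraps token-id sequences in BOS/EOS and right-pads a batch. The ids are taken from the table; the `<s> ... </s>` layout, right-side padding, the helper name `wrap_and_pad`, and the example content ids are assumptions, and the tokenizer itself may handle this differently.

```python
# Special-token ids copied from the table above.
BOS_ID, PAD_ID, EOS_ID, UNK_ID, MASK_ID = 0, 1, 2, 3, 4


def wrap_and_pad(batch):
    """Wrap each id sequence as <s> ... </s>, then right-pad with <pad>.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding positions.
    """
    wrapped = [[BOS_ID, *ids, EOS_ID] for ids in batch]
    width = max(len(seq) for seq in wrapped)
    input_ids = [seq + [PAD_ID] * (width - len(seq)) for seq in wrapped]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in wrapped]
    return input_ids, attention_mask


# Hypothetical content ids; real ids come from the tokenizer's vocabulary.
input_ids, attention_mask = wrap_and_pad([[17, 42, 99], [23]])
print(input_ids)
print(attention_mask)
```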

	## How to Get Started

	```python
	from transformers import AutoTokenizer

	# trust_remote_code=True is needed because the tokenizer class ships with the repo.
	tokenizer = AutoTokenizer.from_pretrained(
	    "HauserGroup/ApeTokenizer-SMILES",
	    trust_remote_code=True,
	    use_fast=False,
	)

	# A canonical SMILES string (aspirin).
	smiles = "CC(=O)Oc1ccccc1C(=O)O"

	tokens = tokenizer.tokenize(smiles)
	print(tokens)
	# ['CC(=O)', 'Oc1cc', 'ccc1', 'C(=O)O']

	# return_tensors="pt" requires PyTorch to be installed.
	inputs = tokenizer(smiles, return_tensors="pt")
	print(inputs["input_ids"])
	```

	Feed SMILES directly; no SELFIES conversion is needed. This is a standalone
	alternative tokenizer, not the one used by the released SELFIES-based
	ModernMolBERT checkpoints.

	## Citation

	```bibtex
	@article{madsen_modernmolbert,
	  title = {ModernMolBERT: A ModernBERT Encoder Family for SELFIES Molecular Language Modeling},
	  author = {Madsen, Jakob S. and Angelucci, Sara and Hauser, Alexander S.},
	  year = {2026}
	}
	```

	The APE algorithm follows Leon et al., *Comparing SMILES and SELFIES
	tokenization for enhanced chemical language modeling*, Sci. Rep. 14, 25016 (2024).