HauserGroup/ModernMolBERT-small
ModernMolBERT is a compact ModernBERT encoder pre-trained from scratch with masked language modeling on ~2.4M SELFIES strings from ChEMBL 36, using a chemically aware Atom Pair Encoding (APE) tokenizer. It expects SELFIES input and produces general-purpose molecular embeddings.
Model Details
- Developed by: Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
- Model type: ModernBERT encoder — molecular embedding model trained with masked language modeling
- Input representation: SELFIES (convert SMILES first; see below)
- Tokenizer: Atom Pair Encoding (APE) over SELFIES primitives
- Pre-training data: ChEMBL 36 (~2.4M unique small molecules)
- License: MIT
- Repository: https://github.com/HauserGroup/ModernMolBERT
| field | value |
|---|---|
| model_type | modernbert |
| vocab_size | 631 |
| hidden_size | 512 |
| num_hidden_layers | 8 |
| num_attention_heads | 8 |
| intermediate_size | 2048 |
| max_position_embeddings | 128 |
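A couple of derived quantities follow directly from the table. The sketch below copies the values above into a plain dictionary (the `config` name is illustrative, not an object loaded from the checkpoint) and assumes the usual transformer convention that the hidden size is split evenly across attention heads.

```python
# Values copied from the configuration table above; this dict is a stand-in,
# not the object returned by the Transformers library.
config = {
    "vocab_size": 631,
    "hidden_size": 512,
    "num_hidden_layers": 8,
    "num_attention_heads": 8,
    "intermediate_size": 2048,
    "max_position_embeddings": 128,
}

# Per-head dimension, assuming hidden_size is divided evenly across heads.
assert config["hidden_size"] % config["num_attention_heads"] == 0
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Size of the token embedding matrix (one hidden_size vector per vocabulary entry).
token_embedding_params = config["vocab_size"] * config["hidden_size"]

print(f"head_dim = {head_dim}")                              # head_dim = 64
print(f"token_embedding_params = {token_embedding_params}")  # token_embedding_params = 323072
```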
How to Get Started with the Model
The model consumes SELFIES strings tokenized with the APE tokenizer. The main output for molecular representation learning is the first-token embedding:
# pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer
repo = 'HauserGroup/ModernMolBERT-small'
model = AutoModel.from_pretrained(repo).eval()
tokenizer = AutoTokenizer.from_pretrained(
    repo,
    subfolder='ape_tokenizer',
    trust_remote_code=True,
    use_fast=False,
)
# A SELFIES string (one bracketed token per primitive); here psilocybin.
selfies = '[C][N][Branch1][C][C][C][C][C][=C][NH1][C][=C][C][=C][C][Branch1][#Branch2][O][P][=Branch1][C][=O][Branch1][C][O][O][=C][Ring1][=C][Ring1][O]'
inputs = tokenizer(selfies, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0]
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
embedding_preview = [round(x, 4) for x in embedding[0, :5].tolist()]
print(f"Token IDs:\n{inputs['input_ids'][0].tolist()}\n")
print(f"Tokens:\n{tokens}\n")
print(f"Embedding shape: {tuple(embedding.shape)}")
print(f"Embedding first 5 values:\n{embedding_preview}")
Output:
Token IDs:
[0, 352, 336, 334, 334, 7, 406, 388, 388, 392, 489, 335, 18, 336, 426, 482, 482, 6, 2]
Tokens:
['<s>', '[C][N]', '[Branch1][C]', '[C][C]', '[C][C]', '[=C]', '[NH1][C]', '[=C][C]', '[=C][C]', '[Branch1][#Branch2]', '[O][P]', '[=Branch1][C]', '[=O]', '[Branch1][C]', '[O][O]', '[=C][Ring1]', '[=C][Ring1]', '[O]', '</s>']
Embedding shape: (1, 512)
Embedding first 5 values:
[-0.1029, 0.2197, -0.0518, -0.7983, -0.6783]
If you start from SMILES, convert it to SELFIES first (e.g. the selfies package: selfies.encoder("CC(=O)Oc1ccccc1C(=O)O")).
For masked-token predictions, load the same checkpoint with AutoModelForMaskedLM:
from transformers import AutoModelForMaskedLM
mlm = AutoModelForMaskedLM.from_pretrained(repo)
logits = mlm(**inputs).logits
print(f"Logits shape: {tuple(logits.shape)}")
Output:
Logits shape: (1, 19, 631)
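The shapes reported above can be cross-checked against the configuration table without loading the model. This is a small consistency sketch: the token IDs are copied from the example output, and the variable names are illustrative.

```python
# Token IDs copied from the example output above (including <s> and </s>).
token_ids = [0, 352, 336, 334, 334, 7, 406, 388, 388, 392,
             489, 335, 18, 336, 426, 482, 482, 6, 2]

vocab_size = 631    # from the configuration table
hidden_size = 512   # from the configuration table
batch_size = 1      # a single SELFIES string was tokenized

seq_len = len(token_ids)

# Every token ID must index into the vocabulary.
assert all(0 <= t < vocab_size for t in token_ids)

# First-token embedding: one hidden_size vector per sequence.
expected_embedding_shape = (batch_size, hidden_size)
# MLM logits: one score per vocabulary entry at every token position.
expected_logits_shape = (batch_size, seq_len, vocab_size)

print(expected_embedding_shape)  # (1, 512)
print(expected_logits_shape)     # (1, 19, 631)
```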
Current Transformers releases disable custom root tokenizers for model_type='modernbert' before loading auto_map, so the tokenizer must be loaded from ape_tokenizer/. The root tokenizer files are also shipped for forward compatibility.
Uses
- Direct use: molecular embeddings for property prediction, similarity search, clustering, and retrieval; masked-token fill-in.
- Downstream use: fine-tuning for molecular classification or regression on SELFIES inputs.
- Out of scope: natural-language text; generating valid SMILES; 3D/conformer-dependent tasks.
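For the similarity-search use case, one common approach (an assumption here, not something the model card prescribes) is to rank molecules by cosine similarity between their embeddings. The dependency-free sketch below uses toy 3-dimensional vectors as placeholders; in practice each vector would be a 512-dimensional embedding, for example obtained with `embedding[0].tolist()`. The function and molecule names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Placeholder vectors, NOT real model output.
library = {
    "mol_a": [1.0, 0.0, 0.0],
    "mol_b": [0.0, 1.0, 0.0],
    "mol_c": [0.9, 0.1, 0.0],
}
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, library)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

For large libraries, a vectorized implementation (for instance with PyTorch or NumPy) or an approximate nearest-neighbor index would likely be more practical than this pure-Python loop.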
Bias, Risks, and Limitations
The model was pre-trained only on drug-like chemistry from ChEMBL 36, so it may not generalize to natural products, agrochemicals, fragments, or other under-represented regions of chemical space. Performance depends on the downstream task and adaptation strategy. The model has no access to 3D or conformer information.