
HauserGroup/ModernMolBERT-base

ModernMolBERT is a compact ModernBERT encoder pre-trained from scratch with masked language modeling on ~2.4M SELFIES strings from ChEMBL 36, using a chemically aware Atom Pair Encoding (APE) tokenizer. It expects SELFIES input and produces general-purpose molecular embeddings.

Model Details

  • Developed by: Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
  • Model type: ModernBERT encoder (molecular embedding model trained with masked language modeling)
  • Input representation: SELFIES (convert SMILES first; see below)
  • Tokenizer: Atom Pair Encoding (APE) over SELFIES primitives
  • Pre-training data: ChEMBL 36 (~2.4M unique small molecules)
  • License: MIT
  • Repository: https://github.com/HauserGroup/ModernMolBERT

Field                      Value
model_type                 modernbert
vocab_size                 631
hidden_size                768
num_hidden_layers          12
num_attention_heads        12
intermediate_size          3072
max_position_embeddings    128

How to Get Started with the Model

The model consumes SELFIES strings tokenized with the APE tokenizer. The main output for molecular representation learning is the first-token embedding:

# pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

repo = 'HauserGroup/ModernMolBERT-base'
model = AutoModel.from_pretrained(repo).eval()
tokenizer = AutoTokenizer.from_pretrained(
    repo,
    subfolder='ape_tokenizer',
    trust_remote_code=True,
    use_fast=False,
)

# A SELFIES string (one bracketed token per primitive); here aspirin.
selfies = '[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]'

inputs = tokenizer(selfies, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
    embedding = outputs.last_hidden_state[:, 0]

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
print(f"Token IDs:\n{inputs['input_ids'][0].tolist()}\n")
print(f"Tokens:\n{tokens}\n")
print(f"Embedding shape: {tuple(embedding.shape)}")

If you start from SMILES, convert it to SELFIES first, for example with the selfies package: selfies.encoder("CC(=O)Oc1ccccc1C(=O)O").
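
The configuration lists max_position_embeddings as 128, so it can help to check the length of a SELFIES string before tokenizing it. The following is a minimal, dependency-free sketch rather than part of the model's tooling. It assumes that the APE tokenizer only merges SELFIES primitives and never splits them, so the primitive count is an upper bound on the token count, and that two special tokens are added per sequence. Neither assumption is confirmed by this card, and split_selfies and fits_context are hypothetical helper names.

```python
import re

# Assumption: every SELFIES primitive is one bracketed symbol such as [C] or [=Branch1].
SELFIES_SYMBOL = re.compile(r"\[[^\[\]]+\]")

MAX_POSITIONS = 128  # max_position_embeddings from the configuration


def split_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed primitives."""
    symbols = SELFIES_SYMBOL.findall(selfies)
    # Reject input with characters outside brackets (for example raw SMILES).
    if "".join(symbols) != selfies:
        raise ValueError(f"not a plain SELFIES string: {selfies!r}")
    return symbols


def fits_context(selfies: str, reserved: int = 2) -> bool:
    """Conservative check: primitives plus reserved special tokens fit in MAX_POSITIONS.

    reserved=2 is an assumption (one start token and one end token).
    """
    return len(split_selfies(selfies)) + reserved <= MAX_POSITIONS


aspirin = (
    "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C]"
    "[Ring1][=Branch1][C][=Branch1][C][=O][O]"
)
print(len(split_selfies(aspirin)), fits_context(aspirin))
```

Because merging can only shorten the sequence under the stated assumption, a string that passes this check should also fit after tokenization; a string that fails may still fit, so the authoritative length is the one the tokenizer itself reports.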

For masked-token predictions, load the same checkpoint with AutoModelForMaskedLM:

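from transformers import AutoModelForMaskedLM

# Reuses `repo`, `inputs`, and `torch` from the example above.
mlm = AutoModelForMaskedLM.from_pretrained(repo).eval()
with torch.no_grad():
    logits = mlm(**inputs).logits  # (batch, sequence_length, vocab_size)
print(f"Logits shape: {tuple(logits.shape)}")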
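
The logits hold one score per vocabulary entry at every position, so a masked-token prediction amounts to a softmax over a single row followed by a ranking. The sketch below illustrates that step on a toy row in plain Python; top_k_tokens is a hypothetical helper, and the four-entry vocabulary and scores are made up for illustration. With real outputs, one would likely pass something like logits[0, i].tolist() for the masked position i, together with the tokenizer's id-to-token mapping.

```python
import math


def top_k_tokens(logits_row, id_to_token, k=3):
    """Return the k most probable (token, probability) pairs for one position."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits_row)
    exps = [math.exp(x - peak) for x in logits_row]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank vocabulary ids by probability, highest first.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id_to_token[i], probs[i]) for i in ranked[:k]]


# Hypothetical toy vocabulary and scores, for illustration only; the real
# model has a 631-entry vocabulary and different ids.
toy_vocab = ["[C]", "[O]", "[N]", "[=O]"]
toy_logits = [2.0, 1.0, 0.1, -1.0]

for token, prob in top_k_tokens(toy_logits, toy_vocab, k=2):
    print(f"{token}: {prob:.3f}")
```

On tensors, torch.softmax followed by torch.topk performs the same computation in one pass.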

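Note on tokenizer loading: current Transformers releases disable custom tokenizers at the repository root for model_type='modernbert' before auto_map is read. The tokenizer must therefore be loaded from the ape_tokenizer/ subfolder (subfolder='ape_tokenizer'), as in the first example. The root tokenizer files are still shipped for forward compatibility.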

Training

Uses

  • Direct use: molecular embeddings for property prediction, similarity search, clustering, and retrieval; masked-token fill-in.
  • Downstream use: fine-tuning for molecular classification or regression on SELFIES inputs.
  • Out of scope: natural-language text; generating valid SMILES; 3D/conformer-dependent tasks.
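
For the similarity-search and retrieval uses listed above, a common choice is to rank candidates by cosine similarity between embeddings. This card does not prescribe a metric, so treat the following as one reasonable option rather than the authors' method. The sketch is dependency-free; cosine_similarity and rank_by_similarity are hypothetical helpers, and the 3-dimensional vectors stand in for real 768-dimensional first-token embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, library):
    """Score each name -> vector entry against the query, best match first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical 3-dimensional stand-ins; real embeddings have hidden_size = 768.
query = [1.0, 0.0, 1.0]
library = {
    "mol_a": [0.9, 0.1, 1.1],
    "mol_b": [-1.0, 0.5, 0.0],
    "mol_c": [0.0, 1.0, 0.0],
}

for name, score in rank_by_similarity(query, library):
    print(f"{name}: {score:+.3f}")
```

With embeddings held as tensors, torch.nn.functional.cosine_similarity computes the same quantity in batch.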

Bias, Risks, and Limitations

The model was pre-trained only on drug-like chemistry from ChEMBL 36 and may not generalize to natural products, agrochemicals, fragments, or other under-represented regions of chemical space. Performance depends on the downstream task and adaptation strategy. The model has no access to 3D or conformer information.
