---
license: mit
library_name: transformers
pipeline_tag: fill-mask
tags:
- chemistry
- molecules
- selfies
- ape-tokenizer
- modernbert
- masked-language-modeling
---
# HauserGroup/ModernMolBERT-base
ModernMolBERT is a compact ModernBERT encoder pre-trained from scratch with masked language modeling on ~2.4M SELFIES strings from ChEMBL 36, using a chemically aware Atom Pair Encoding (APE) tokenizer. It expects SELFIES input and produces general-purpose molecular embeddings.
## Model Details
- **Developed by:** Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
- **Model type:** ModernBERT encoder — molecular embedding model trained with masked language modeling
- **Input representation:** SELFIES (convert SMILES first; see below)
- **Tokenizer:** Atom Pair Encoding (APE) over SELFIES primitives
- **Pre-training data:** ChEMBL 36 (~2.4M unique small molecules)
- **License:** MIT
- **Repository:** https://github.com/HauserGroup/ModernMolBERT

Key configuration values:

| field | value |
|-------|-------|
| model_type | modernbert |
| vocab_size | 631 |
| hidden_size | 768 |
| num_hidden_layers | 12 |
| num_attention_heads | 12 |
| intermediate_size | 3072 |
| max_position_embeddings | 128 |
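
With `max_position_embeddings` set to 128, it can help to estimate input length before tokenizing. The sketch below is a minimal, standard-library-only pre-check, not part of the model's API: it splits a SELFIES string into bracketed primitives with a regular expression and compares the count with the limit. It rests on two assumptions this card does not confirm: that APE only merges primitives, so the primitive count is an upper bound on the token count, and that two special tokens are added per sequence. The names `split_primitives`, `fits_context`, and `NUM_SPECIAL` are hypothetical.

```python
import re

# Matches one bracketed SELFIES primitive such as [C] or [=Branch1].
_PRIMITIVE = re.compile(r"\[[^\[\]]+\]")

MAX_POSITIONS = 128  # max_position_embeddings from the configuration table
NUM_SPECIAL = 2      # assumption: one start and one end token per sequence


def split_primitives(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed primitives."""
    parts = _PRIMITIVE.findall(selfies)
    # Reject anything with characters outside brackets, e.g. a raw SMILES
    # string. Dot-separated multi-fragment SELFIES are also rejected here.
    if "".join(parts) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return parts


def fits_context(selfies: str, limit: int = MAX_POSITIONS) -> bool:
    """Conservative check: primitives plus special tokens within the limit."""
    return len(split_primitives(selfies)) + NUM_SPECIAL <= limit


aspirin = (
    "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C]"
    "[Ring1][=Branch1][C][=Branch1][C][=O][O]"
)
print(len(split_primitives(aspirin)), fits_context(aspirin))
```

Because APE merges are ignored, the check can only overestimate length: a string it accepts should fit under the stated assumptions, while one it rejects may still fit after tokenization.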
## How to Get Started with the Model
The model consumes **SELFIES** strings tokenized with the APE tokenizer. The main output for molecular representation learning is the first-token embedding:
```python
# pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

repo = 'HauserGroup/ModernMolBERT-base'
model = AutoModel.from_pretrained(repo).eval()

# The APE tokenizer is custom code, loaded from the `ape_tokenizer/` subfolder.
tokenizer = AutoTokenizer.from_pretrained(
    repo,
    subfolder='ape_tokenizer',
    trust_remote_code=True,
    use_fast=False,
)

# A SELFIES string (one bracketed token per primitive); here aspirin.
selfies = '[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]'
inputs = tokenizer(selfies, return_tensors='pt')

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# First-token embedding: one vector per input molecule.
embedding = outputs.last_hidden_state[:, 0]
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

print(f"Token IDs:\n{inputs['input_ids'][0].tolist()}\n")
print(f"Tokens:\n{tokens}\n")
print(f"Embedding shape: {tuple(embedding.shape)}")
```
If you start from SMILES, convert it to SELFIES first, for example with the [`selfies`](https://github.com/aspuru-guzik-group/selfies) package: `selfies.encoder("CC(=O)Oc1ccccc1C(=O)O")`.
For masked-token predictions, load the same checkpoint with `AutoModelForMaskedLM`:
```python
import torch
from transformers import AutoModelForMaskedLM

# Reuses `repo` and `inputs` from the embedding example.
mlm = AutoModelForMaskedLM.from_pretrained(repo).eval()
with torch.no_grad():
    logits = mlm(**inputs).logits
print(f"Logits shape: {tuple(logits.shape)}")
```
> Current Transformers releases disable custom root tokenizers for `model_type='modernbert'` before loading `auto_map`, so the tokenizer must be loaded from `ape_tokenizer/`. The root tokenizer files are also shipped for forward compatibility.
## Training
| field | value |
|-------|-------|
## Uses
- **Direct use:** molecular embeddings for property prediction, similarity search, clustering, and retrieval; masked-token fill-in.
- **Downstream use:** fine-tuning for molecular classification or regression on SELFIES inputs.
- **Out of scope:** natural-language text; generating valid SMILES; 3D/conformer-dependent tasks.
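
The similarity-search use listed above can be sketched as a cosine-similarity ranking over embedding vectors. This is an illustrative, standard-library-only sketch: the helper names (`cosine`, `rank_by_similarity`) are hypothetical, and the short vectors are placeholders rather than real model outputs. In practice each vector would be a first-token embedding with `hidden_size` (768) entries, as produced by the quick-start code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = {name: cosine(query, vec) for name, vec in library.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder 4-dimensional vectors standing in for molecule embeddings.
library = {
    "mol_a": [0.9, 0.1, 0.0, 0.2],
    "mol_b": [0.1, 0.8, 0.3, 0.0],
    "mol_c": [0.8, 0.2, 0.1, 0.3],
}
query = [1.0, 0.0, 0.0, 0.2]

ranking = rank_by_similarity(query, library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

For large libraries, a vectorized implementation or an approximate nearest-neighbor index would be the more practical choice; the loop above is only meant to show the ranking step.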
## Bias, Risks, and Limitations
The model was pre-trained only on drug-like chemistry from ChEMBL 36, so it may not generalize to natural products, agrochemicals, fragments, or other under-represented regions of chemical space. Performance depends on the downstream task and the adaptation strategy. The model has no access to 3D or conformer information.