---
license: mit
library_name: transformers
pipeline_tag: fill-mask
tags:
- chemistry
- molecules
- selfies
- ape-tokenizer
- modernbert
- masked-language-modeling
---
# HauserGroup/ModernMolBERT-base

ModernMolBERT is a compact ModernBERT encoder pre-trained from scratch with masked language modeling on ~2.4M SELFIES strings from ChEMBL 36, using a chemically aware Atom Pair Encoding (APE) tokenizer. It expects SELFIES input and produces general-purpose molecular embeddings.

## Model Details

- **Developed by:** Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
- **Model type:** ModernBERT encoder, used as a molecular embedding model and trained with masked language modeling
- **Input representation:** SELFIES (convert SMILES first; see below)
- **Tokenizer:** Atom Pair Encoding (APE) over SELFIES primitives
- **Pre-training data:** ChEMBL 36 (~2.4M unique small molecules)
- **License:** MIT
- **Repository:** https://github.com/HauserGroup/ModernMolBERT

| field | value |
|-------|-------|
| model_type | modernbert |
| vocab_size | 631 |
| hidden_size | 768 |
| num_hidden_layers | 12 |
| num_attention_heads | 12 |
| intermediate_size | 3072 |
| max_position_embeddings | 128 |
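
Two quantities follow directly from the configuration above: the per-head dimension and the size of the token embedding matrix. The short calculation below is only arithmetic on the listed values; the `config` dictionary is a hand-copied stand-in for illustration and is not loaded from the checkpoint.

```python
# Values copied by hand from the configuration table above (illustrative stand-in).
config = {
    "vocab_size": 631,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 128,
}

# Each attention head works on an equal slice of the hidden state.
head_dim, remainder = divmod(config["hidden_size"], config["num_attention_heads"])
assert remainder == 0, "hidden_size must be divisible by num_attention_heads"

# The token embedding matrix has one hidden_size-wide row per vocabulary entry.
embedding_params = config["vocab_size"] * config["hidden_size"]

print(f"head_dim = {head_dim}")
print(f"token embedding parameters = {embedding_params:,}")
```

The small vocabulary keeps the embedding matrix tiny compared with a natural-language encoder of the same width.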
## How to Get Started with the Model

The model consumes **SELFIES** strings tokenized with the APE tokenizer. The main output for molecular representation learning is the first-token embedding:

```python
# pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

repo = 'HauserGroup/ModernMolBERT-base'
model = AutoModel.from_pretrained(repo).eval()

# Load the APE tokenizer from the `ape_tokenizer/` subfolder of the repo.
tokenizer = AutoTokenizer.from_pretrained(
    repo,
    subfolder='ape_tokenizer',
    trust_remote_code=True,
    use_fast=False,
)

# A SELFIES string (one bracketed token per primitive); here aspirin.
selfies = '[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]'
inputs = tokenizer(selfies, return_tensors='pt')

# Inference only, so gradients are not needed.
with torch.no_grad():
    outputs = model(**inputs)

# First-token embedding: one vector per input molecule.
embedding = outputs.last_hidden_state[:, 0]
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

print(f"Token IDs:\n{inputs['input_ids'][0].tolist()}\n")
print(f"Tokens:\n{tokens}\n")
print(f"Embedding shape: {tuple(embedding.shape)}")
```
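
Because `max_position_embeddings` is 128, it can help to check how long a molecule is before tokenizing it. The sketch below splits a SELFIES string into its bracketed primitives with a regular expression and counts them. It does not reproduce the APE tokenizer: APE presumably merges primitives into larger tokens, so the primitive count is only a rough upper estimate of the token count. The helper names `split_selfies` and `fits_context`, and the `reserved` margin of 2 for special tokens, are assumptions made for illustration.

```python
import re

# One bracketed symbol such as [C], [=O] or [Ring1] per SELFIES primitive.
_PRIMITIVE = re.compile(r"\[[^\[\]]+\]")

MAX_POSITIONS = 128  # max_position_embeddings from the model configuration


def split_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into primitives; '.' separates fragments."""
    primitives = _PRIMITIVE.findall(selfies)
    # Reject input with characters outside brackets (for example raw SMILES).
    if "".join(primitives) != selfies.replace(".", ""):
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return primitives


def fits_context(selfies: str, reserved: int = 2) -> bool:
    """Rough pre-check: primitive count plus an assumed special-token margin."""
    return len(split_selfies(selfies)) + reserved <= MAX_POSITIONS


aspirin = (
    "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C]"
    "[Ring1][=Branch1][C][=Branch1][C][=O][O]"
)
print(len(split_selfies(aspirin)), fits_context(aspirin))
```

The authoritative length is whatever the tokenizer itself returns in `inputs['input_ids']`; this check is only a cheap filter to run over a large dataset beforehand.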
If you start from SMILES, convert it to SELFIES first, e.g. with the [`selfies`](https://github.com/aspuru-guzik-group/selfies) package: `selfies.encoder("CC(=O)Oc1ccccc1C(=O)O")`.

For masked-token predictions, load the same checkpoint with `AutoModelForMaskedLM`:
```python
from transformers import AutoModelForMaskedLM

# Reuses `torch`, `repo`, and `inputs` from the example above.
mlm = AutoModelForMaskedLM.from_pretrained(repo).eval()
with torch.no_grad():
    logits = mlm(**inputs).logits
print(f"Logits shape: {tuple(logits.shape)}")
```
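
A prediction for a masked position comes from the row of vocabulary scores at that position in the logits. As a minimal, framework-free sketch of that last step, the helper below applies a softmax to one row of scores and returns the `k` most probable tokens. The function name `top_k_tokens`, the toy four-entry vocabulary, and the scores are all invented for illustration; the real vocabulary has 631 entries.

```python
import math


def top_k_tokens(scores, id_to_token, k=3):
    """Return the k most probable (token, probability) pairs for one position."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort token ids by descending probability and keep the first k.
    best = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [(id_to_token[i], probs[i]) for i in best]


# Toy vocabulary and scores, invented for this example.
toy_vocab = ["[C]", "[=O]", "[O]", "[N]"]
toy_scores = [2.0, 0.5, 1.0, -1.0]

for token, prob in top_k_tokens(toy_scores, toy_vocab, k=2):
    print(f"{token}\t{prob:.3f}")
```

With the real model, `scores` would be something like `logits[0, i].tolist()` for a masked position `i`, and the ids can be mapped back with `tokenizer.convert_ids_to_tokens`. On tensors, `torch.softmax` and `torch.topk` do the same work without Python loops.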
> Current Transformers releases disable custom root tokenizers for `model_type='modernbert'` before loading `auto_map`, so the tokenizer must be loaded from `ape_tokenizer/`. The root tokenizer files are also shipped for forward compatibility.

## Training

| field | value |
|-------|-------|
## Uses

- **Direct use:** molecular embeddings for property prediction, similarity search, clustering, and retrieval; masked-token fill-in.
- **Downstream use:** fine-tuning for molecular classification or regression on SELFIES inputs.
- **Out of scope:** natural-language text; generating valid SMILES; 3D/conformer-dependent tasks.
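
For similarity search and retrieval, embeddings are commonly compared with cosine similarity. The sketch below ranks a few candidate vectors against a query using only the standard library. The three-dimensional vectors and the `mol_a`/`mol_b`/`mol_c` labels are placeholders invented for the example, not model outputs; real embeddings from this model have 768 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Placeholder vectors, not real embeddings.
query = [1.0, 0.0, 1.0]
library = {
    "mol_a": [2.0, 0.0, 2.0],    # same direction as the query
    "mol_b": [1.0, 1.0, 0.0],
    "mol_c": [-1.0, 0.0, -1.0],  # opposite direction
}

for name, score in rank_by_similarity(query, library):
    print(f"{name}: {score:+.3f}")
```

In practice the vectors would be first-token embeddings such as `outputs.last_hidden_state[:, 0]`, and a vectorized routine like `torch.nn.functional.cosine_similarity` would replace the Python loops.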
## Bias, Risks, and Limitations

The model was pre-trained only on drug-like chemistry from ChEMBL 36 and may not generalize to natural products, agrochemicals, fragments, or other under-represented chemical space. Performance depends on the downstream task and adaptation strategy. The model has no access to 3D or conformer information.