---
license: mit
library_name: transformers
pipeline_tag: fill-mask
tags:
- chemistry
- molecules
- selfies
- ape-tokenizer
- modernbert
- masked-language-modeling
---

# HauserGroup/ModernMolBERT-small

ModernMolBERT is a compact ModernBERT encoder pre-trained from scratch with masked language modeling on ~2.4M SELFIES strings from ChEMBL 36, using a chemically aware Atom Pair Encoding (APE) tokenizer. It expects SELFIES input and produces general-purpose molecular embeddings.

## Model Details

- **Developed by:** Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
- **Model type:** ModernBERT encoder (molecular embedding model trained with masked language modeling)
- **Input representation:** SELFIES (convert SMILES first; see below)
- **Tokenizer:** Atom Pair Encoding (APE) over SELFIES primitives
- **Pre-training data:** ChEMBL 36 (~2.4M unique small molecules)
- **License:** MIT
- **Repository:** https://github.com/HauserGroup/ModernMolBERT

| Field | Value |
|-------|-------|
| model_type | modernbert |
| vocab_size | 631 |
| hidden_size | 512 |
| num_hidden_layers | 8 |
| num_attention_heads | 8 |
| intermediate_size | 2048 |
| max_position_embeddings | 128 |

## How to Get Started with the Model

The model consumes **SELFIES** strings tokenized with the APE tokenizer. The main output for molecular representation learning is the first-token embedding:

```python
# pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

repo = 'HauserGroup/ModernMolBERT-small'
model = AutoModel.from_pretrained(repo).eval()

# The APE tokenizer is loaded from the `ape_tokenizer/` subfolder (see the note below).
tokenizer = AutoTokenizer.from_pretrained(
    repo,
    subfolder='ape_tokenizer',
    trust_remote_code=True,
    use_fast=False,
)

# A SELFIES string (one bracketed token per primitive); here psilocybin.
selfies = '[C][N][Branch1][C][C][C][C][C][=C][NH1][C][=C][C][=C][C][Branch1][#Branch2][O][P][=Branch1][C][=O][Branch1][C][O][O][=C][Ring1][=C][Ring1][O]'
inputs = tokenizer(selfies, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# First-token embedding, shape (batch, hidden_size).
embedding = outputs.last_hidden_state[:, 0]
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
embedding_preview = [round(x, 4) for x in embedding[0, :5].tolist()]

print(f"Token IDs:\n{inputs['input_ids'][0].tolist()}\n")
print(f"Tokens:\n{tokens}\n")
print(f"Embedding shape: {tuple(embedding.shape)}")
print(f"Embedding first 5 values:\n{embedding_preview}")
```

Output:

```text
Token IDs:
[0, 352, 336, 334, 334, 7, 406, 388, 388, 392, 489, 335, 18, 336, 426, 482, 482, 6, 2]

Tokens:
['', '[C][N]', '[Branch1][C]', '[C][C]', '[C][C]', '[=C]', '[NH1][C]', '[=C][C]', '[=C][C]', '[Branch1][#Branch2]', '[O][P]', '[=Branch1][C]', '[=O]', '[Branch1][C]', '[O][O]', '[=C][Ring1]', '[=C][Ring1]', '[O]', '']

Embedding shape: (1, 512)
Embedding first 5 values:
[-0.1029, 0.2197, -0.0518, -0.7983, -0.6783]
```

If you start from SMILES, convert it to SELFIES first, e.g. with the [`selfies`](https://github.com/aspuru-guzik-group/selfies) package: `selfies.encoder("CC(=O)Oc1ccccc1C(=O)O")`.

For masked-token predictions, load the same checkpoint with `AutoModelForMaskedLM`:

```python
from transformers import AutoModelForMaskedLM

# Reuses `torch`, `repo`, and `inputs` from the snippet above.
mlm = AutoModelForMaskedLM.from_pretrained(repo).eval()
with torch.no_grad():
    logits = mlm(**inputs).logits  # (batch, sequence_length, vocab_size)
print(f"Logits shape: {tuple(logits.shape)}")
```

Output:

```text
Logits shape: (1, 19, 631)
```

> Current Transformers releases disable custom root tokenizers for `model_type='modernbert'` before loading `auto_map`, so the tokenizer must be loaded from `ape_tokenizer/`. The root tokenizer files are also shipped for forward compatibility.

## Uses

- **Direct use:** molecular embeddings for property prediction, similarity search, clustering, and retrieval; masked-token fill-in.
- **Downstream use:** fine-tuning for molecular classification or regression on SELFIES inputs.
- **Out of scope:** natural-language text; generating valid SMILES; 3D/conformer-dependent tasks.

## Bias, Risks, and Limitations

The model was pre-trained only on drug-like chemistry from ChEMBL 36 and may not generalize to natural products, agrochemicals, fragments, or other under-represented regions of chemical space. Performance depends on the downstream task and adaptation strategy. The model has no access to 3D or conformer information.
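
## Example: Ranking Molecules by Embedding Similarity

The similarity-search use listed under Uses could look roughly like the sketch below. It is not taken from the ModernMolBERT repository: the helper names `cosine_similarity` and `rank_by_similarity`, the molecule labels, and the toy 4-dimensional vectors are illustrative assumptions that stand in for the 512-dimensional first-token embeddings the model produces. Cosine similarity is one common way to compare embeddings; this card does not prescribe a metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, library):
    """Return (name, score) pairs sorted from most to least similar to `query`."""
    scores = {name: cosine_similarity(query, vec) for name, vec in library.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy vectors; in practice each entry would be
# `outputs.last_hidden_state[:, 0][0].tolist()` for one molecule.
library = {
    "mol_a": [1.0, 0.0, 0.0, 0.0],
    "mol_b": [0.9, 0.1, 0.0, 0.0],
    "mol_c": [0.0, 1.0, 0.0, 0.0],
}
query = [1.0, 0.0, 0.0, 0.0]

ranking = rank_by_similarity(query, library)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

When the embeddings are kept as PyTorch tensors, `torch.nn.functional.cosine_similarity` computes the same quantity without converting to Python lists.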