	---
	license: mit
	library_name: transformers
	pipeline_tag: fill-mask
	tags:
	- chemistry
	- molecules
	- selfies
	- ape-tokenizer
	- modernbert
	- masked-language-modeling
	---

	# HauserGroup/ModernMolBERT-base

	ModernMolBERT is a compact ModernBERT encoder pre-trained from scratch with masked language modeling on ~2.4M SELFIES strings from ChEMBL 36, using a chemically aware Atom Pair Encoding (APE) tokenizer. It expects SELFIES input and produces general-purpose molecular embeddings.

	## Model Details

	- Developed by: Hauser Group, Department of Drug Design and Pharmacology, University of Copenhagen
	- Model type: ModernBERT encoder (molecular embedding model trained with masked language modeling)
	- Input representation: SELFIES (convert SMILES first; see below)
	- Tokenizer: Atom Pair Encoding (APE) over SELFIES primitives
	- Pre-training data: ChEMBL 36 (~2.4M unique small molecules)
	- License: MIT
	- Repository: https://github.com/HauserGroup/ModernMolBERT

	| field | value |
	|-------|-------|
	| model_type | modernbert |
	| vocab_size | 631 |
	| hidden_size | 768 |
	| num_hidden_layers | 12 |
	| num_attention_heads | 12 |
	| intermediate_size | 3072 |
	| max_position_embeddings | 128 |
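With `max_position_embeddings` set to 128, large molecules may not fit in the context window. The sketch below is one conservative way to pre-screen inputs without loading the tokenizer. It rests on two assumptions that this card does not state: that APE tokens are built by merging adjacent SELFIES primitives (so the primitive count is an upper bound on the token count), and that the tokenizer adds two special tokens per sequence. `split_selfies` and `fits_context` are hypothetical helper names, not part of the repository.

```python
import re

MAX_POSITIONS = 128  # max_position_embeddings from the config table
SPECIAL_TOKENS = 2   # assumption: one start and one end token are added


def split_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed primitives."""
    primitives = re.findall(r"\[[^\[\]]+\]", selfies)
    # Reject input that contains anything outside bracketed primitives.
    if "".join(primitives) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return primitives


def fits_context(selfies: str, max_positions: int = MAX_POSITIONS) -> bool:
    """Conservative check: primitives plus special tokens fit the window."""
    return len(split_selfies(selfies)) + SPECIAL_TOKENS <= max_positions


aspirin = (
    "[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C]"
    "[Ring1][=Branch1][C][=Branch1][C][=O][O]"
)
print(len(split_selfies(aspirin)))  # number of SELFIES primitives
print(fits_context(aspirin))
```

Because merging can only shorten the sequence under the first assumption, a molecule that passes this check should also fit after APE tokenization; a molecule that fails it may still fit, so the authoritative check is the length of `input_ids` returned by the tokenizer.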

	## How to Get Started with the Model

	The model consumes SELFIES strings tokenized with the APE tokenizer. The main output for molecular representation learning is the first-token embedding:

	```python
	# pip install transformers torch
	import torch
	from transformers import AutoModel, AutoTokenizer

	repo = 'HauserGroup/ModernMolBERT-base'
	model = AutoModel.from_pretrained(repo).eval()
	tokenizer = AutoTokenizer.from_pretrained(
	    repo,
	    subfolder='ape_tokenizer',
	    trust_remote_code=True,
	    use_fast=False,
	)

	# A SELFIES string (one bracketed token per primitive); here aspirin.
	selfies = '[C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]'

	inputs = tokenizer(selfies, return_tensors='pt')
	with torch.no_grad():
	    outputs = model(**inputs)
	# First-token embedding, shape (batch_size, hidden_size)
	embedding = outputs.last_hidden_state[:, 0]

	tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
	print(f"Token IDs:\n{inputs['input_ids'][0].tolist()}\n")
	print(f"Tokens:\n{tokens}\n")
	print(f"Embedding shape: {tuple(embedding.shape)}")
	```

	If you start from SMILES, convert it to SELFIES first, for example with the [`selfies`](https://github.com/aspuru-guzik-group/selfies) package: `selfies.encoder("CC(=O)Oc1ccccc1C(=O)O")`.

	For masked-token predictions, load the same checkpoint with `AutoModelForMaskedLM`:

	```python
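	import torch
	from transformers import AutoModelForMaskedLM

	# Reuses `repo` and `inputs` from the embedding example above.
	mlm = AutoModelForMaskedLM.from_pretrained(repo).eval()
	with torch.no_grad():
	    logits = mlm(**inputs).logits
	print(f"Logits shape: {tuple(logits.shape)}")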
	```

	> Current Transformers releases disable custom root tokenizers for `model_type='modernbert'` before loading `auto_map`, so the tokenizer must be loaded from `ape_tokenizer/`. The root tokenizer files are also shipped for forward compatibility.

	## Training

	| field | value |
	|-------|-------|


	## Uses

	- Direct use: molecular embeddings for property prediction, similarity search, clustering, and retrieval; masked-token fill-in.
	- Downstream use: fine-tuning for molecular classification or regression on SELFIES inputs.
	- Out of scope: natural-language text; generating valid SMILES; 3D/conformer-dependent tasks.
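As an illustration of the similarity-search use case, the sketch below ranks a small library of vectors against a query by cosine similarity. Cosine similarity is an assumed choice here; this card does not prescribe a metric. The toy 3-dimensional vectors stand in for the 768-dimensional first-token embeddings, and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: list[float], library: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy vectors; real ones would come from the model's first-token embedding.
query = [1.0, 0.0, 1.0]
library = {
    "mol_a": [2.0, 0.0, 2.0],  # same direction as the query
    "mol_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "mol_c": [1.0, 1.0, 0.0],
}
for name, score in rank_by_similarity(query, library):
    print(f"{name}: {score:.3f}")
```

To use real embeddings, each vector could be obtained as `embedding[0].tolist()` from the snippet in "How to Get Started with the Model". For large libraries, a batched tensor implementation would be more practical than this list-based version.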

	## Bias, Risks, and Limitations

	The model was pre-trained only on drug-like chemistry from ChEMBL 36 and may not generalize to natural products, agrochemicals, fragments, or other under-represented regions of chemical space. Performance depends on the downstream task and adaptation strategy. The model has no access to 3D or conformer information.