---
license: apache-2.0
library_name: transformers
tags:
- modernbert
- ModChemBERT
- cheminformatics
- chemical-language-model
pipeline_tag: fill-mask
---

# ModChemBERT: ModernBERT as a Chemical Language Model

ModChemBERT-IR-BASE is a ModernBERT-based chemical language model (CLM) pretrained on SMILES strings using masked language modeling (MLM). This model serves as a base model for training embedding, retrieval, and reranking models for molecular information retrieval tasks.

## Usage

Install `transformers` v4.56.1 or later (the pin below also excludes v5):

```bash
pip install -U "transformers>=4.56.1,<5.0.0"
```

### Load Model

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Derify/ModChemBERT-IR-BASE"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # allow model code shipped with the checkpoint to run
    dtype="bfloat16",
    device_map="auto",  # requires the `accelerate` package
)
```

### Fill-Mask Pipeline

```python
from transformers import pipeline

# Reuses `model` and `tokenizer` from the Load Model block above.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("c1ccccc1[MASK]"))
```

## Architecture

- Backbone: ModernBERT [1]
- Hidden size: 1024
- Intermediate size: 1536
- Encoder layers: 22
- Attention heads: 16
- Max sequence length: 512 tokens
- Tokenizer: BPE tokenizer using [MolFormer's vocab](https://github.com/emapco/ModChemBERT/blob/main/modchembert/tokenizers/molformer/vocab.json) (2362 tokens)
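
As a quick sanity check on the figures above, the following sketch derives two quantities that the list implies but does not state: the per-head dimension and the size of the token-embedding matrix. The dictionary keys are illustrative names chosen here, not necessarily the checkpoint's config fields, and the embedding count assumes exactly one row per vocabulary token (no padding rows).

```python
# Hyperparameters copied from the Architecture list; key names are illustrative.
arch = {
    "hidden_size": 1024,
    "intermediate_size": 1536,
    "num_layers": 22,
    "num_attention_heads": 16,
    "max_seq_len": 512,
    "vocab_size": 2362,
}

# Multi-head attention splits the hidden dimension evenly across heads.
assert arch["hidden_size"] % arch["num_attention_heads"] == 0
head_dim = arch["hidden_size"] // arch["num_attention_heads"]

# Assumption: one embedding vector of width hidden_size per vocabulary entry.
embedding_params = arch["vocab_size"] * arch["hidden_size"]

print(f"head_dim={head_dim}, embedding_params={embedding_params:,}")
# prints: head_dim=64, embedding_params=2,418,688
```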

## Dataset

- Pretraining: [PubChem 110M dataset (canonical SMILES strings)](https://ibm.ent.box.com/v/MoLFormer-data)

## Pooling (Classifier / Regressor Head)

Kallergis et al. [2] showed that, among the hyperparameters they evaluated, the method used to derive the CLM embedding fed to the prediction head contributed most to downstream performance.

Behrendt et al. [3] noted that the last few layers contain task-specific information and that pooling methods leveraging information from multiple layers can enhance model performance. Their results further demonstrated that the `max_seq_mha` pooling method was particularly effective in low-data regimes.

This base model includes configurable pooling strategies for downstream fine-tuning. When it is fine-tuned for embedding, retrieval, or reranking tasks (e.g., with Sentence Transformers), the following pooling methods can be explored:

- `cls`: Last layer [CLS]
- `mean`: Mean over last hidden layer
- `max_cls`: Max over last k layers of [CLS]
- `cls_mha`: MHA with [CLS] as query
- `max_seq_mha`: MHA with max pooled sequence as KV and max pooled [CLS] as query
- `mean_seq_mha`: MHA with mean pooled sequence as KV and mean pooled [CLS] as query
- `sum_mean`: Sum over all layers then mean tokens
- `sum_sum`: Sum over all layers then sum tokens
- `mean_mean`: Mean over all layers then mean tokens
- `mean_sum`: Mean over all layers then sum tokens
- `max_seq_mean`: Max over last k layers then mean tokens
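
To make the one-line descriptions above concrete, here is a minimal pure-Python sketch of the non-attention pooling methods. It is not ModChemBERT's implementation: the function name `pool`, the default `k=3`, and the input layout (a list of layers, each a list of token vectors, with [CLS] at index 0) are assumptions made for illustration. It also ignores padding and attention masks, treats "all layers" as whatever list is passed in, and omits the three MHA variants.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Sequence[Sequence[float]]  # one layer: [tokens][hidden]


def _reduce(vectors: Sequence[Sequence[float]], op: Callable) -> Vector:
    """Apply `op` element-wise across equally sized vectors."""
    return [op(column) for column in zip(*vectors)]


def _sum(vectors): return _reduce(vectors, sum)
def _max(vectors): return _reduce(vectors, max)
def _mean(vectors): return _reduce(vectors, lambda col: sum(col) / len(col))


def pool(hidden_states: Sequence[Layer], method: str, k: int = 3) -> Vector:
    """Pool per-layer hidden states into one vector (illustrative sketch)."""
    last = hidden_states[-1]
    n_tokens = len(last)
    if method == "cls":
        return list(last[0])  # assumes [CLS] is the first token
    if method == "mean":
        return _mean(last)
    if method == "max_cls":
        return _max([layer[0] for layer in hidden_states[-k:]])
    if method in {"sum_mean", "sum_sum", "mean_mean", "mean_sum"}:
        layer_op, token_op = method.split("_")
        across_layers = _sum if layer_op == "sum" else _mean
        across_tokens = _sum if token_op == "sum" else _mean
        # First combine the layers token by token, then reduce over tokens.
        per_token = [
            across_layers([layer[t] for layer in hidden_states])
            for t in range(n_tokens)
        ]
        return across_tokens(per_token)
    if method == "max_seq_mean":
        per_token = [
            _max([layer[t] for layer in hidden_states[-k:]])
            for t in range(n_tokens)
        ]
        return _mean(per_token)
    raise ValueError(f"unsupported pooling method: {method}")


# Toy input: 2 layers, 2 tokens, hidden size 2 (values chosen by hand).
toy = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 0.0], [1.0, 8.0]],
]
print(pool(toy, "max_cls", k=2))       # [5.0, 2.0]
print(pool(toy, "max_seq_mean", k=2))  # [4.0, 5.0]
```

In practice the per-layer states would come from a forward pass that returns all hidden states, and padded positions would need to be masked out before any mean, sum, or max over tokens.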

Note: ModChemBERT's `cls_mha`, `max_seq_mha`, and `mean_seq_mha` differ from MaxPoolBERT [3]. MaxPoolBERT uses PyTorch `nn.MultiheadAttention`, whereas ModChemBERT's `ModChemBertPoolingAttention` adapts ModernBERT's `ModernBertAttention`.
On ChemBERTa-3 benchmarks, the `ModChemBertPoolingAttention` variant produced stronger validation metrics and avoided the training instabilities (sporadic zero / NaN losses and gradient norms) seen with `nn.MultiheadAttention`. Training instability with ModernBERT has been reported in the past ([discussion 1](https://huggingface.co/answerdotai/ModernBERT-base/discussions/59) and [discussion 2](https://huggingface.co/answerdotai/ModernBERT-base/discussions/63)).

## Intended Use

* Primary: Base model for training embedding, retrieval, and reranking models for chemical information retrieval tasks using frameworks such as Sentence Transformers.
* Appropriate for: Fine-tuning for semantic search of chemical compounds, molecular similarity tasks, chemical information retrieval systems, and as a foundation for building chemical embedding models.
* Not intended for: Direct molecular property prediction without fine-tuning, generating novel molecules, or production use without domain-specific validation.

## Limitations

- This is a base model pretrained only on masked language modeling; it requires fine-tuning for specific information retrieval tasks.
- Performance on out-of-domain chemical spaces may vary: very long SMILES (>512 tokens), inorganic/organometallic compounds, polymers, or charged/enumerated tautomers may not be well represented in the training corpus.
- The model reflects the chemical space distribution of PubChem and may not generalize equally well to all chemical domains.
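
The 512-token limit can be screened for before inference. The sketch below is one rough way to do so, under stated assumptions: `approx_tokenize` uses an atom-level SMILES regular expression that is common in the literature and only approximates the checkpoint's tokenizer, and the names `fits_context` and `reserved` (a guess of two slots for special tokens) are hypothetical.

```python
import re
from typing import Callable, List

# Atom-level SMILES pattern; an approximation, not the checkpoint's tokenizer.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/"
    r"|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)


def approx_tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens (approximate)."""
    return SMILES_PATTERN.findall(smiles)


def fits_context(
    smiles: str,
    tokenize: Callable[[str], List[str]] = approx_tokenize,
    max_len: int = 512,
    reserved: int = 2,  # assumption: room for two special tokens
) -> bool:
    """Return True if the tokenized SMILES plus reserved slots fits max_len."""
    return len(tokenize(smiles)) + reserved <= max_len


print(approx_tokenize("[Na+].[Cl-]"))  # ['[Na+]', '.', '[Cl-]']
print(fits_context("c1ccccc1"))        # True
```

With the real tokenizer loaded, passing `tokenize=tokenizer.tokenize` should give counts closer to what the model actually sees than the regular expression does.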

## Ethical Considerations & Responsible Use

- This base model is intended for research and development purposes in chemical information retrieval.
- When fine-tuned for downstream applications, users should validate performance on their specific domain and use case.
- Do not deploy in clinical, regulatory, or safety-critical settings without rigorous domain-specific validation and appropriate oversight.

## Hardware

Training was performed on two NVIDIA RTX 3090 GPUs using `accelerate` for distributed (DDP) training.

## Citation

If you use ModChemBERT-IR-BASE in your research, please cite the checkpoint and the following:

```
@software{cortes-2025-modchembert,
  author = {Emmanuel Cortes},
  title = {ModChemBERT: ModernBERT as a Chemical Language Model},
  year = {2025},
  publisher = {GitHub},
  howpublished = {GitHub repository},
  url = {https://github.com/emapco/ModChemBERT}
}
```

## References

1. Warner, Benjamin, et al. "Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference." arXiv preprint arXiv:2412.13663 (2024).
2. Kallergis, G., Asgari, E., Empting, M. et al. Domain adaptable language modeling of chemical compounds identifies potent pathoblockers for Pseudomonas aeruginosa. Commun Chem 8, 114 (2025). https://doi.org/10.1038/s42004-025-01484-4
3. Behrendt, Maike, Stefan Sylvius Wagner, and Stefan Harmeling. "MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation." arXiv preprint arXiv:2505.15696 (2025).