---
license: apache-2.0
datasets:
- Derify/augmented_canonical_druglike_QED_Pfizer_15M
metrics:
- roc_auc
- rmse
library_name: transformers
tags:
- modernbert
- ModChemBERT
- cheminformatics
- chemical-language-model
- molecular-property-prediction
pipeline_tag: fill-mask
model-index:
- name: Derify/ModChemBERT-MLM
  results:
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: BACE
      type: BACE
    metrics:
    - type: roc_auc
      value: 0.8065
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: BBBP
      type: BBBP
    metrics:
    - type: roc_auc
      value: 0.7222
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: CLINTOX
      type: CLINTOX
    metrics:
    - type: roc_auc
      value: 0.9709
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: HIV
      type: HIV
    metrics:
    - type: roc_auc
      value: 0.7800
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: SIDER
      type: SIDER
    metrics:
    - type: roc_auc
      value: 0.6419
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: TOX21
      type: TOX21
    metrics:
    - type: roc_auc
      value: 0.7400
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: BACE
      type: BACE
    metrics:
    - type: rmse
      value: 1.0893
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: CLEARANCE
      type: CLEARANCE
    metrics:
    - type: rmse
      value: 49.0005
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: ESOL
      type: ESOL
    metrics:
    - type: rmse
      value: 0.8456
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: FREESOLV
      type: FREESOLV
    metrics:
    - type: rmse
      value: 0.5491
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: LIPO
      type: LIPO
    metrics:
    - type: rmse
      value: 0.7147
---

# ModChemBERT: ModernBERT as a Chemical Language Model
ModChemBERT is a ModernBERT-based chemical language model (CLM), trained on SMILES strings for masked language modeling (MLM) and downstream molecular property prediction (classification & regression).

## Usage
### Load Model
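```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Derify/ModChemBERT-MLM"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # allow custom model code from the Hub repository to run
    dtype="float16",         # load the weights in half precision
    device_map="auto",       # place the model on available devices (requires `accelerate`)
)
```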

### Fill-Mask Pipeline
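```python
from transformers import pipeline

# Reuses `model` and `tokenizer` from the Load Model snippet above.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# Predict the masked token that follows a benzene ring.
print(fill("c1ccccc1[MASK]"))
```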

## Intended Use
- Primary: Research and development for molecular property prediction, experimentation with pooling strategies, and as a foundational model for downstream applications.
- Appropriate for: Binary / multi-class classification (e.g., toxicity, activity) and single-task or multi-task regression (e.g., solubility, clearance) after fine-tuning.
- Not intended for generating novel molecules.

## Limitations
- Out-of-domain performance may degrade for inputs that are not well represented in the training data: very long (>128 token) SMILES, inorganic / organometallic compounds, polymers, and charged / enumerated tautomers.
- No guarantee of synthesizability, safety, or biological efficacy.
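
As a rough pre-check for the length limitation above, inputs can be screened before inference. The sketch below is only an approximation: it counts atom-level tokens with a regex commonly used for SMILES in the literature, not with the model's BPE tokenizer, so the counts may differ from what the model sees. For real use, counting `len(tokenizer(smiles)["input_ids"])` would be more faithful. The helper names `count_tokens` and `split_by_length` are hypothetical.

```python
import re

# Atom-level SMILES pattern used as a stand-in for the model's BPE tokenizer.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)


def count_tokens(smiles: str) -> int:
    """Approximate token count of a SMILES string (atom-level, not BPE)."""
    return len(SMILES_TOKEN.findall(smiles))


def split_by_length(smiles_list, max_tokens=128):
    """Partition SMILES into (within_limit, too_long) by approximate length."""
    within, too_long = [], []
    for smi in smiles_list:
        (within if count_tokens(smi) <= max_tokens else too_long).append(smi)
    return within, too_long


ok, flagged = split_by_length(["c1ccccc1", "CC(=O)O", "C" * 300])
print(len(ok), "within limit;", len(flagged), "flagged as long")
```

The 128-token default mirrors the threshold named in the limitation; it is a heuristic cutoff here, not a hard constraint of the model.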

## Ethical Considerations & Responsible Use
- Potential biases arise from training corpora skewed to drug-like space.
- Do not deploy in clinical or regulatory settings without rigorous, domain-specific validation.

## Architecture
- Backbone: ModernBERT
- Hidden size: 768
- Intermediate size: 1152
- Encoder Layers: 22
- Attention heads: 12
- Max sequence length: 256 tokens (MLM primarily trained with 128-token sequences)
- Vocabulary: BPE tokenizer using [MolFormer's vocab](https://github.com/emapco/ModChemBERT/blob/main/modchembert/tokenizers/molformer/vocab.json) (2362 tokens)

## Pooling (Classifier / Regressor Head)
Kallergis et al. [1] demonstrated that the CLM embedding method prior to the prediction head can significantly impact downstream performance.

Behrendt et al. [2] noted that the last few layers contain task-specific information and that pooling methods leveraging information from multiple layers can enhance model performance. Their results further demonstrated that the `max_seq_mha` pooling method was particularly effective in low-data regimes, which is often the case for molecular property prediction tasks.

ModChemBERT supports multiple pooling strategies so that their impact on downstream performance can be explored:
- `cls`: Last layer [CLS]
- `mean`: Mean over last hidden layer
- `max_cls`: Max over last k layers of [CLS]
- `cls_mha`: MHA with [CLS] as query
- `max_seq_mha`: MHA with max pooled sequence as KV and max pooled [CLS] as query
- `sum_mean`: Sum over all layers then mean tokens
- `sum_sum`: Sum over all layers then sum tokens
- `mean_mean`: Mean over all layers then mean tokens
- `mean_sum`: Mean over all layers then sum tokens
- `max_seq_mean`: Max over last k layers then mean tokens
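
To make the non-attention strategies above concrete, here is a minimal NumPy sketch written from the one-line descriptions in the list. It is an interpretation, not the repository's implementation: it assumes `[CLS]` sits at position 0, that hidden states are stacked as `(layers, batch, seq, hidden)`, and that token means should ignore padding via an attention mask. The MHA variants are omitted because they need learned parameters. The names `pool` and `masked_mean` are hypothetical.

```python
import numpy as np


def masked_mean(x, mask):
    """Mean over the token axis, ignoring padded positions.

    x: (batch, seq, hidden); mask: (batch, seq) with 1 for real tokens.
    """
    m = mask[..., None].astype(x.dtype)  # (batch, seq, 1)
    return (x * m).sum(axis=1) / np.clip(m.sum(axis=1), 1.0, None)


def pool(hidden_states, mask, strategy, k=3):
    """Reduce (layers, batch, seq, hidden) to a (batch, hidden) embedding."""
    last = hidden_states[-1]
    last_k = hidden_states[-k:]
    if strategy == "cls":           # last-layer [CLS] vector
        return last[:, 0]
    if strategy == "mean":          # mean of last-layer tokens
        return masked_mean(last, mask)
    if strategy == "max_cls":       # element-wise max of [CLS] over last k layers
        return last_k[:, :, 0].max(axis=0)
    if strategy == "max_seq_mean":  # max over last k layers, then token mean
        return masked_mean(last_k.max(axis=0), mask)
    if strategy == "sum_mean":      # sum over all layers, then token mean
        return masked_mean(hidden_states.sum(axis=0), mask)
    if strategy == "mean_mean":     # mean over all layers, then token mean
        return masked_mean(hidden_states.mean(axis=0), mask)
    raise ValueError(f"unsupported strategy: {strategy}")


# Toy input: 4 layers, batch of 2, 5 tokens, hidden size 8 (random values).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(4, 2, 5, 8))
mask = np.array([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])  # second sequence is padded

for name in ("cls", "mean", "max_cls", "max_seq_mean", "sum_mean", "mean_mean"):
    print(name, pool(hidden_states, mask, name, k=2).shape)
```

Each strategy returns one vector per molecule, so the prediction head sees the same input shape regardless of which pooling is selected.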

## Training Pipeline
<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/656892962693fa22e18b5331/bxNbpgMkU8m60ypyEJoWQ.png" alt="ModChemBERT Training Pipeline" width="650"/>
</div>

### Rationale for MTR Stage
Following Sultan et al. [3], multi-task regression (physicochemical properties) biases the latent space toward ADME-related representations prior to narrow TAFT specialization. Sultan et al. observed that MLM + DAPT (MTR) outperforms MLM-only, MTR-only, and MTR + DAPT (MTR).

### Checkpoint Averaging Motivation
Checkpoint averaging is inspired by ModernBERT [4], JaColBERTv2.5 [5], and Llama 3.1 [6], whose results show that model merging can improve generalization or performance while mitigating overfitting to any single fine-tune or annealing checkpoint.
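
For illustration, the simplest form of model merging is a (weighted) average of parameters across checkpoints. The sketch below shows that idea on plain NumPy arrays; it is an assumption that averaging is uniform by default, since this card does not state which merging scheme or weights were used. `average_checkpoints` is a hypothetical helper, and the toy dictionaries stand in for real state dicts.

```python
import numpy as np


def average_checkpoints(state_dicts, weights=None):
    """Average parameters that share the same names across checkpoints."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    if any(sd.keys() != keys for sd in state_dicts):
        raise ValueError("checkpoints must have identical parameter names")
    if weights is None:
        # Uniform averaging: every checkpoint contributes equally.
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {
        name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
        for name in keys
    }


# Toy "checkpoints" with a single parameter each.
ckpt_a = {"head.weight": np.array([1.0, 2.0])}
ckpt_b = {"head.weight": np.array([3.0, 6.0])}
merged = average_checkpoints([ckpt_a, ckpt_b])
print(merged["head.weight"])
```

With real models the same loop would run over `state_dict()` tensors from checkpoints that share one architecture.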

## Datasets
- Pretraining: [Derify/augmented_canonical_druglike_QED_Pfizer_15M](https://huggingface.co/datasets/Derify/augmented_canonical_druglike_QED_Pfizer_15M)
- Domain Adaptive Pretraining (DAPT) & Task Adaptive Fine-tuning (TAFT): ADME + AstraZeneca datasets (10 tasks) with scaffold splits from DA4MT pipeline (see [domain-adaptation-molecular-transformers](https://github.com/emapco/ModChemBERT/tree/main/domain-adaptation-molecular-transformers))
- Benchmarking: ChemBERTa-3 [7] tasks (BACE, BBBP, TOX21, HIV, SIDER, CLINTOX for classification; ESOL, FREESOLV, LIPO, BACE, CLEARANCE for regression)

## Benchmarking
Benchmarks were conducted with the ChemBERTa-3 framework using DeepChem scaffold splits. Each task was trained for 100 epochs with 3 random seeds.

### Evaluation Methodology
- Classification Metric: ROC AUC.
- Regression Metric: RMSE.
- Aggregation: Mean ± standard deviation of the triplicate results.
- Input Constraints: SMILES truncated / filtered to ≤200 tokens, following the MolFormer paper's recommendation.
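
The aggregation step can be sketched with the standard library. The seed scores below are made-up placeholders, and the card does not say whether the sample or population standard deviation was reported, so the hypothetical `summarize` helper exposes both.

```python
from statistics import mean, pstdev, stdev


def summarize(scores, sample=True):
    """Return (mean, standard deviation) of per-seed scores."""
    spread = stdev(scores) if sample else pstdev(scores)
    return mean(scores), spread


seed_scores = [0.80, 0.81, 0.82]  # placeholder values, one per random seed
avg, sd = summarize(seed_scores)
print(f"{avg:.4f} ± {sd:.4f}")
```

With only three seeds the two conventions differ noticeably (by a factor of about 1.22), which is worth keeping in mind when comparing against numbers from other sources.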

### Results
<details><summary>Click to expand</summary>

#### Classification Datasets (ROC AUC - Higher is better)

| Model                                                                        | BACE↑             | BBBP↑             | CLINTOX↑              | HIV↑                  | SIDER↑                | TOX21↑            | AVG†   |
| ---------------------------------------------------------------------------- | ----------------- | ----------------- | --------------------- | --------------------- | --------------------- | ----------------- | ------ |
| **Tasks**                                                                    | 1                 | 1                 | 2                     | 1                     | 27                    | 12                |        |
| [ChemBERTa-100M-MLM](https://huggingface.co/DeepChem/ChemBERTa-100M-MLM)*    | 0.781 ± 0.019     | 0.700 ± 0.027     | 0.979 ± 0.022         | 0.740 ± 0.013         | 0.611 ± 0.002         | 0.718 ± 0.011     | 0.7548 |
| [c3-MoLFormer-1.1B](https://huggingface.co/DeepChem/MoLFormer-c3-1.1B)*      | 0.819 ± 0.019     | 0.735 ± 0.019     | 0.839 ± 0.013         | 0.762 ± 0.005         | 0.618 ± 0.005         | 0.723 ± 0.012     | 0.7493 |
| MoLFormer-LHPC*                                                              | **0.887 ± 0.004** | **0.908 ± 0.013** | 0.993 ± 0.004         | 0.750 ± 0.003         | 0.622 ± 0.007         | **0.791 ± 0.014** | 0.8252 |
| -------------------------                                                    | ----------------- | ----------------- | -------------------   | -------------------   | -------------------   | ----------------- | ------ |
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM)                         | 0.8065 ± 0.0103   | 0.7222 ± 0.0150   | 0.9709 ± 0.0227       | ***0.7800 ± 0.0133*** | 0.6419 ± 0.0113       | 0.7400 ± 0.0044   | 0.7769 |
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT)             | 0.8224 ± 0.0156   | 0.7402 ± 0.0095   | 0.9820 ± 0.0138       | 0.7702 ± 0.0020       | 0.6303 ± 0.0039       | 0.7360 ± 0.0036   | 0.7802 |
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT)             | 0.7924 ± 0.0155   | 0.7282 ± 0.0058   | 0.9725 ± 0.0213       | 0.7770 ± 0.0047       | 0.6542 ± 0.0128       | *0.7646 ± 0.0039* | 0.7815 |
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.8213 ± 0.0051   | 0.7356 ± 0.0094   | 0.9664 ± 0.0202       | 0.7750 ± 0.0048       | 0.6415 ± 0.0094       | 0.7263 ± 0.0036   | 0.7777 |
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT)           | *0.8346 ± 0.0045* | *0.7573 ± 0.0120* | ***0.9938 ± 0.0017*** | 0.7737 ± 0.0034       | ***0.6600 ± 0.0061*** | 0.7518 ± 0.0047   | 0.7952 |

#### Regression Datasets (RMSE - Lower is better)

| Model                                                                        | BACE↓                 | CLEARANCE↓             | ESOL↓                 | FREESOLV↓             | LIPO↓                 | AVG‡             |
| ---------------------------------------------------------------------------- | --------------------- | ---------------------- | --------------------- | --------------------- | --------------------- | ---------------- |
| **Tasks**                                                                    | 1                     | 1                      | 1                     | 1                     | 1                     |                  |
| [ChemBERTa-100M-MLM](https://huggingface.co/DeepChem/ChemBERTa-100M-MLM)*    | 1.011 ± 0.038         | 51.582 ± 3.079         | 0.920 ± 0.011         | 0.536 ± 0.016         | 0.758 ± 0.013         | 0.8063 / 10.9614 |
| [c3-MoLFormer-1.1B](https://huggingface.co/DeepChem/MoLFormer-c3-1.1B)*      | 1.094 ± 0.126         | 52.058 ± 2.767         | 0.829 ± 0.019         | 0.572 ± 0.023         | 0.728 ± 0.016         | 0.8058 / 11.0562 |
| MoLFormer-LHPC*                                                              | 1.201 ± 0.100         | 45.74 ± 2.637          | 0.848 ± 0.031         | 0.683 ± 0.040         | 0.895 ± 0.080         | 0.9068 / 9.8734  |
| -------------------------                                                    | -------------------   | --------------------   | -------------------   | -------------------   | -------------------   | ---------------- |
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM)                         | 1.0893 ± 0.1319       | 49.0005 ± 1.2787       | 0.8456 ± 0.0406       | 0.5491 ± 0.0134       | 0.7147 ± 0.0062       | 0.7997 / 10.4398 |
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT)             | 0.9931 ± 0.0258       | 45.4951 ± 0.7112       | 0.9319 ± 0.0153       | 0.6049 ± 0.0666       | 0.6874 ± 0.0040       | 0.8043 / 9.7425  |
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT)             | 1.0304 ± 0.1146       | 47.8418 ± 0.4070       | ***0.7669 ± 0.0024*** | 0.5293 ± 0.0267       | 0.6708 ± 0.0074       | 0.7493 / 10.1678 |
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.9713 ± 0.0224       | ***42.8010 ± 3.3475*** | 0.8169 ± 0.0268       | 0.5445 ± 0.0257       | 0.6820 ± 0.0028       | 0.7537 / 9.1631  |
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT)           | ***0.9665 ± 0.0250*** | 44.0137 ± 1.1110       | 0.8158 ± 0.0115       | ***0.4979 ± 0.0158*** | ***0.6505 ± 0.0126*** | 0.7327 / 9.3889  |

**Bold** indicates the best result in the column; *italic* indicates the best result among ModChemBERT checkpoints.<br/>
\* Published results from the ChemBERTa-3 [7] paper for optimized chemical language models using DeepChem scaffold splits.<br/>
† AVG column shows the mean score across all classification tasks.<br/>
‡ AVG column shows the mean RMSE across the regression tasks, first excluding and then including the CLEARANCE score.

</details>
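
As a worked check of the AVG columns, the snippet below recomputes the averages for the `MLM` row from the per-task means reported in the tables above. The dictionary names are illustrative only.

```python
# Per-task means for the MLM checkpoint, copied from the results tables.
classification = {
    "BACE": 0.8065, "BBBP": 0.7222, "CLINTOX": 0.9709,
    "HIV": 0.7800, "SIDER": 0.6419, "TOX21": 0.7400,
}
regression = {
    "BACE": 1.0893, "CLEARANCE": 49.0005, "ESOL": 0.8456,
    "FREESOLV": 0.5491, "LIPO": 0.7147,
}

# Classification AVG: plain mean over the six ROC AUC scores.
cls_avg = sum(classification.values()) / len(classification)

# Regression AVG: reported without / with CLEARANCE, whose RMSE is on a
# much larger scale than the other tasks.
no_clearance = [v for k, v in regression.items() if k != "CLEARANCE"]
reg_avg_without = sum(no_clearance) / len(no_clearance)
reg_avg_with = sum(regression.values()) / len(regression)

print(f"{cls_avg:.4f}")                               # 0.7769
print(f"{reg_avg_without:.4f} / {reg_avg_with:.4f}")  # 0.7997 / 10.4398
```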

## Optimized ModChemBERT Hyperparameters

<details><summary>Click to expand</summary>

### TAFT Datasets
Optimal parameters (per dataset) for the `MLM + DAPT + TAFT OPT` merged model:

| Dataset                | Learning Rate | Batch Size | Warmup Ratio | Classifier Pooling | Last k Layers |
| ---------------------- | ------------- | ---------- | ------------ | ------------------ | ------------- |
| adme_microsom_stab_h   | 3e-5          | 8          | 0.0          | max_seq_mean       | 5             |
| adme_microsom_stab_r   | 3e-5          | 16         | 0.2          | max_cls            | 3             |
| adme_permeability      | 3e-5          | 8          | 0.0          | max_cls            | 3             |
| adme_ppb_h             | 1e-5          | 32         | 0.1          | max_seq_mean       | 5             |
| adme_ppb_r             | 1e-5          | 32         | 0.0          | sum_mean           | N/A           |
| adme_solubility        | 3e-5          | 32         | 0.0          | sum_mean           | N/A           |
| astrazeneca_CL         | 3e-5          | 8          | 0.1          | max_seq_mha        | 3             |
| astrazeneca_LogD74     | 1e-5          | 8          | 0.0          | max_seq_mean       | 5             |
| astrazeneca_PPB        | 1e-5          | 32         | 0.0          | max_cls            | 3             |
| astrazeneca_Solubility | 1e-5          | 32         | 0.0          | max_seq_mean       | 5             |

### Benchmarking Datasets
Optimal parameters (per dataset) for the `MLM + DAPT + TAFT OPT` merged model:

| Dataset             | Batch Size | Classifier Pooling | Last k Layers | Pooling Attention Dropout | Classifier Dropout | Embedding Dropout |
| ------------------- | ---------- | ------------------ | ------------- | ------------------------- | ------------------ | ----------------- |
| bace_classification | 32         | max_seq_mha        | 3             | 0.0                       | 0.0                | 0.0               |
| bbbp                | 64         | max_cls            | 3             | 0.1                       | 0.0                | 0.0               |
| clintox             | 32         | max_seq_mha        | 5             | 0.1                       | 0.0                | 0.0               |
| hiv                 | 32         | max_seq_mha        | 3             | 0.0                       | 0.0                | 0.0               |
| sider               | 32         | mean               | N/A           | 0.1                       | 0.0                | 0.1               |
| tox21               | 32         | max_seq_mha        | 5             | 0.1                       | 0.0                | 0.0               |
| bace_regression     | 32         | max_seq_mha        | 5             | 0.1                       | 0.0                | 0.0               |
| clearance           | 32         | max_seq_mha        | 5             | 0.1                       | 0.0                | 0.0               |
| esol                | 64         | sum_mean           | N/A           | 0.1                       | 0.0                | 0.1               |
| freesolv            | 32         | max_seq_mha        | 5             | 0.1                       | 0.0                | 0.0               |
| lipo                | 32         | max_seq_mha        | 3             | 0.1                       | 0.1                | 0.1               |

</details>

## Hardware
Training and experiments were performed on 2 NVIDIA RTX 3090 GPUs.

## Citation
If you use ModChemBERT in your research, please cite the checkpoint and the following:
```bibtex
@software{cortes-2025-modchembert,
  author = {Emmanuel Cortes},
  title = {ModChemBERT: ModernBERT as a Chemical Language Model},
  year = {2025},
  publisher = {GitHub},
  howpublished = {GitHub repository},
  url = {https://github.com/emapco/ModChemBERT}
}
```

## References
1. Kallergis, Georgios, et al. "Domain adaptable language modeling of chemical compounds identifies potent pathoblockers for Pseudomonas aeruginosa." Communications Chemistry 8.1 (2025): 114.
2. Behrendt, Maike, Stefan Sylvius Wagner, and Stefan Harmeling. "MaxPoolBERT: Enhancing BERT Classification via Layer-and Token-Wise Aggregation." arXiv preprint arXiv:2505.15696 (2025).
3. Sultan, Afnan, et al. "Transformers for molecular property prediction: Domain adaptation efficiently improves performance." arXiv preprint arXiv:2503.03360 (2025).
4. Warner, Benjamin, et al. "Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference." arXiv preprint arXiv:2412.13663 (2024).
5. Clavié, Benjamin. "JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources." Journal of Natural Language Processing 32.1 (2025): 176-218.
6. Grattafiori, Aaron, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).
7. Singh, Riya, et al. "ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models." (2025).