---
license: apache-2.0
datasets:
  - Derify/augmented_canonical_druglike_QED_Pfizer_15M
metrics:
  - roc_auc
  - rmse
library_name: transformers
tags:
  - modernbert
  - ModChemBERT
  - cheminformatics
  - chemical-language-model
  - molecular-property-prediction
pipeline_tag: fill-mask
model-index:
  - name: Derify/ModChemBERT-MLM
    results:
      - task:
          type: text-classification
          name: Classification (ROC AUC)
        dataset:
          name: BACE
          type: BACE
        metrics:
          - type: roc_auc
            value: 0.8065
      - task:
          type: text-classification
          name: Classification (ROC AUC)
        dataset:
          name: BBBP
          type: BBBP
        metrics:
          - type: roc_auc
            value: 0.7222
      - task:
          type: text-classification
          name: Classification (ROC AUC)
        dataset:
          name: CLINTOX
          type: CLINTOX
        metrics:
          - type: roc_auc
            value: 0.9709
      - task:
          type: text-classification
          name: Classification (ROC AUC)
        dataset:
          name: HIV
          type: HIV
        metrics:
          - type: roc_auc
            value: 0.7800
      - task:
          type: text-classification
          name: Classification (ROC AUC)
        dataset:
          name: SIDER
          type: SIDER
        metrics:
          - type: roc_auc
            value: 0.6419
      - task:
          type: text-classification
          name: Classification (ROC AUC)
        dataset:
          name: TOX21
          type: TOX21
        metrics:
          - type: roc_auc
            value: 0.7400
      - task:
          type: regression
          name: Regression (RMSE)
        dataset:
          name: BACE
          type: BACE
        metrics:
          - type: rmse
            value: 1.0893
      - task:
          type: regression
          name: Regression (RMSE)
        dataset:
          name: CLEARANCE
          type: CLEARANCE
        metrics:
          - type: rmse
            value: 49.0005
      - task:
          type: regression
          name: Regression (RMSE)
        dataset:
          name: ESOL
          type: ESOL
        metrics:
          - type: rmse
            value: 0.8456
      - task:
          type: regression
          name: Regression (RMSE)
        dataset:
          name: FREESOLV
          type: FREESOLV
        metrics:
          - type: rmse
            value: 0.5491
      - task:
          type: regression
          name: Regression (RMSE)
        dataset:
          name: LIPO
          type: LIPO
        metrics:
          - type: rmse
            value: 0.7147
---

# ModChemBERT: ModernBERT as a Chemical Language Model

ModChemBERT is a ModernBERT-based chemical language model (CLM), trained on SMILES strings for masked language modeling (MLM) and downstream molecular property prediction (classification &
regression).

## Usage

### Load Model

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Derify/ModChemBERT-MLM"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the `accelerate` package to be installed.
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype="float16",
    device_map="auto",
)
```

### Fill-Mask Pipeline

```python
from transformers import pipeline

# Reuses the `model` and `tokenizer` objects loaded above.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("c1ccccc1[MASK]"))
```

## Intended Use

* Primary: Research and development for molecular property prediction, experimentation with pooling strategies, and as a foundational model for downstream applications.
* Appropriate for: Binary / multi-class classification (e.g., toxicity, activity) and single-task or multi-task regression (e.g., solubility, clearance) after fine-tuning.
* Not intended for generating novel molecules.

## Limitations

- Out-of-domain performance may degrade for very long (>128 token) SMILES, inorganic / organometallic compounds, polymers, and charged / enumerated tautomers, none of which are well represented in training.
- No guarantee of synthesizability, safety, or biological efficacy.

## Ethical Considerations & Responsible Use

- Potential biases arise from training corpora skewed to drug-like space.
- Do not deploy in clinical or regulatory settings without rigorous, domain-specific validation.

## Architecture

- Backbone: ModernBERT
- Hidden size: 768
- Intermediate size: 1152
- Encoder Layers: 22
- Attention heads: 12
- Max sequence length: 256 tokens (MLM primarily trained with 128-token sequences)
- Vocabulary: BPE tokenizer using [MolFormer's vocab](https://github.com/emapco/ModChemBERT/blob/main/modchembert/tokenizers/molformer/vocab.json) (2362 tokens)

## Pooling (Classifier / Regressor Head)

Kallergis et al. [1] demonstrated that the CLM embedding method prior to the prediction head can significantly impact downstream performance.
Behrendt et al.
[2] noted that the last few layers contain task-specific information and that pooling methods leveraging information from multiple layers can enhance model performance. Their results further demonstrated that the `max_seq_mha` pooling method was particularly effective in low-data regimes, which is often the case for molecular property prediction tasks.

Multiple pooling strategies are supported by ModChemBERT to explore their impact on downstream performance:

- `cls`: Last layer [CLS]
- `mean`: Mean over last hidden layer
- `max_cls`: Max over last k layers of [CLS]
- `cls_mha`: MHA with [CLS] as query
- `max_seq_mha`: MHA with max pooled sequence as KV and max pooled [CLS] as query
- `sum_mean`: Sum over all layers then mean tokens
- `sum_sum`: Sum over all layers then sum tokens
- `mean_mean`: Mean over all layers then mean tokens
- `mean_sum`: Mean over all layers then sum tokens
- `max_seq_mean`: Max over last k layers then mean tokens

## Training Pipeline
*Figure: ModChemBERT training pipeline.*

### Rationale for MTR Stage

Following Sultan et al. [3], multi-task regression (physicochemical properties) biases the latent space toward ADME-related representations prior to narrow TAFT specialization. Sultan et al. observed that MLM + DAPT (MTR) outperforms MLM-only, MTR-only, and MTR + DAPT (MTR).

### Checkpoint Averaging Motivation

Checkpoint averaging is inspired by ModernBERT [4], JaColBERTv2.5 [5], and Llama 3.1 [6], whose results show that model merging can enhance generalization or performance while mitigating overfitting to any single fine-tune or annealing checkpoint.

## Datasets

- Pretraining: [Derify/augmented_canonical_druglike_QED_Pfizer_15M](https://huggingface.co/datasets/Derify/augmented_canonical_druglike_QED_Pfizer_15M)
- Domain Adaptive Pretraining (DAPT) & Task Adaptive Fine-tuning (TAFT): ADME + AstraZeneca datasets (10 tasks) with scaffold splits from DA4MT pipeline (see [domain-adaptation-molecular-transformers](https://github.com/emapco/ModChemBERT/tree/main/domain-adaptation-molecular-transformers))
- Benchmarking: ChemBERTa-3 [7] tasks (BACE, BBBP, TOX21, HIV, SIDER, CLINTOX for classification; ESOL, FREESOLV, LIPO, BACE, CLEARANCE for regression)

## Benchmarking

Benchmarks were conducted with the ChemBERTa-3 framework using DeepChem scaffold splits. Each task was trained for 100 epochs with 3 random seeds.

### Evaluation Methodology

- Classification Metric: ROC AUC.
- Regression Metric: RMSE.
- Aggregation: Mean ± standard deviation of the triplicate results.
- Input Constraints: SMILES truncated / filtered to ≤200 tokens, following the MolFormer paper's recommendation.

### Results

#### Classification Datasets (ROC AUC - Higher is better)

| Model | BACE↑ | BBBP↑ | CLINTOX↑ | HIV↑ | SIDER↑ | TOX21↑ | AVG† |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Tasks** | 1 | 1 | 2 | 1 | 27 | 12 | |
| [ChemBERTa-100M-MLM](https://huggingface.co/DeepChem/ChemBERTa-100M-MLM)* | 0.781 ± 0.019 | 0.700 ± 0.027 | 0.979 ± 0.022 | 0.740 ± 0.013 | 0.611 ± 0.002 | 0.718 ± 0.011 | 0.7548 |
| [c3-MoLFormer-1.1B](https://huggingface.co/DeepChem/MoLFormer-c3-1.1B)* | 0.819 ± 0.019 | 0.735 ± 0.019 | 0.839 ± 0.013 | 0.762 ± 0.005 | 0.618 ± 0.005 | 0.723 ± 0.012 | 0.7493 |
| MoLFormer-LHPC* | **0.887 ± 0.004** | **0.908 ± 0.013** | 0.993 ± 0.004 | 0.750 ± 0.003 | 0.622 ± 0.007 | **0.791 ± 0.014** | 0.8252 |
| ------------------------- | ----------------- | ----------------- | ------------------- | ------------------- | ------------------- | ----------------- | ------ |
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM) | 0.8065 ± 0.0103 | 0.7222 ± 0.0150 | 0.9709 ± 0.0227 | ***0.7800 ± 0.0133*** | 0.6419 ± 0.0113 | 0.7400 ± 0.0044 | 0.7769 |
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT) | 0.8224 ± 0.0156 | 0.7402 ± 0.0095 | 0.9820 ± 0.0138 | 0.7702 ± 0.0020 | 0.6303 ± 0.0039 | 0.7360 ± 0.0036 | 0.7802 |
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT) | 0.7924 ± 0.0155 | 0.7282 ± 0.0058 | 0.9725 ± 0.0213 | 0.7770 ± 0.0047 | 0.6542 ± 0.0128 | *0.7646 ± 0.0039* | 0.7815 |
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.8213 ± 0.0051 | 0.7356 ± 0.0094 | 0.9664 ± 0.0202 | 0.7750 ± 0.0048 | 0.6415 ± 0.0094 | 0.7263 ± 0.0036 | 0.7777 |
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT) | *0.8346 ± 0.0045* | *0.7573 ± 0.0120* | ***0.9938 ± 0.0017*** | 0.7737 ± 0.0034 | ***0.6600 ± 0.0061*** | 0.7518 ± 0.0047 | 0.7952 |

#### Regression Datasets (RMSE - Lower is better)

| Model | BACE↓ | CLEARANCE↓ | ESOL↓ | FREESOLV↓ | LIPO↓ | AVG‡ |
| --- | --- | --- | --- | --- | --- | --- |
| **Tasks** | 1 | 1 | 1 | 1 | 1 | |
| [ChemBERTa-100M-MLM](https://huggingface.co/DeepChem/ChemBERTa-100M-MLM)* | 1.011 ± 0.038 | 51.582 ± 3.079 | 0.920 ± 0.011 | 0.536 ± 0.016 | 0.758 ± 0.013 | 0.8063 / 10.9614 |
| [c3-MoLFormer-1.1B](https://huggingface.co/DeepChem/MoLFormer-c3-1.1B)* | 1.094 ± 0.126 | 52.058 ± 2.767 | 0.829 ± 0.019 | 0.572 ± 0.023 | 0.728 ± 0.016 | 0.8058 / 11.0562 |
| MoLFormer-LHPC* | 1.201 ± 0.100 | 45.74 ± 2.637 | 0.848 ± 0.031 | 0.683 ± 0.040 | 0.895 ± 0.080 | 0.9068 / 9.8734 |
| ------------------------- | ------------------- | -------------------- | ------------------- | ------------------- | ------------------- | ---------------- |
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM) | 1.0893 ± 0.1319 | 49.0005 ± 1.2787 | 0.8456 ± 0.0406 | 0.5491 ± 0.0134 | 0.7147 ± 0.0062 | 0.7997 / 10.4398 |
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT) | 0.9931 ± 0.0258 | 45.4951 ± 0.7112 | 0.9319 ± 0.0153 | 0.6049 ± 0.0666 | 0.6874 ± 0.0040 | 0.8043 / 9.7425 |
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT) | 1.0304 ± 0.1146 | 47.8418 ± 0.4070 | ***0.7669 ± 0.0024*** | 0.5293 ± 0.0267 | 0.6708 ± 0.0074 | 0.7493 / 10.1678 |
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.9713 ± 0.0224 | ***42.8010 ± 3.3475*** | 0.8169 ± 0.0268 | 0.5445 ± 0.0257 | 0.6820 ± 0.0028 | 0.7537 / 9.1631 |
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT) | ***0.9665 ± 0.0250*** | 44.0137 ± 1.1110 | 0.8158 ± 0.0115 | ***0.4979 ± 0.0158*** | ***0.6505 ± 0.0126*** | 0.7327 / 9.3889 |

**Bold** indicates the best result in the column; *italic* indicates the best result among ModChemBERT checkpoints.
\* Published results from the ChemBERTa-3 [7] paper for optimized chemical language models using DeepChem scaffold splits.
† AVG column shows the mean score across all classification tasks.
‡ AVG column shows the mean scores across all regression tasks without and with the clearance score.
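
As a quick check of how the AVG columns are formed, the short sketch below recomputes them for the `MLM` checkpoint from the per-task means reported in the tables above. The dictionary and variable names are illustrative only; the numbers are copied from the `MLM` rows.

```python
from statistics import mean

# Per-task mean scores for the MLM checkpoint, copied from the tables above.
mlm_classification = {
    "BACE": 0.8065, "BBBP": 0.7222, "CLINTOX": 0.9709,
    "HIV": 0.7800, "SIDER": 0.6419, "TOX21": 0.7400,
}
mlm_regression = {
    "BACE": 1.0893, "CLEARANCE": 49.0005, "ESOL": 0.8456,
    "FREESOLV": 0.5491, "LIPO": 0.7147,
}

# AVG† : plain mean over the six classification tasks.
cls_avg = mean(mlm_classification.values())

# AVG‡ : mean over regression tasks, first without and then with CLEARANCE,
# whose RMSE is on a much larger scale than the other tasks.
reg_avg_without = mean(v for k, v in mlm_regression.items() if k != "CLEARANCE")
reg_avg_with = mean(mlm_regression.values())

print(f"{cls_avg:.4f}")                               # 0.7769
print(f"{reg_avg_without:.4f} / {reg_avg_with:.4f}")  # 0.7997 / 10.4398
```

Reporting the regression average both ways keeps the CLEARANCE task, with RMSE values in the 40s and 50s, from dominating the comparison across checkpoints.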
## Optimized ModChemBERT Hyperparameters

### TAFT Datasets

Optimal parameters (per dataset) for the `MLM + DAPT + TAFT OPT` merged model:

| Dataset | Learning Rate | Batch Size | Warmup Ratio | Classifier Pooling | Last k Layers |
| --- | --- | --- | --- | --- | --- |
| adme_microsom_stab_h | 3e-5 | 8 | 0.0 | max_seq_mean | 5 |
| adme_microsom_stab_r | 3e-5 | 16 | 0.2 | max_cls | 3 |
| adme_permeability | 3e-5 | 8 | 0.0 | max_cls | 3 |
| adme_ppb_h | 1e-5 | 32 | 0.1 | max_seq_mean | 5 |
| adme_ppb_r | 1e-5 | 32 | 0.0 | sum_mean | N/A |
| adme_solubility | 3e-5 | 32 | 0.0 | sum_mean | N/A |
| astrazeneca_CL | 3e-5 | 8 | 0.1 | max_seq_mha | 3 |
| astrazeneca_LogD74 | 1e-5 | 8 | 0.0 | max_seq_mean | 5 |
| astrazeneca_PPB | 1e-5 | 32 | 0.0 | max_cls | 3 |
| astrazeneca_Solubility | 1e-5 | 32 | 0.0 | max_seq_mean | 5 |

### Benchmarking Datasets

Optimal parameters (per dataset) for the `MLM + DAPT + TAFT OPT` merged model:

| Dataset | Batch Size | Classifier Pooling | Last k Layers | Pooling Attention Dropout | Classifier Dropout | Embedding Dropout |
| --- | --- | --- | --- | --- | --- | --- |
| bace_classification | 32 | max_seq_mha | 3 | 0.0 | 0.0 | 0.0 |
| bbbp | 64 | max_cls | 3 | 0.1 | 0.0 | 0.0 |
| clintox | 32 | max_seq_mha | 5 | 0.1 | 0.0 | 0.0 |
| hiv | 32 | max_seq_mha | 3 | 0.0 | 0.0 | 0.0 |
| sider | 32 | mean | N/A | 0.1 | 0.0 | 0.1 |
| tox21 | 32 | max_seq_mha | 5 | 0.1 | 0.0 | 0.0 |
| bace_regression | 32 | max_seq_mha | 5 | 0.1 | 0.0 | 0.0 |
| clearance | 32 | max_seq_mha | 5 | 0.1 | 0.0 | 0.0 |
| esol | 64 | sum_mean | N/A | 0.1 | 0.0 | 0.1 |
| freesolv | 32 | max_seq_mha | 5 | 0.1 | 0.0 | 0.0 |
| lipo | 32 | max_seq_mha | 3 | 0.1 | 0.1 | 0.1 |
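
The `Classifier Pooling` and `Last k Layers` columns select among the pooling strategies listed in the Pooling section. As a rough illustration of what the parameter-free strategies compute, here is a minimal NumPy sketch. It is not the repository's implementation: the function name `pool`, the `(num_layers, seq_len, hidden_size)` array layout, the assumption that [CLS] sits at token index 0, and the absence of padding masks are all simplifications made for illustration. The attention-based strategies (`cls_mha`, `max_seq_mha`) are left out because they involve learned weights.

```python
import numpy as np


def pool(hidden_states: np.ndarray, strategy: str, k: int = 3) -> np.ndarray:
    """Pool per-layer token embeddings for ONE molecule into a single vector.

    hidden_states: array of shape (num_layers, seq_len, hidden_size).
    Assumption: token index 0 is [CLS]; no padding tokens are present.
    """
    last_k = hidden_states[-k:]  # (k, seq_len, hidden_size)
    if strategy == "cls":            # last layer [CLS]
        return hidden_states[-1, 0]
    if strategy == "mean":           # mean over tokens of the last layer
        return hidden_states[-1].mean(axis=0)
    if strategy == "max_cls":        # max over last k layers of [CLS]
        return last_k[:, 0].max(axis=0)
    if strategy == "max_seq_mean":   # max over last k layers, then mean over tokens
        return last_k.max(axis=0).mean(axis=0)
    if strategy == "sum_mean":       # sum over all layers, then mean over tokens
        return hidden_states.sum(axis=0).mean(axis=0)
    raise ValueError(f"strategy not covered by this sketch: {strategy}")


# Toy input matching the architecture numbers above: 22 layers, hidden size 768.
rng = np.random.default_rng(0)
toy_states = rng.normal(size=(22, 16, 768))

for name in ("cls", "mean", "max_cls", "max_seq_mean", "sum_mean"):
    print(name, pool(toy_states, name, k=3).shape)
```

Every strategy reduces the stack of layer outputs to a single 768-dimensional vector, which is why `Last k Layers` is listed as N/A for `mean` and `sum_mean`: neither restricts itself to the last k layers.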

## Hardware

Training and experiments were performed on 2 NVIDIA RTX 3090 GPUs.

## Citation

If you use ModChemBERT in your research, please cite the checkpoint and the following:

```
@software{cortes-2025-modchembert,
  author = {Emmanuel Cortes},
  title = {ModChemBERT: ModernBERT as a Chemical Language Model},
  year = {2025},
  publisher = {GitHub},
  howpublished = {GitHub repository},
  url = {https://github.com/emapco/ModChemBERT}
}
```

## References

1. Kallergis, Georgios, et al. "Domain adaptable language modeling of chemical compounds identifies potent pathoblockers for Pseudomonas aeruginosa." Communications Chemistry 8.1 (2025): 114.
2. Behrendt, Maike, Stefan Sylvius Wagner, and Stefan Harmeling. "MaxPoolBERT: Enhancing BERT Classification via Layer-and Token-Wise Aggregation." arXiv preprint arXiv:2505.15696 (2025).
3. Sultan, Afnan, et al. "Transformers for molecular property prediction: Domain adaptation efficiently improves performance." arXiv preprint arXiv:2503.03360 (2025).
4. Warner, Benjamin, et al. "Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference." arXiv preprint arXiv:2412.13663 (2024).
5. Clavié, Benjamin. "JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources." Journal of Natural Language Processing 32.1 (2025): 176-218.
6. Grattafiori, Aaron, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).
7. Singh, Riya, et al. "ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models." (2025).