---
library_name: transformers
tags:
- smiles
- chemistry
- BERT
- molecules
license: mit
datasets:
- fabikru/half-of-chembl-2025-randomized-smiles-cleaned
---

# MolEncoder

MolEncoder is a BERT-based chemical language model pretrained on SMILES strings with the masked language modeling (MLM) objective. It was designed to investigate optimal pretraining strategies for molecular representation learning, with a particular focus on masking ratio, dataset size, and model size. The model is described in detail in the paper "MolEncoder: Towards Optimal Masked Language Modeling for Molecules".

## Model Description

- **Architecture:** Encoder-only transformer based on ModernBERT
- **Parameters:** ~15M
- **Tokenizer:** Character-level tokenizer covering the full SMILES vocabulary
- **Pretraining Objective:** Masked language modeling with an optimized masking ratio (30% was found to work best for molecules; see the sketch after this list)
- **Pretraining Data:** ~1M molecules (half of ChEMBL)
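
As an illustration of the pretraining objective, the snippet below builds MLM batches in which 30% of tokens are masked, using the standard Hugging Face data collator. The model identifier `fabikru/MolEncoder` is a placeholder assumption; substitute the actual Hub ID of this checkpoint.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder model ID (assumption); replace with the actual Hub identifier.
tokenizer = AutoTokenizer.from_pretrained("fabikru/MolEncoder")

# Mask 30% of tokens per batch, the ratio reported to work best for molecules.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)

# The character-level tokenizer splits each SMILES string into single-character tokens.
examples = [tokenizer(s) for s in ["CCO", "c1ccccc1O"]]
batch = collator(examples)

print(batch["input_ids"])  # inputs with mask tokens inserted at random positions
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```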

## Key Findings

- Higher masking ratios (20–60%) outperform the standard 15% used in prior molecular BERT models.
- Increasing model size or dataset size beyond moderate scales yields no consistent performance benefit and can degrade efficiency.
- This 15M-parameter model, pretrained on ~1M molecules, outperforms much larger models pretrained on substantially more SMILES strings.

## Intended Uses

- **Primary use:** Molecular property prediction through fine-tuning on downstream datasets (a fine-tuning sketch follows below)
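
The sketch below shows one way such a fine-tuning run could look for a single-target regression task. It is not the authors' pipeline (see "How to Use" below for that); the model identifier, toy data, and hyperparameters are illustrative assumptions.

```python
from datasets import Dataset, Value
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Placeholder model ID (assumption); replace with the actual Hub identifier.
model_id = "fabikru/MolEncoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=1,               # single regression target
    problem_type="regression",  # use MSE loss
)

# Toy data; replace with your own SMILES strings and measured property values.
data = Dataset.from_dict({
    "smiles": ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"],
    "labels": [0.12, 1.30, 0.85],
})
data = data.cast_column("labels", Value("float32"))
data = data.map(lambda ex: tokenizer(ex["smiles"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="molencoder-finetuned", num_train_epochs=3),
    train_dataset=data,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```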

## How to Use

Please refer to the [MolEncoder GitHub repository](https://github.com/FabianKruger/MolEncoder) for detailed instructions and ready-to-use examples for fine-tuning the model on custom data and running predictions.
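
As a quick start, the snippet below sketches one way to load the pretrained encoder and mean-pool its hidden states into a fixed-size embedding per molecule. The model identifier is again a placeholder assumption.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder model ID (assumption); replace with the actual Hub identifier.
model_id = "fabikru/MolEncoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

inputs = tokenizer("CC(=O)Oc1ccccc1C(=O)O", return_tensors="pt")  # aspirin
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_dim)

# Mean-pool over non-padding tokens to obtain one vector per molecule.
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # (1, hidden_dim)
```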

## Citation

If you use this model, please cite:
```bibtex
@article{Krüger_Österbacka_Kabeshov_Engkvist_Tetko_2025,
  title   = {MolEncoder: Towards Optimal Masked Language Modeling for Molecules},
  DOI     = {10.26434/chemrxiv-2025-h4w9d},
  journal = {ChemRxiv},
  author  = {Krüger, Fabian Per and Österbacka, Nicklas and Kabeshov, Mikhail and Engkvist, Ola and Tetko, Igor},
  year    = {2025}
}
```

Note: this work is a ChemRxiv preprint and has not been peer-reviewed.