library_name: transformers
tags:
  - smiles
  - chemistry
  - BERT
  - molecules
license: mit
datasets:
  - fabikru/half-of-chembl-2025-randomized-smiles-cleaned

MolEncoder

MolEncoder is a BERT-based chemical language model pretrained on SMILES strings using masked language modeling (MLM). It was designed to investigate optimal pretraining strategies for molecular representation learning, with a particular focus on masking ratio, dataset size, and model size. It is described in detail in the paper "MolEncoder: Towards Optimal Masked Language Modeling for Molecules".

Model Description

Architecture: Encoder-only transformer based on ModernBERT
Parameters: ~15M
Tokenizer: Character-level tokenizer covering the full SMILES vocabulary
Pretraining Objective: Masked language modeling with a tuned masking ratio (30% was found to work best for molecules)
Pretraining Data: ~1M molecules (half of ChEMBL)
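
To make the pretraining objective concrete, the following is a minimal sketch of masking 30% of the tokens in a single SMILES string. It is not the authors' training code. It assumes a plain one-token-per-character split, a hypothetical `[MASK]` symbol, and a hypothetical helper `mask_smiles`; the released tokenizer and masking procedure (for example, whether some selected tokens are replaced by random tokens or left unchanged) may differ.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical mask symbol used only in this sketch


def mask_smiles(smiles, mask_ratio=0.30, seed=None):
    """Mask a fraction of the characters in a SMILES string.

    Returns (masked_tokens, labels). labels holds the original character at
    each masked position and None elsewhere, mirroring how an MLM loss is
    computed only on masked positions.
    """
    rng = random.Random(seed)
    tokens = list(smiles)  # assumption: one token per character

    # Mask at least one token, but never more than the string contains.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))

    masked = [MASK_TOKEN if i in masked_positions else t
              for i, t in enumerate(tokens)]
    labels = [t if i in masked_positions else None
              for i, t in enumerate(tokens)]
    return masked, labels


if __name__ == "__main__":
    aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
    masked, labels = mask_smiles(aspirin, mask_ratio=0.30, seed=0)
    print(" ".join(masked))
```

With a 24-character SMILES string and a ratio of 0.30, this masks 7 positions; the model is then trained to predict the original characters at exactly those positions.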

Key Findings

Masking ratios of 20–60% outperform the standard 15% used in prior molecular BERT models.
Scaling model size or dataset size beyond moderate levels yields no consistent performance gains and can reduce efficiency.
This 15M-parameter model, pretrained on ~1M molecules, outperforms much larger models pretrained on more SMILES strings.

Intended Uses

Primary use: Molecular property prediction through fine-tuning on downstream datasets

How to Use

Please refer to the MolEncoder GitHub repository for detailed instructions and ready-to-use examples on fine-tuning the model on custom data and running predictions.
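
As a rough starting point before consulting the repository, the sketch below shows one way the checkpoint might be loaded with the `transformers` library and given a single-output head for property regression. The repository id `fabikru/MolEncoder` is an assumption inferred from this model card, and `load_for_regression` is a hypothetical helper; neither is taken from the authors' instructions. The prediction head is newly initialized and needs fine-tuning before its outputs mean anything.

```python
# Assumption: Hub repository id inferred from this model card; verify before use.
MODEL_ID = "fabikru/MolEncoder"


def load_for_regression(model_id=MODEL_ID):
    """Load the tokenizer and the encoder with a one-output prediction head.

    Hypothetical helper for illustration. Calling it downloads the checkpoint
    from the Hugging Face Hub, so it needs network access.
    """
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # num_labels=1 adds a single-output head, the usual setup for regression.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=1
    )
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = load_for_regression()
    batch = tokenizer(["CCO", "c1ccccc1"], padding=True, return_tensors="pt")
    outputs = model(**batch)
    print(outputs.logits.shape)  # one value per input molecule
```

Because the architecture is based on ModernBERT, a recent `transformers` release that includes ModernBERT support is likely required. The GitHub repository remains the authoritative source for the fine-tuning workflow.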

Citation

If you use this model, please cite:

@article{Krüger_Österbacka_Kabeshov_Engkvist_Tetko_2025,
  title={MolEncoder: Towards Optimal Masked Language Modeling for Molecules},
  DOI={10.26434/chemrxiv-2025-h4w9d},
  journal={ChemRxiv},
  author={Krüger, Fabian Per and Österbacka, Nicklas and Kabeshov, Mikhail and Engkvist, Ola and Tetko, Igor},
  year={2025}
}

This content is a preprint and has not been peer-reviewed.