SaeedLab/MolDeBERTa-tiny-10M-mlm

@@ -18,7 +18,7 @@ This model corresponds to the MolDeBERTa tiny architecture pretrained on the 10M
 ## Abstract
-Encoder-based molecular transformer foundation models for SMILES strings have become the dominant paradigm for learning molecular representations, achieving substantial progress across a wide range of downstream chemical tasks. Despite these advances, most existing models rely on first-generation transformer architectures and are predominantly pretrained using masked language modeling—a generic objective that fails to explicitly encode physicochemical or structural information. In this work, we introduce MolDeBERTa, an encoder-based molecular framework built upon the DeBERTaV2 architecture and pretrained on large-scale SMILES data. We systematically investigate the interplay between model scale, pretraining dataset size, and pretraining objective by training 30 MolDeBERTa variants across three architectural scales, two dataset sizes, and five distinct pretraining objectives. Crucially, we introduce three novel pretraining objectives designed to inject strong inductive biases regarding molecular properties and structural similarity directly into the model's latent space. Across nine downstream benchmarks from MoleculeNet, MolDeBERTa achieves state-of-the-art performance on 7 out of 9 tasks under a rigorous fine-tuning protocol. Our results demonstrate that chemically grounded pretraining objectives consistently outperform standard masked language modeling. Finally, based on atom-level interpretability analyses, we provide qualitative evidence that MolDeBERTa learns task-specific molecular representations, highlighting chemically relevant substructures in a manner consistent with known physicochemical principles. These results establish MolDeBERTa as a robust encoder-based foundation model for chemistry-informed representation learning.
+Foundation models that learn the "language" of molecules are essential for accelerating material and drug discovery. These self-supervised models can be trained on large collections of unlabelled molecules, enabling applications such as property prediction, molecule design, and screening for specific functions. However, existing molecular language models rely on masked language modeling, a generic token-level objective that is agnostic to physicochemical properties and molecular substructure. Here we introduce MolDeBERTa, a chemistry-informed self-supervised molecular encoder built upon the DeBERTaV2 architecture with byte-level Byte-Pair Encoding (BPE) tokenization. MolDeBERTa is pretrained on up to 123 million SMILES from PubChem using three novel pretraining objectives designed to inject strong inductive biases for molecular properties and substructure similarity directly into the latent space. The model is systematically investigated across three architectural scales, two dataset sizes, and five distinct pretraining objectives, of which three are novel and two are adapted from prior work. When evaluated on 9 MoleculeNet benchmarks, MolDeBERTa achieves the best overall performance on 4 out of 9 tasks and outperforms SMILES-based encoders on 7 out of 9 tasks, with up to a 16% reduction in regression error and improvements of up to 2.2 ROC-AUC points on classification tasks.
 ## Model Details