Feature Extraction
Transformers
Safetensors
deberta-v2
fill-mask
chemistry
bioinformatics
drug-discovery
text-embeddings-inference
Instructions to use SaeedLab/MolDeBERTa-tiny-10M-mlm with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use SaeedLab/MolDeBERTa-tiny-10M-mlm with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("feature-extraction", model="SaeedLab/MolDeBERTa-tiny-10M-mlm")

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("SaeedLab/MolDeBERTa-tiny-10M-mlm")
    model = AutoModelForMaskedLM.from_pretrained("SaeedLab/MolDeBERTa-tiny-10M-mlm")

- Notebooks
- Google Colab
- Kaggle
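The feature-extraction usage above yields one vector per token. To get a single vector per molecule, one common option is mask-aware mean pooling over the token vectors. The sketch below assumes that pooling choice, which the model card does not prescribe. `masked_mean_pool` and `embed_smiles` are hypothetical helper names, and `embed_smiles` downloads the checkpoint when called.

```python
from typing import List, Sequence


def masked_mean_pool(hidden: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1; padding positions are skipped."""
    if len(hidden) != len(mask):
        raise ValueError("hidden and mask must have the same length")
    kept = [vec for vec, m in zip(hidden, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_smiles(smiles: str, model_id: str = "SaeedLab/MolDeBERTa-tiny-10M-mlm") -> List[float]:
    """Hypothetical helper: return one pooled embedding for a SMILES string."""
    # Imported lazily so masked_mean_pool works without torch/transformers installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)  # encoder only, no masked-LM head
    model.eval()

    inputs = tokenizer(smiles, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (tokens, hidden_size)

    # Assumption: mean pooling over non-padding tokens is an adequate summary.
    return masked_mean_pool(hidden.tolist(), inputs["attention_mask"][0].tolist())
```

Calling `embed_smiles("CCO")` would return a plain Python list whose length equals the encoder's hidden size. For batches, the same pooling is usually done directly on tensors rather than on lists.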
Update README.md
README.md
CHANGED
@@ -18,7 +18,7 @@ This model corresponds to the MolDeBERTa tiny architecture pretrained on the 10M
 
 ## Abstract
 
-
+Foundational models that learn the "language" of molecules are essential for accelerating material and drug discovery. These self-learning models can be trained on large collections of unlabelled molecules, enabling applications such as property prediction, molecule design, and screening for specific functions. However, existing molecular language models rely on masked language modeling, a generic token-level objective that is agnostic to physicochemical and substructure molecular properties. Here we introduce MolDeBERTa, a chemistry-informed self-supervised molecular encoder built upon the DeBERTaV2 architecture with byte-level Byte-Pair Encoding (BPE) tokenization. MolDeBERTa is pretrained on up to 123 million SMILES from PubChem using three novel pretraining objectives designed to inject strong inductive biases for molecular properties and substructure similarity directly into the latent space. The model is systematically investigated across three architectural scales, two dataset sizes, and five distinct pretraining objectives, of which three are novel and two are adapted from prior work. When evaluated on 9 MoleculeNet benchmarks, MolDeBERTa achieves the best overall performance on 4 out of 9 tasks and outperforms SMILES-based encoders on 7 out of 9 tasks, with up to a 16% reduction in regression error, and improvements of up to 2.2 ROC-AUC points on classification tasks.
 
 ## Model Details
 
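The abstract describes masked language modeling as a generic token-level objective, and the `-mlm` suffix in the model name suggests this checkpoint was pretrained with it. As a rough illustration of what "token-level" means here, the sketch below hides a random subset of tokens and keeps the originals as prediction targets. It rests on several assumptions not taken from the source: single characters stand in for the byte-level BPE tokens, the 15% default rate is the conventional BERT-style value, and `mask_tokens` is a hypothetical helper rather than the authors' implementation.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"  # placeholder mask symbol; the real tokenizer defines its own


def mask_tokens(
    tokens: Sequence[str], mask_prob: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Replace each token with MASK with probability mask_prob.

    Returns (inputs, labels). labels[i] holds the original token at masked
    positions and None elsewhere, so only masked positions would be scored.
    """
    rng = random.Random(seed)  # local generator keeps the example reproducible
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


# Toy tokenization (assumption): one character per token for an aspirin SMILES.
tokens = list("CC(=O)Oc1ccccc1C(=O)O")
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print("".join(masked))
```

Nothing in this objective looks at ring systems, functional groups, or computed properties: every token is treated alike, which is the limitation the abstract points to.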