Note: DeepBERTa is not affiliated with Microsoft's DeBERTa model. The name is inspired by BERT-based models and refers to a DeepSMILES-trained RoBERTa variant.
# DeepBERTa_zinc_base_100k_v4
A pretrained RoBERTa-style transformer model for learning molecular representations from DeepSMILES-encoded chemical structures. The model was trained on a corpus of 100,000 canonicalized molecules sampled from ZINC, using a BPE tokenizer with a vocabulary size of 767. Training followed a recipe similar to ChemBERTa, but on DeepSMILES instead of SMILES.
## Preprocessing
All molecules were converted from SMILES to DeepSMILES before tokenization. DeepSMILES is an alternative molecular notation that drops paired ring-closure digits (a single ring-size digit is used instead) and opening branch parentheses, which simplifies the string syntax and makes it more amenable to NLP-style sequence modeling.
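A minimal preprocessing sketch, assuming the `rdkit` and `deepsmiles` packages are available; the exact canonicalization settings used for the training corpus are not documented here, so treat this as illustrative:

```python
from rdkit import Chem
import deepsmiles

# Converter that rewrites both ring closures and branches (full DeepSMILES)
converter = deepsmiles.Converter(rings=True, branches=True)

def smiles_to_deepsmiles(smiles: str) -> str:
    """Canonicalize a SMILES string with RDKit, then encode it as DeepSMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    canonical = Chem.MolToSmiles(mol)
    return converter.encode(canonical)

print(smiles_to_deepsmiles("c1ccccc1C(=O)O"))  # benzoic acid, encoded from its canonical SMILES
```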
## Model Overview
- Architecture: RoBERTa-style (6 Transformer layers, 12 attention heads)
- Objective: Masked Language Modeling (MLM)
- Input Format: DeepSMILES strings (tokenized using custom BPE)
- Training Framework: PyTorch + Hugging Face Transformers
- Training Duration: ~0.35 epochs on the 100k-molecule corpus
- Final Evaluation Loss: 1.57
- Model Type: `RobertaForMaskedLM`
## Training Setup
| Parameter | Value |
|---|---|
| Dataset | 100,000 DeepSMILES from ZINC |
| Tokenizer Type | Byte-Pair Encoding (BPE) |
| Vocabulary Size | 767 |
| Masking Ratio | 15% |
| Optimizer | AdamW |
| Batch Size | 8 per device |
| Training Time | ~0.35 epochs |
## Intended Uses
### Unsupervised molecular representation learning
Use the pretrained model to encode molecules into contextual embeddings. These embeddings capture chemical structure and substructure information and can be used as input to other models.
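A hedged embedding-extraction sketch, assuming the Hub checkpoint ships its tokenizer files; loading `RobertaModel` keeps only the encoder and discards the MLM head, and the mean-pooling step is one common choice rather than something prescribed by this model:

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

model_path = "aakothari/DeepBERTa_zinc_base_100k_v1"  # repo id as used in the example below
tokenizer = RobertaTokenizer.from_pretrained(model_path)
encoder = RobertaModel.from_pretrained(model_path)  # encoder only, MLM head discarded
encoder.eval()

deepsmiles_string = "cccccc6C=O)O"  # illustrative DeepSMILES input
inputs = tokenizer(deepsmiles_string, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

# Mean-pool over tokens to get a single molecule-level embedding
embedding = hidden.mean(dim=1).squeeze(0)
print(embedding.shape)
```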
### Fine-tuning on downstream property classification tasks
Examples include:
- Toxicity prediction (e.g., Tox21, ToxCast)
- Blood-brain barrier permeability (e.g., BBBP, B3DB)
- Binding affinity or bioactivity prediction (e.g., BACE, HIV)
To use the model for these tasks, add a classification head and fine-tune on a labeled dataset, as sketched below.
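A minimal fine-tuning sketch, assuming a binary classification task; the DeepSMILES strings, labels, and hyperparameters here are illustrative placeholders, not part of the released model:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import RobertaForSequenceClassification, RobertaTokenizer

model_path = "aakothari/DeepBERTa_zinc_base_100k_v1"  # repo id as used in the example below
tokenizer = RobertaTokenizer.from_pretrained(model_path)

# Adds a randomly initialized classification head on top of the pretrained encoder
clf = RobertaForSequenceClassification.from_pretrained(model_path, num_labels=2)

# Illustrative labeled data: DeepSMILES strings with binary labels
texts = ["cccccc6C=O)O", "CCO"]
labels = torch.tensor([0, 1])

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
                    batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(clf.parameters(), lr=2e-5)
clf.train()
for input_ids, attention_mask, y in loader:
    optimizer.zero_grad()
    out = clf(input_ids=input_ids, attention_mask=attention_mask, labels=y)
    out.loss.backward()
    optimizer.step()
```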
### Masked token prediction or augmentation
The MLM objective enables:
- Predicting missing or corrupted substructures in DeepSMILES strings (sketched below)
- Creating chemically valid perturbations for data augmentation
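A fill-mask sketch, assuming the checkpoint works with the standard `fill-mask` pipeline and the default RoBERTa `<mask>` token; predicted tokens are not guaranteed to yield chemically valid strings, so candidates should be validated (e.g., by decoding back to SMILES with RDKit) before being used for augmentation:

```python
from transformers import pipeline

model_path = "aakothari/DeepBERTa_zinc_base_100k_v1"  # repo id as used in the example below
fill = pipeline("fill-mask", model=model_path, tokenizer=model_path)

# Mask one token of a DeepSMILES string and let the model propose substitutions
masked = "cccccc6C=O)<mask>"  # illustrative masked fragment
for candidate in fill(masked, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```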
This model is not intended for:
**De novo molecular generation**
Masked language models are bidirectional and do not generate sequences autoregressively. For generation, use models such as:
- RNN/LSTM-based generators
- Variational autoencoders (e.g., Junction Tree VAE)
**End-to-end property prediction without fine-tuning**
This model does not come with a classification head. To predict properties, it must be fine-tuned on a task-specific dataset.

**Direct use with raw SMILES (without converting to DeepSMILES)**
This version is trained on DeepSMILES, which differs from canonical SMILES. Preprocessing is required to convert SMILES → DeepSMILES (see the Preprocessing section above).
## Example Usage (fragment prediction)
```python
import torch
import torch.optim as optim
from torch.utils.data import DataLoader
from transformers import RobertaForMaskedLM, RobertaTokenizer, get_linear_schedule_with_warmup

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

BATCH_SIZE = 8
NUM_EPOCHS = 1

# Load model and tokenizer from the Hugging Face Hub
model_path = "aakothari/DeepBERTa_zinc_base_100k_v1"
model = RobertaForMaskedLM.from_pretrained(model_path).to(device)
tokenizer = RobertaTokenizer.from_pretrained(model_path)

# Preprocess data
# DeepSMILESDataset, collate and train_df are user-defined: the dataset should yield
# tokenized DeepSMILES strings with masked-LM labels, and collate should pad each batch.
train_ds = DeepSMILESDataset(train_df, tokenizer)
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate)

# Optimizer and linear warmup/decay schedule
total_steps = len(train_loader) * NUM_EPOCHS
optimizer = optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)

# Training loop
model.train()
for epoch in range(NUM_EPOCHS):
    for batch in train_loader:
        optimizer.zero_grad()
        inputs = {k: v.to(device) for k, v in batch.items() if k != "actual_fragment_deepsmiles"}
        outputs = model(**inputs)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
```
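A short inference sketch to pair with the training loop above, assuming a single `<mask>` token marks the fragment to predict; the masked string and decoding logic are illustrative:

```python
# Predict the masked fragment for a single DeepSMILES string
model.eval()
masked_string = "cccccc6C=O)<mask>"  # illustrative masked fragment
inputs = tokenizer(masked_string, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits

# Locate mask positions and take the highest-scoring token at each one
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
predicted_ids = logits[mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```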