Note: DeepBERTa is not affiliated with Microsoft's DeBERTa model. The name is inspired by BERT-based models and refers to a DeepSMILES-trained RoBERTa variant.
# DeepBERTa_zinc_base_100k_v4
A pretrained RoBERTa-style transformer model for learning molecular representations from DeepSMILES-encoded chemical structures. The model was trained on a corpus of 100,000 canonicalized molecules sampled from ZINC, using a BPE tokenizer with a vocabulary size of 767. Training followed a recipe similar to ChemBERTa, but on DeepSMILES instead of SMILES.
## Preprocessing
All molecules were converted from SMILES to DeepSMILES before tokenization. DeepSMILES is an alternative molecular notation that drops paired ring-closure digits (a single ring-size digit is used instead) and opening branch parentheses, which simplifies the string syntax and makes it more amenable to NLP-style sequence modeling.
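A minimal preprocessing sketch, assuming the `rdkit` and `deepsmiles` packages are available; the exact canonicalization settings used for the training corpus are not documented here, so treat this as illustrative:

```python
from rdkit import Chem
import deepsmiles

# Converter that rewrites both ring closures and branches (full DeepSMILES)
converter = deepsmiles.Converter(rings=True, branches=True)

def smiles_to_deepsmiles(smiles: str) -> str:
    """Canonicalize a SMILES string with RDKit, then encode it as DeepSMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    canonical = Chem.MolToSmiles(mol)
    return converter.encode(canonical)

print(smiles_to_deepsmiles("c1ccccc1C(=O)O"))  # benzoic acid, encoded from its canonical SMILES
```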
## Model Overview
- Architecture: RoBERTa-style (6 Transformer layers, 12 attention heads)
- Objective: Masked Language Modeling (MLM)
- Input Format: DeepSMILES strings (tokenized using custom BPE)
- Training Framework: PyTorch + Hugging Face Transformers
- Training Duration: ~0.35 epochs on the 100k-molecule corpus
- Final Evaluation Loss: 1.57
- Model Type: `RobertaForMaskedLM`
## Training Setup
| Parameter | Value |
|---|---|
| Dataset | 100,000 DeepSMILES from ZINC |
| Tokenizer Type | Byte-Pair Encoding (BPE) |
| Vocabulary Size | 767 |
| Masking Ratio | 15% |
| Optimizer | AdamW |
| Batch Size | 8 per device |
| Training Time | ~0.35 epochs |
## Intended Uses
### Unsupervised molecular representation learning
Use the pretrained model to encode molecules into contextual embeddings. These embeddings capture chemical structure and substructure information and can be used as input to other models.
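A hedged embedding-extraction sketch, assuming the Hub checkpoint ships its tokenizer files; loading `RobertaModel` keeps only the encoder and discards the MLM head, and the mean-pooling step is one common choice rather than something prescribed by this model:

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

model_path = "aakothari/DeepBERTa_zinc_base_100k_v1"  # repo id as used in the example below
tokenizer = RobertaTokenizer.from_pretrained(model_path)
encoder = RobertaModel.from_pretrained(model_path)  # encoder only, MLM head discarded
encoder.eval()

deepsmiles_string = "cccccc6C=O)O"  # illustrative DeepSMILES input
inputs = tokenizer(deepsmiles_string, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

# Mean-pool over tokens to get a single molecule-level embedding
embedding = hidden.mean(dim=1).squeeze(0)
print(embedding.shape)
```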
### Fine-tuning on downstream property classification tasks
Examples include:
- Toxicity prediction (e.g., Tox21, ToxCast)
- Blood-brain barrier permeability (e.g., BBBP, B3DB)
- Binding affinity or bioactivity prediction (e.g., BACE, HIV)
To use the model for these tasks, add a classification head and fine-tune on a labeled dataset, as sketched below.
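A minimal fine-tuning sketch, assuming a binary classification task; the DeepSMILES strings, labels, and hyperparameters here are illustrative placeholders, not part of the released model:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import RobertaForSequenceClassification, RobertaTokenizer

model_path = "aakothari/DeepBERTa_zinc_base_100k_v1"  # repo id as used in the example below
tokenizer = RobertaTokenizer.from_pretrained(model_path)

# Adds a randomly initialized classification head on top of the pretrained encoder
clf = RobertaForSequenceClassification.from_pretrained(model_path, num_labels=2)

# Illustrative labeled data: DeepSMILES strings with binary labels
texts = ["cccccc6C=O)O", "CCO"]
labels = torch.tensor([0, 1])

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
                    batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(clf.parameters(), lr=2e-5)
clf.train()
for input_ids, attention_mask, y in loader:
    optimizer.zero_grad()
    out = clf(input_ids=input_ids, attention_mask=attention_mask, labels=y)
    out.loss.backward()
    optimizer.step()
```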
### Masked token prediction or augmentation
The MLM objective enables:
- Predicting missing or corrupted substructures in DeepSMILES strings (sketched below)
- Creating chemically valid perturbations for data augmentation
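A fill-mask sketch, assuming the checkpoint works with the standard `fill-mask` pipeline and the default RoBERTa `<mask>` token; predicted tokens are not guaranteed to yield chemically valid strings, so candidates should be validated (e.g., by decoding back to SMILES with RDKit) before being used for augmentation:

```python
from transformers import pipeline

model_path = "aakothari/DeepBERTa_zinc_base_100k_v1"  # repo id as used in the example below
fill = pipeline("fill-mask", model=model_path, tokenizer=model_path)

# Mask one token of a DeepSMILES string and let the model propose substitutions
masked = "cccccc6C=O)<mask>"  # illustrative masked fragment
for candidate in fill(masked, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```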
This model is not intended for:
**De novo molecular generation**
Masked language models are bidirectional and do not generate sequences autoregressively. For generation, use models such as:
- RNN/LSTM-based generators
- Variational autoencoders (e.g., Junction Tree VAE)
**End-to-end property prediction without fine-tuning**
This model does not come with a classification head. To predict properties, it must be fine-tuned on a task-specific dataset.

**Direct use with raw SMILES (without converting to DeepSMILES)**
This version is trained on DeepSMILES, which differs from canonical SMILES. Preprocessing is required to convert SMILES → DeepSMILES (see the Preprocessing section above).
## Example Usage (fragment prediction)
```python
import torch
import torch.optim as optim
from torch.utils.data import DataLoader
from transformers import RobertaForMaskedLM, RobertaTokenizer, get_linear_schedule_with_warmup

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

BATCH_SIZE = 8
NUM_EPOCHS = 1

# Load model and tokenizer from the Hugging Face Hub
model_path = "aakothari/DeepBERTa_zinc_base_100k_v1"
model = RobertaForMaskedLM.from_pretrained(model_path).to(device)
tokenizer = RobertaTokenizer.from_pretrained(model_path)

# Preprocess data
# DeepSMILESDataset, collate and train_df are user-defined: the dataset should yield
# tokenized DeepSMILES strings with masked-LM labels, and collate should pad each batch.
train_ds = DeepSMILESDataset(train_df, tokenizer)
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate)

# Optimizer and linear warmup/decay schedule
total_steps = len(train_loader) * NUM_EPOCHS
optimizer = optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)

# Training loop
model.train()
for epoch in range(NUM_EPOCHS):
    for batch in train_loader:
        optimizer.zero_grad()
        inputs = {k: v.to(device) for k, v in batch.items() if k != "actual_fragment_deepsmiles"}
        outputs = model(**inputs)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
```
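A short inference sketch to pair with the training loop above, assuming a single `<mask>` token marks the fragment to predict; the masked string and decoding logic are illustrative:

```python
# Predict the masked fragment for a single DeepSMILES string
model.eval()
masked_string = "cccccc6C=O)<mask>"  # illustrative masked fragment
inputs = tokenizer(masked_string, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits

# Locate mask positions and take the highest-scoring token at each one
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
predicted_ids = logits[mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```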