molcrawl-compounds-bert-small
Model Description
BERT small foundation model (masked language model) pre-trained on compound SMILES strings from the MolCrawl dataset.
The tokenizer is a character-level BPE tokenizer (vocab_size=612) that encodes each SMILES character as a separate token. Input SMILES strings should be passed without spaces (e.g. CC(=O)O). The [SEP] token (id=13) is used as the end-of-sequence marker.
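The input convention above can be sketched in plain Python. This is only an illustration of the stated behaviour (one token per SMILES character, no spaces in the input), not the tokenizer's actual implementation; the helper names `prepare_smiles` and `expected_char_tokens` are hypothetical and do not exist in this repository.

```python
from typing import List


def prepare_smiles(smiles: str) -> str:
    """Trim surrounding whitespace and reject SMILES that contain internal spaces."""
    cleaned = smiles.strip()
    if not cleaned:
        raise ValueError("empty SMILES string")
    if any(ch.isspace() for ch in cleaned):
        raise ValueError(f"SMILES must not contain whitespace: {smiles!r}")
    return cleaned


def expected_char_tokens(smiles: str) -> List[str]:
    """Split a SMILES string into one token per character (assumed behaviour)."""
    return list(prepare_smiles(smiles))


print(expected_char_tokens("CC(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O']
```

To check this assumption against the real tokenizer, compare the result with `tokenizer.tokenize("CC(=O)O")` after loading the tokenizer as shown under Usage; the two may differ if the BPE vocabulary contains multi-character merges.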
Datasets
MolCrawl compounds corpus (chembl + zinc + opv + reddb + pubchemqc): https://github.com/mmai-framework-lab/MolCrawl-HFuploader/blob/main/workflows/hugging_face/run_upload_hf.sh (Pre-training corpus)
Model Type: bert
Data Type: Molecule/Compound
Training Date: 2026-04-13
Usage
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("kojima-lab/molcrawl-compounds-bert-small")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-compounds-bert-small")

# Predict masked SMILES token
prompt = "CC(=O)[MASK]"
inputs = tokenizer(prompt, return_tensors="pt")

# Column index of the [MASK] token in the input sequence
mask_token_id = tokenizer.mask_token_id
mask_index = (inputs["input_ids"] == mask_token_id).nonzero(as_tuple=True)[1]

# Inference only, so gradients are not needed
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Most likely token at the masked position
predicted_token_id = logits[0, mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
result = prompt.replace("[MASK]", predicted_token)
print(f"Predicted: {result}")
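The snippet above keeps only the single most likely token. When several candidates for the masked position are of interest, the logits can be turned into probabilities and ranked. The following is a minimal sketch in plain Python; `rank_candidates` is a hypothetical helper, and the toy vocabulary and logits are made up purely for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def rank_candidates(
    logits_row: Sequence[float],
    id_to_token: Callable[[int], str],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k most probable tokens for one masked position."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits_row)
    exps = [math.exp(x - peak) for x in logits_row]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort token ids by descending probability and keep the first k.
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id_to_token(i), probs[i]) for i in top_ids]


# Toy example with made-up values (not real model output).
toy_vocab = ["C", "O", "N", "Cl"]
toy_logits = [1.0, 3.0, 2.0, 0.0]
for token, prob in rank_candidates(toy_logits, toy_vocab.__getitem__, k=2):
    print(f"{token}\t{prob:.3f}")
```

With the real model, one way to call it would be `rank_candidates(logits[0, mask_index[0]].tolist(), tokenizer.convert_ids_to_tokens, k=5)`, assuming the prompt contains at least one `[MASK]` token.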
Training
This model was trained with the RIKEN Foundation Model pipeline. For more details, please refer to the training configuration files included in this repository.
License
This model is released under the Apache-2.0 license.
Citation
If you use this model, please cite:
@misc{molcrawl_compounds_bert_small,
title={molcrawl-compounds-bert-small},
author={{RIKEN}},
year={2026},
publisher={{Hugging Face}},
url={{https://huggingface.co/kojima-lab/molcrawl-compounds-bert-small}}
}