---
license: apache-2.0
tags:
  - pytorch
  - bert
  - molecule-compound
pipeline_tag: fill-mask
---

molcrawl-compounds-chembl-bert-small

Model Description

BERT small fine-tuned on ChEMBL compound SMILES data, starting from the molcrawl-compounds-bert-small pre-trained model.

The tokenizer is a character-level BPE tokenizer (vocab_size=612). Input SMILES strings should be passed without spaces. The [SEP] token (id=13) is used as the end-of-sequence marker.
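Because the tokenizer is character-level, a SMILES string maps to roughly one token per character, with [SEP] appended as the end-of-sequence marker. The sketch below illustrates this encoding behavior with a toy vocabulary (the toy ids are hypothetical; the real 612-entry vocabulary is loaded with AutoTokenizer, and only the [SEP] id of 13 is taken from this card):

```python
# Illustrative sketch of character-level SMILES encoding.
# The toy vocabulary ids below are hypothetical; the real tokenizer
# (vocab_size=612) ships with the model. [SEP] id=13 is per this card.
SEP_ID = 13  # end-of-sequence marker id

def encode_smiles(smiles, vocab):
    """Map each character to its id and append the [SEP] marker."""
    return [vocab[ch] for ch in smiles] + [SEP_ID]

# Toy vocabulary covering only the characters of "CC(=O)O" (acetic acid).
toy_vocab = {"C": 20, "(": 21, "=": 22, "O": 23, ")": 24}
ids = encode_smiles("CC(=O)O", toy_vocab)
print(ids)  # one id per character, terminated by [SEP]
```

Note that the SMILES string is passed as-is, without spaces, matching the requirement above.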

Datasets

ChEMBL (compound SMILES)

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("kojima-lab/molcrawl-compounds-chembl-bert-small")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-compounds-chembl-bert-small")

# Predict masked SMILES token
prompt = "CC(=O)[MASK]"
inputs = tokenizer(prompt, return_tensors="pt")
mask_token_id = tokenizer.mask_token_id
mask_index = (inputs["input_ids"] == mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

predicted_token_id = logits[0, mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
result = prompt.replace("[MASK]", predicted_token)
print(f"Predicted: {result}")
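Beyond the single argmax token, nearby candidates can be ranked with a top-k over the logits at the mask position. The sketch below uses a stand-in random tensor in place of logits[0, mask_index] so it runs without downloading the model; with the real model, substitute that slice directly:

```python
import torch

# Stand-in for logits[0, mask_index] from the snippet above:
# a (1, vocab_size) tensor of scores over the 612-entry vocabulary.
torch.manual_seed(0)
mask_logits = torch.randn(1, 612)

# Convert scores to probabilities and take the 5 most likely token ids.
probs = torch.softmax(mask_logits, dim=-1)
top_probs, top_ids = probs.topk(5, dim=-1)

for rank, (tok_id, p) in enumerate(zip(top_ids[0], top_probs[0]), start=1):
    # With the real tokenizer, decode each id via tokenizer.decode(tok_id).
    print(f"{rank}. token id {tok_id.item()} (p={p.item():.4f})")
```

Ranked candidates are often more useful than a single prediction for SMILES, since several completions of a fragment can be chemically valid.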

Training

This model was trained with the RIKEN Foundation Model pipeline. For more details, please refer to the training configuration files included in this repository.

License

This model is released under the Apache-2.0 license.

Citation

If you use this model, please cite:

@misc{molcrawl_compounds_chembl_bert_small,
  title={molcrawl-compounds-chembl-bert-small},
  author={{RIKEN}},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/kojima-lab/molcrawl-compounds-chembl-bert-small}
}