molcrawl-compounds-chembl-gpt2-small
Model Description
A GPT-2 small model (124M parameters) fine-tuned on SMILES strings of ChEMBL compounds, starting from the pre-trained molcrawl-compounds-gpt2-small model.
The tokenizer is a character-level BPE tokenizer (vocab_size=612). Input SMILES strings should be passed without spaces. The [SEP] token (id=13) is used as the end-of-sequence marker.
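To make the input convention concrete, here is a toy illustration of character-level tokenization of a SMILES string with [SEP] appended as the end-of-sequence marker. This is NOT the actual trained BPE tokenizer (which has a vocabulary of 612 merges); it only demonstrates the no-spaces input format and the role of [SEP].

```python
# Toy illustration only -- the real tokenizer is a trained character-level BPE
# with vocab_size=612; this sketch just shows the expected input convention.
SEP_TOKEN = "[SEP]"  # end-of-sequence marker (id=13 in the real tokenizer)

def char_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into single characters; input must contain no spaces."""
    if " " in smiles:
        raise ValueError("SMILES input must be passed without spaces")
    return list(smiles)

tokens = char_tokenize("CC(=O)O") + [SEP_TOKEN]
print(tokens)  # ['C', 'C', '(', '=', 'O', ')', 'O', '[SEP]']
```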
Datasets
ChEMBL: https://www.ebi.ac.uk/chembl/ (Fine-tuning dataset)
MolCrawl compounds corpus: https://github.com/mmai-framework-lab/MolCrawl-HFuploader/blob/main/workflows/hugging_face/run_upload_hf.sh (Pre-training corpus used by the base model)
Model Type: gpt2
Data Type: Molecule/Compound
Training Date: 2026-04-14
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("kojima-lab/molcrawl-compounds-chembl-gpt2-small")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-compounds-chembl-gpt2-small")

# Generate a SMILES continuation from a prompt fragment
prompt = "CC(=O)O"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        eos_token_id=tokenizer.convert_tokens_to_ids("[SEP]"),  # [SEP] is EOS for compounds
        pad_token_id=0,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
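Sampled outputs are not guaranteed to be valid SMILES, so generated strings are typically filtered before use. Full validity checking requires a cheminformatics toolkit such as RDKit (`Chem.MolFromSmiles`); the sketch below is only a lightweight heuristic pre-filter checking parenthesis balance and paired ring-closure digits (it does not handle the `%nn` notation for ring numbers above 9).

```python
from collections import Counter

# Heuristic sanity check for generated SMILES strings. This is a cheap
# pre-filter, not a validity check -- use RDKit for real parsing.
def looks_plausible(smiles: str) -> bool:
    # Branch parentheses must balance and never close before opening.
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    if depth != 0:
        return False
    # Each single-digit ring-closure label must occur an even number of times.
    digits = Counter(ch for ch in smiles if ch.isdigit())
    return all(count % 2 == 0 for count in digits.values())

print(looks_plausible("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> True
print(looks_plausible("CC(=O)O)"))               # unbalanced -> False
```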
Training
This model was trained with the RIKEN Foundation Model pipeline. For more details, please refer to the training configuration files included in this repository.
License
This model is released under the Apache-2.0 license.
Citation
If you use this model, please cite:
@misc{molcrawl_compounds_chembl_gpt2_small,
  title={molcrawl-compounds-chembl-gpt2-small},
  author={{RIKEN}},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/kojima-lab/molcrawl-compounds-chembl-gpt2-small}
}