molcrawl-compounds-chembl-gpt2-medium

Model Description

This model is GPT-2 medium (345M parameters), fine-tuned on ChEMBL compound SMILES data starting from the pre-trained molcrawl-compounds-gpt2-medium model.

The tokenizer is a character-level BPE tokenizer (vocab_size=612). Input SMILES strings should be passed without spaces. The [SEP] token (id=13) is used as the end-of-sequence marker.
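The two conventions above (no spaces in the input, [SEP] as end-of-sequence) can be illustrated with a small pure-Python sketch. The helper names `strip_whitespace` and `truncate_at_sep` are hypothetical and not part of this repository; the id 13 is taken from the description above.

```python
SEP_TOKEN_ID = 13  # [SEP] id as stated in this model card


def strip_whitespace(smiles: str) -> str:
    """Remove all whitespace so the SMILES string contains no spaces."""
    return "".join(smiles.split())


def truncate_at_sep(token_ids, sep_id=SEP_TOKEN_ID):
    """Return the ids before the first [SEP]; a plain copy if none is present."""
    ids = list(token_ids)
    if sep_id in ids:
        return ids[: ids.index(sep_id)]
    return ids


print(strip_whitespace(" CC(=O) O\n"))    # CC(=O)O
print(truncate_at_sep([5, 8, 13, 0, 0]))  # [5, 8]
```

In practice `tokenizer.decode(..., skip_special_tokens=True)` may already drop [SEP] and padding; the explicit truncation is only meant to show where a generated sequence ends.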

Datasets

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("kojima-lab/molcrawl-compounds-chembl-gpt2-medium")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-compounds-chembl-gpt2-medium")

# Generate SMILES string
prompt = "CC(=O)O"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        eos_token_id=tokenizer.convert_tokens_to_ids("[SEP]"),  # [SEP] is EOS for compounds
        pad_token_id=0,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
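To draw several candidates from one prompt, the call above can be wrapped as in the following sketch. It assumes `model` and `tokenizer` were loaded as shown in the usage example. The function names `sample_smiles` and `dedupe_smiles` are hypothetical helpers, and removing whitespace from decoded strings is a precaution rather than documented tokenizer behavior. The generated strings are not checked for chemical validity.

```python
def dedupe_smiles(candidates):
    """Strip whitespace, drop empty strings, and keep first-occurrence order."""
    seen = set()
    unique = []
    for text in candidates:
        smiles = "".join(text.split())
        if smiles and smiles not in seen:
            seen.add(smiles)
            unique.append(smiles)
    return unique


def sample_smiles(model, tokenizer, prompt, n=8, max_new_tokens=50, temperature=0.8):
    """Sample n continuations of prompt and return the distinct decoded strings."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        num_return_sequences=n,  # draw n samples from the same prompt
        eos_token_id=tokenizer.convert_tokens_to_ids("[SEP]"),
        pad_token_id=0,  # same padding id as in the usage example
    )
    decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return dedupe_smiles(decoded)
```

A call such as `sample_smiles(model, tokenizer, "CC(=O)O", n=16)` would then return up to 16 distinct strings, each beginning with the prompt.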

Training

This model was trained with the RIKEN Foundation Model pipeline. For more details, please refer to the training configuration files included in this repository.

License

This model is released under the Apache-2.0 license.

Citation

If you use this model, please cite:

@misc{molcrawl_compounds_chembl_gpt2_medium,
  title={molcrawl-compounds-chembl-gpt2-medium},
  author={{RIKEN}},
  year={2026},
  publisher={{Hugging Face}},
  url={{https://huggingface.co/kojima-lab/molcrawl-compounds-chembl-gpt2-medium}}
}