MolCrawl/protein_sequence
A GPT-2 large (774M parameters) foundation model pre-trained on protein amino acid sequences from the MolCrawl dataset.
Pre-training corpus: MolCrawl protein sequence dataset (https://github.com/mmai-framework-lab/MolCrawl-HFuploader/blob/main/workflows/hugging_face/run_upload_hf.sh)
Model Type: gpt2
Data Type: Protein
Training Date: 2026-04-14
Usage example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("kojima-lab/molcrawl-protein-sequence-gpt2-large")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-protein-sequence-gpt2-large")

# Generate a continuation of a protein amino acid sequence
prompt = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGT"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=0,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
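Because this is a standard causal language model, the same checkpoint can also be used to score sequences rather than generate them, for example to rank candidate designs by model likelihood. The snippet below is a minimal sketch, not part of the official examples: it assumes only the checkpoint shown above and uses the standard Transformers convention that passing the input IDs as labels returns the mean next-token cross-entropy, from which perplexity follows.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("kojima-lab/molcrawl-protein-sequence-gpt2-large")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-protein-sequence-gpt2-large")

# Score a candidate sequence (hypothetical example input)
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGT"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    # Supplying input_ids as labels makes the model return the mean
    # cross-entropy over next-token predictions (shifted internally).
    outputs = model(**inputs, labels=inputs["input_ids"])

# Lower mean negative log-likelihood means the model assigns the
# sequence a higher likelihood.
print(f"mean negative log-likelihood: {outputs.loss.item():.4f}")
print(f"perplexity: {torch.exp(outputs.loss).item():.4f}")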
This model was trained with the RIKEN Foundation Model pipeline. For more details, please refer to the training configuration files included in this repository.
This model is released under the Apache-2.0 license.
If you use this model, please cite:
@misc{molcrawl_protein_sequence_gpt2_large,
  title={molcrawl-protein-sequence-gpt2-large},
  author={{RIKEN}},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/kojima-lab/molcrawl-protein-sequence-gpt2-large}
}