A 345M-parameter GPT-2 medium foundation model pre-trained on human genome DNA sequences from the GRCh38 reference assembly.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained("kojima-lab/molcrawl-genome-sequence-gpt2-medium")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-genome-sequence-gpt2-medium")
# Generate DNA/genome sequence
prompt = "ATCGATCGATCGATCGATCGATCGATCGATCG"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        eos_token_id=None,  # disable early stop at token 0 (training artefact)
        pad_token_id=0,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
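
Beyond sampling continuations, a causal language model like this can also score how plausible a given DNA sequence is. The snippet below is a minimal sketch of per-token perplexity scoring, reusing the checkpoint above; the helper name score_sequence is illustrative and not part of this repository.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kojima-lab/molcrawl-genome-sequence-gpt2-medium"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def score_sequence(seq: str) -> float:
    # With labels=input_ids, the model returns the mean next-token
    # cross-entropy over the sequence; exp(loss) is the per-token
    # perplexity (lower = more plausible under the model).
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()

print(score_sequence("ATCGATCGATCGATCGATCGATCGATCGATCG"))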
This model was trained with the RIKEN Foundation Model pipeline. For more details, please refer to the training configuration files included in this repository.
This model is released under the Apache-2.0 license.
If you use this model, please cite:
@misc{molcrawl_genome_sequence_medium_gpt2,
  title={molcrawl-genome-sequence-gpt2-medium},
  author={{RIKEN}},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/kojima-lab/molcrawl-genome-sequence-gpt2-medium}
}