---
license: apache-2.0
tags:
- pytorch
- gpt2
- dna-genome
pipeline_tag: text-generation
---

# molcrawl-genome-sequence-gpt2-medium
## Model Description

A GPT-2 medium (345M parameters) foundation model pre-trained on human genome DNA sequences from the GRCh38 reference assembly.
## Datasets

- GRCh38 human genome reference assembly (pre-training corpus): https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/

- **Model Type:** gpt2
- **Data Type:** DNA/Genome
- **Training Date:** 2026-04-14
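The card does not include the preprocessing code, but as a rough illustration, a FASTA reference such as GRCh38 can be split into fixed-length chunks before tokenization for language-model pre-training. This is a minimal sketch under stated assumptions: the function name `fasta_chunks`, the chunk length, and the file handling are illustrative, not taken from the actual RIKEN pipeline.

```python
def fasta_chunks(path, chunk_len=1024):
    """Yield fixed-length, uppercase sequence chunks from a FASTA file.

    Illustrative only: the real pipeline's chunking strategy is unknown.
    """
    buf = []
    buf_len = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):  # a header line starts a new record
                continue
            seq = line.strip().upper()
            buf.append(seq)
            buf_len += len(seq)
            # Emit complete chunks as soon as enough bases accumulate
            while buf_len >= chunk_len:
                joined = "".join(buf)
                yield joined[:chunk_len]
                buf = [joined[chunk_len:]]
                buf_len = len(buf[0])
```

Trailing bases shorter than `chunk_len` are dropped in this sketch; a real pipeline might pad or carry them into the next record instead.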
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("kojima-lab/molcrawl-genome-sequence-gpt2-medium")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-genome-sequence-gpt2-medium")

# Generate a DNA/genome sequence continuation
prompt = "ATCGATCGATCGATCGATCGATCGATCGATCG"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        eos_token_id=None,  # HF config.json has legacy eos_token_id=0; disable early stop
        pad_token_id=0,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
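A quick sanity check on sampled output is to verify the alphabet and GC content of the decoded string. The helpers below are a small illustrative sketch (the hard-coded `generated` string stands in for real `tokenizer.decode(...)` output, which requires downloading the model).

```python
def is_valid_dna(seq: str) -> bool:
    """True if the string is non-empty and contains only A/C/G/T characters."""
    return bool(seq) and set(seq.upper()) <= set("ACGT")

def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

generated = "ATCGATCGGGCCATAT"  # placeholder for actual model output
print(f"valid={is_valid_dna(generated)} gc={gc_content(generated):.2f}")
```

The human genome averages roughly 41% GC, so generated sequences with GC content far outside that range may indicate degenerate sampling (e.g. temperature set too high or too low).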
## Training

This model was trained with the RIKEN Foundation Model pipeline. For more details, please refer to the training configuration files included in this repository.
## License

This model is released under the Apache-2.0 license.
## Citation

If you use this model, please cite:

```bibtex
@misc{molcrawl_genome_sequence_gpt2_medium,
  title={molcrawl-genome-sequence-gpt2-medium},
  author={{RIKEN}},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/kojima-lab/molcrawl-genome-sequence-gpt2-medium}
}
```