---
license: apache-2.0
tags:
  - pytorch
  - gpt2
  - dna-genome
pipeline_tag: text-generation
---

molcrawl-genome-sequence-gpt2-medium

Model Description

GPT-2 medium (345M parameters) foundation model pre-trained on human genome DNA sequences from the GRCh38 reference assembly.

Datasets

Human genome DNA sequences from the GRCh38 reference assembly.

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("kojima-lab/molcrawl-genome-sequence-gpt2-medium")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-genome-sequence-gpt2-medium")

# Generate a DNA/genome sequence continuation
prompt = "ATCGATCGATCGATCGATCGATCGATCGATCG"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        eos_token_id=None,  # HF config.json has legacy eos_token_id=0; disable early stop
        pad_token_id=0,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Training

This model was trained with the RIKEN Foundation Model pipeline. For more details, please refer to the training configuration files included in this repository.

License

This model is released under the Apache-2.0 license.

Citation

If you use this model, please cite:

```bibtex
@misc{molcrawl_genome_sequence_gpt2_medium,
  title={molcrawl-genome-sequence-gpt2-medium},
  author={{RIKEN}},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/kojima-lab/molcrawl-genome-sequence-gpt2-medium}
}
```