GPT-2 large (774M parameters) foundation model pre-trained on RNA gene expression sequences from the MolCrawl dataset.
MolCrawl RNA gene expression sequence dataset: https://github.com/mmai-framework-lab/MolCrawl-HFuploader/blob/main/workflows/hugging_face/run_upload_hf.sh (Pre-training corpus)
Model Type: gpt2
Data Type: RNA
Training Date: 2026-04-14
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("kojima-lab/molcrawl-rna-gpt2-large")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-rna-gpt2-large")

# Generate next gene-id tokens (RNA gene-list model)
prompt = "ENSG00000000003 ENSG00000000005 ENSG00000000419"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        eos_token_id=None,  # HF config.json has legacy eos_token_id=0; disable early stop
        pad_token_id=0,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
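The decoded output is a single string, so downstream code will usually want the generated gene IDs as a list. The following is a minimal sketch, assuming the tokenizer decodes to whitespace-separated Ensembl gene IDs in the same form as the prompt (`ENSG` followed by 11 digits). The helper name `split_gene_ids` is hypothetical and not part of this repository.

```python
import re
from typing import List

# Assumption: generated tokens decode to Ensembl human gene IDs,
# i.e. "ENSG" followed by 11 digits, as in the prompt above.
ENSEMBL_GENE_ID = re.compile(r"^ENSG\d{11}$")


def split_gene_ids(decoded: str, prompt: str = "") -> List[str]:
    """Return the well-formed gene IDs that follow the prompt, in order."""
    tokens = decoded.split()
    prompt_tokens = prompt.split()
    # Drop the prompt only if the decoded text really starts with it.
    if prompt_tokens and tokens[: len(prompt_tokens)] == prompt_tokens:
        tokens = tokens[len(prompt_tokens):]
    # Keep only tokens that look like gene IDs; discard anything else.
    return [t for t in tokens if ENSEMBL_GENE_ID.match(t)]


# Usage with a hand-written string standing in for the decoded model output.
example_prompt = "ENSG00000000003 ENSG00000000005 ENSG00000000419"
example_decoded = example_prompt + " ENSG00000000457 not_a_gene ENSG00000000460"
generated = split_gene_ids(example_decoded, example_prompt)
```

In practice, `example_decoded` would be replaced by the result of `tokenizer.decode(output_ids[0], skip_special_tokens=True)`. Filtering by pattern is a conservative choice here: it silently drops any token that does not decode to a gene ID, which may or may not be what a given analysis needs.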
This model was trained with the RIKEN Foundation Model pipeline. For more details, please refer to the training configuration files included in this repository.
This model is released under the Apache-2.0 license.
If you use this model, please cite:
@misc{molcrawl_rna_gpt2_large,
title={molcrawl-rna-gpt2-large},
author={{RIKEN}},
year={2026},
publisher={{Hugging Face}},
url={{https://huggingface.co/kojima-lab/molcrawl-rna-gpt2-large}}
}