molcrawl-protein-sequence-gpt2-small

Model Description

A GPT-2 small (124M parameters) foundation model pre-trained on protein amino acid sequences from the MolCrawl dataset.

  • Model Type: gpt2
  • Data Type: Protein
  • Training Date: 2026-03-30
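
Protein language models of this kind typically tokenize sequences at the residue (single amino-acid) level, but the exact scheme can be confirmed by inspecting the tokenizer shipped with this repository. The snippet below is a minimal sketch under that assumption; it only loads the tokenizer and prints how a short peptide is split.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-protein-sequence-gpt2-small")
print(tokenizer.vocab_size)              # size of the amino-acid vocabulary
print(tokenizer.tokenize("MKTAYIAKQ"))   # how a short peptide is tokenized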

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("kojima-lab/molcrawl-protein-sequence-gpt2-small")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-protein-sequence-gpt2-small")

# Generate protein sequence
prompt = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGT"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        eos_token_id=None,   # disable early stop at token 0 (training artefact)
        pad_token_id=0,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
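
Beyond generation, the same checkpoint can score a sequence by its average per-token negative log-likelihood, which is useful for ranking candidate proteins. The snippet below is a minimal sketch that reuses the model and tokenizer loaded above and relies on the standard causal-LM loss returned when labels are passed; the helper name sequence_nll is illustrative, not part of this repository.

# Score a sequence: lower average NLL means the sequence is more "natural" to the model.
def sequence_nll(sequence: str) -> float:
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy over tokens.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return loss.item()

print(sequence_nll("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGT"))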

Training

This model was trained with the RIKEN Foundation Model pipeline. For more details, please refer to the training configuration files included in this repository.
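
If the configuration files follow the standard Hugging Face layout, the architecture hyperparameters can also be inspected programmatically. The snippet below is a sketch under that assumption and uses the usual GPT-2 config attribute names.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("kojima-lab/molcrawl-protein-sequence-gpt2-small")
print(config.n_layer, config.n_head, config.n_embd, config.vocab_size)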

License

This model is released under the Apache-2.0 license.

Citation

If you use this model, please cite:

@misc{molcrawl_protein_sequence_gpt2_small,
  title={molcrawl-protein-sequence-gpt2-small},
  author={{RIKEN}},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/kojima-lab/molcrawl-protein-sequence-gpt2-small}
}