molcrawl-molecule-nat-lang-gpt2-small

Model Description

A GPT-2 small (124M parameters) foundation model pre-trained on molecule-related natural-language text, tokenized with a standard GPT-2 BPE tokenizer (vocab_size=50257).
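
As a rough sanity check on the 124M figure, the parameter count can be reproduced by hand. The sketch below assumes the standard GPT-2 small shape (12 layers, hidden size 768, 1024 positions, tied input/output embeddings); only the vocabulary size comes from this card, and `gpt2_param_count` is a hypothetical helper written for illustration.

```python
def gpt2_param_count(vocab_size, n_positions=1024, n_embd=768, n_layer=12):
    """Count parameters of a GPT-2 style decoder with tied input/output embeddings.

    The default shape values are assumptions (standard GPT-2 small),
    not values read from this repository's config.json.
    """
    # Token embedding table plus learned position embedding table.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Each layer norm has a scale vector and a bias vector.
    layer_norm = 2 * n_embd
    # Fused QKV projection plus the attention output projection (weights + biases).
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward: expand to 4 * n_embd, then project back (weights + biases).
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # A block has two layer norms, one attention module, and one MLP.
    block = 2 * layer_norm + attention + mlp
    # The last term is the final layer norm after the stack of blocks.
    return embeddings + n_layer * block + layer_norm


total = gpt2_param_count(vocab_size=50257)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the total comes to 124,439,808, which matches the rounded figure above. If the repository's config.json uses a different shape, the defaults should be adjusted accordingly.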

Datasets

MolCrawl molecule natural language corpus: https://github.com/mmai-framework-lab/MolCrawl-HFuploader/blob/main/workflows/hugging_face/run_upload_hf.sh (Pre-training corpus)
Model Type: gpt2
Data Type: Molecule-NL
Training Date: 2026-04-14

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("kojima-lab/molcrawl-molecule-nat-lang-gpt2-small")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-molecule-nat-lang-gpt2-small")

# Generate molecule-related text
prompt = "The compound with SMILES CC(=O)Oc1ccccc1C(=O)O represents aspirin, which"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.8,
        eos_token_id=None,  # HF config.json has legacy eos_token_id=0; disable early stop
        pad_token_id=0,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
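
The comment on `eos_token_id` above notes that config.json carries a legacy value. One way to see such a disagreement is to compare the special-token ids reported by the config with those reported by the tokenizer. The following is a minimal sketch: `special_token_mismatches` and `load_special_token_ids` are hypothetical helpers, and the ids in the commented example call are illustrative values, not values read from this repository.

```python
def special_token_mismatches(config_ids, tokenizer_ids):
    """Return {name: (config_value, tokenizer_value)} for every id that disagrees."""
    names = sorted(set(config_ids) | set(tokenizer_ids))
    return {
        name: (config_ids.get(name), tokenizer_ids.get(name))
        for name in names
        if config_ids.get(name) != tokenizer_ids.get(name)
    }


def load_special_token_ids(repo_id):
    """Fetch special-token ids from a Hub repository's config and tokenizer.

    Imported lazily so the comparison helper above runs without transformers
    installed. Calling this function needs network access or a local cache.
    """
    from transformers import AutoConfig, AutoTokenizer

    config = AutoConfig.from_pretrained(repo_id)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    names = ("bos_token_id", "eos_token_id", "pad_token_id")
    # getattr with a default keeps this robust if an attribute is absent.
    config_ids = {name: getattr(config, name, None) for name in names}
    tokenizer_ids = {name: getattr(tokenizer, name, None) for name in names}
    return config_ids, tokenizer_ids


# Example call (downloads files, so it is left commented out):
# config_ids, tokenizer_ids = load_special_token_ids(
#     "kojima-lab/molcrawl-molecule-nat-lang-gpt2-small"
# )
# print(special_token_mismatches(config_ids, tokenizer_ids))
```

If the two sources disagree on `eos_token_id`, passing the value explicitly to `generate`, as the snippet above does, avoids relying on either default.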

Training

This model was trained with the RIKEN Foundation Model pipeline. For more details, please refer to the training configuration files included in this repository.

License

This model is released under the Apache-2.0 license.

Citation

If you use this model, please cite:

@misc{molcrawl_molecule_nat_lang_gpt2_small,
  title={molcrawl-molecule-nat-lang-gpt2-small},
  author={{RIKEN}},
  year={2026},
  publisher={{Hugging Face}},
  url={{https://huggingface.co/kojima-lab/molcrawl-molecule-nat-lang-gpt2-small}}
}
