SMolGen
A 360M-parameter causal language model for de novo molecule generation, trained on SMILES strings from PubChem.
The model was pretrained on ~40 million molecules sourced from PubChem and filtered by:
Decoder-only Transformer (LlamaForCausalLM) with grouped-query attention (GQA):
| Parameter | Value |
|---|---|
| Hidden size | 960 |
| Intermediate size | 2560 |
| Layers | 32 |
| Attention heads | 15 (5 KV heads) |
| Max sequence length | 8192 |
| Vocabulary size | 36 |
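Assuming the stock Hugging Face Llama implementation the model card names, the table above maps onto a `LlamaConfig` roughly as follows. This is an illustrative sketch using the standard `transformers` field names, not the model's actual config file; RoPE and tying settings are left at their defaults because the card does not state them.

```python
from transformers import LlamaConfig

# Hypothetical config mirroring the table above (not the shipped config.json).
config = LlamaConfig(
    vocab_size=36,
    hidden_size=960,
    intermediate_size=2560,
    num_hidden_layers=32,
    num_attention_heads=15,
    num_key_value_heads=5,  # GQA: 15 query heads share 5 KV heads (3:1)
    max_position_embeddings=8192,
)
```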
This model uses the REINVENT4 tokenizer, a chemistry-aware tokenizer that splits SMILES strings with a hand-crafted regex covering atoms, bonds, ring closures, branches, and bracket atoms, yielding the 36-token vocabulary listed above.
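The exact regex ships with REINVENT4; a commonly used SMILES-splitting pattern in the same spirit looks like the sketch below. The pattern and helper name are illustrative approximations, not the model's vocabulary file.

```python
import re

# Approximate chemistry-aware pattern: bracket atoms, two-letter halogens
# (Br, Cl), organic-subset and aromatic atoms, bonds, branches, ring closures.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemistry-aware tokens."""
    return SMILES_PATTERN.findall(smiles)

tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

Because every character of a well-formed SMILES string falls into one of the alternations, joining the tokens back together recovers the input exactly.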
Pass an empty string to prompt the model to generate novel SMILES from scratch:
```python
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

model = AutoModelForCausalLM.from_pretrained("ddidacus/smolgen-pubchem-360M-base")
tokenizer = PreTrainedTokenizerFast.from_pretrained("ddidacus/smolgen-pubchem-360M-base")

# An empty prompt makes the model generate molecules from scratch.
inputs = tokenizer("", return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=1.0,
    num_return_sequences=10,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

smiles_list = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(smiles_list)
```
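Sampling with `num_return_sequences=10` frequently yields duplicate or empty strings. A minimal post-processing sketch (pure Python; the helper name is mine) keeps the first occurrence of each non-empty candidate; for chemical validity you would additionally parse each string with a toolkit such as RDKit.

```python
def unique_smiles(candidates: list[str]) -> list[str]:
    """Drop empty strings and duplicates, preserving generation order."""
    seen = set()
    unique = []
    for smi in candidates:
        if smi and smi not in seen:
            seen.add(smi)
            unique.append(smi)
    return unique

filtered = unique_smiles(["CCO", "CCO", "", "c1ccccc1"])
```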