---
license: mit
tags:
- chemistry
- molecular-generation
- structure-based-design
- transformer
- reinforcement-learning
- gpt-2
datasets:
- HUBioDataLab/DrugGEN-chembl-smiles
- antoinebcx/smiles-molecules-moses
base_model:
- openai-community/gpt2
pipeline_tag: reinforcement-learning
library_name: transformers
---

# 🏂 ProteinSkier

**ProteinSkier** is a GPT-2–based language model that “carves fresh lines” through chemical space, producing drug-like SMILES strings with an explicit bias toward **ADMET quality, novelty, and synthesizability**.

## 1 · Why another generative model?

Traditional generative models often rediscover known scaffolds or output molecules that fail late-stage ADMET filters. ProteinSkier addresses this by coupling large-scale pre-training on ~2 M curated molecules **with a second-stage Reinforcement Fine-Tuning (RFT)** that rewards:

| Component | Reward signal (λ) | Source |
|-----------|-------------------|--------|
| **Validity** | hard filter | RDKit sanitisation |
| **QED ↑** | 0.35 | RDKit |
| **Novelty ↑** | 0.25 | training-set hash table |
| **Lipinski pass ↑** | 0.20 | RDKit |
| **logP in [–1, 4]** | 0.10 | RDKit |
| **Predicted tox ↓** | 0.10 | internal classifier |

The policy is updated with *policy-gradient REINFORCE*; low-quality trajectories are rejected via an adaptive threshold (see `FullDatasetRFTTrainer` in the code).

## 2 · Intended uses & scope

| Stage | Example use-case | Not a good fit |
|-------|------------------|----------------|
| *Hit finding* | Rapidly scaffold-hop around a weak binder identified by docking. | Predicting absolute IC₅₀/Kᵢ values. |
| *Lead optimisation* | Generating analogues that respect Lipinski & BBB guidelines. | Ensuring synthetic accessibility without chemist review. |
| *Ideation / teaching* | Demonstrating language-model chemistry in the classroom. | Production-scale enumeration without downstream filtering. |
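The weighted reward above can be sketched as a simple composite function. This is a hypothetical, stdlib-only illustration — the property values (`qed`, `novelty`, `lipinski_pass`, `logp`, `pred_tox`) are assumed to be precomputed elsewhere (RDKit and the internal classifier in the real pipeline), and only the weights come from the table:

```python
def composite_reward(qed, novelty, lipinski_pass, logp, pred_tox, valid=True):
    """Hypothetical sketch of the composite RFT reward (weights from the table)."""
    if not valid:  # validity is a hard filter, not a weighted term
        return 0.0
    logp_ok = 1.0 if -1.0 <= logp <= 4.0 else 0.0
    return (0.35 * qed
            + 0.25 * novelty
            + 0.20 * (1.0 if lipinski_pass else 0.0)
            + 0.10 * logp_ok
            + 0.10 * (1.0 - pred_tox))  # lower predicted tox → higher reward

# Example: a reasonably drug-like molecule
print(round(composite_reward(qed=0.8, novelty=1.0, lipinski_pass=True,
                             logp=2.3, pred_tox=0.1), 3))  # → 0.92
```

Trajectories whose composite reward falls below the adaptive threshold would then be rejected before the REINFORCE update.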
## 3 · Quick start

> Requires `transformers ≥ 4.42`, `torch ≥ 2.2`, `rdkit`, `accelerate`.

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

model_id = "ProteinDance/ProteinSkier"
tok = AutoTokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

# Seed generation from the BOS token (an empty prompt yields no input ids)
prompt = tok(tok.bos_token, return_tensors="pt").input_ids

# Generate 5 novel molecules
gen = model.generate(
    prompt.repeat(5, 1),
    max_length=128,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
)
smiles = tok.batch_decode(gen, skip_special_tokens=True)
print("\n".join(smiles))
```

## 4 · Limitations & caveats

- No guaranteed synthesizability – always perform retrosynthetic analysis.
- Property estimators used in RFT are fast surrogates; wet-lab assay results will vary.
- Output may include patented molecules – run IP checks.
- The ADMET focus biases chemistry toward oral drugs; the model is unsuitable for agrochemicals or materials.
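Given the caveats above, generated SMILES should be screened before use. A minimal, stdlib-only sketch of a novelty screen mirroring the training-set hash table used for the novelty reward (canonicalisation via RDKit is omitted here, so this only catches exact string matches — the helper name is hypothetical):

```python
def screen_novel(generated, training_smiles):
    """Keep generated SMILES absent from the training-set hash table, deduplicated."""
    seen = set(training_smiles)  # training-set hash table
    novel, emitted = [], set()
    for smi in generated:
        if smi not in seen and smi not in emitted:
            novel.append(smi)
            emitted.add(smi)
    return novel

train = {"CCO", "c1ccccc1"}
gen = ["CCO", "CCN", "CCN", "c1ccccc1O"]
print(screen_novel(gen, train))  # → ['CCN', 'c1ccccc1O']
```

In practice, canonicalise both sets with RDKit first so that equivalent SMILES written differently are not counted as novel.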