---
license: mit
tags:
- chemistry
- molecular-generation
- structure-based-design
- transformer
- reinforcement-learning
- gpt-2
datasets:
- HUBioDataLab/DrugGEN-chembl-smiles
- antoinebcx/smiles-molecules-moses
base_model:
- openai-community/gpt2
pipeline_tag: reinforcement-learning
library_name: transformers
---
|
|
|
|
|
# 🏂 ProteinSkier
|
|
|
|
|
**ProteinSkier** is a GPT-2–based language model that “carves fresh lines” through chemical space, producing drug-like SMILES strings with an explicit bias toward **ADMET quality, novelty, and synthesizability**.
|
|
|
|
|
## 1 · Why another generative model?
|
|
|
|
|
Traditional generative models often rediscover known scaffolds or output molecules that fail late-stage ADMET filters. ProteinSkier addresses this by coupling large-scale pre-training on ~2 M curated molecules **with a second-stage Reinforcement Fine-Tuning (RFT)** that rewards:
|
|
|
|
|
| Component | Reward weight (λ) | Source |
|-----------|-------------------|--------|
| **Validity** | hard filter | RDKit sanitisation |
| **QED ↑** | 0.35 | RDKit |
| **Novelty ↑** | 0.25 | training-set hash table |
| **Lipinski pass ↑** | 0.20 | RDKit |
| **logP in [–1, 4]** | 0.10 | RDKit |
| **Predicted tox ↓** | 0.10 | internal classifier |
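
These weights combine into a single scalar reward per molecule. The sketch below shows one way such a composite reward could be assembled with RDKit; the novelty hash table and the internal toxicity classifier are not released, so they appear here as stand-in arguments, and `composite_reward` is an illustrative name, not the repository's API.

```python
from rdkit import Chem
from rdkit.Chem import QED, Crippen, Descriptors, Lipinski

WEIGHTS = {"qed": 0.35, "novelty": 0.25, "lipinski": 0.20, "logp": 0.10, "tox": 0.10}

def composite_reward(smiles, train_hashes, tox_prob=lambda mol: 0.0):
    """Weighted reward for one SMILES string; 0.0 for invalid molecules."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                                    # validity is a hard filter
        return 0.0
    logp = Crippen.MolLogP(mol)
    lipinski_ok = (Lipinski.NumHDonors(mol) <= 5
                   and Lipinski.NumHAcceptors(mol) <= 10
                   and Descriptors.MolWt(mol) <= 500
                   and logp <= 5)
    return (WEIGHTS["qed"] * QED.qed(mol)
            + WEIGHTS["novelty"] * (Chem.MolToSmiles(mol) not in train_hashes)
            + WEIGHTS["lipinski"] * lipinski_ok
            + WEIGHTS["logp"] * (-1.0 <= logp <= 4.0)
            + WEIGHTS["tox"] * (1.0 - tox_prob(mol)))  # lower predicted tox is better
```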
|
|
|
|
|
The policy is updated with *policy-gradient REINFORCE*; low-quality trajectories are rejected via an adaptive threshold (see `FullDatasetRFTTrainer` in the code).
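
For intuition, a minimal REINFORCE step consistent with this description might look like the sketch below. It uses a mean-reward baseline and a fixed rejection threshold in place of the adaptive machinery in `FullDatasetRFTTrainer`, and omits padding-token masking for brevity:

```python
import torch

def reinforce_step(model, tok, optimizer, reward_fn, batch_size=64, threshold=0.3):
    # Sample a batch of trajectories from the current policy.
    bos = tok("<bos>", return_tensors="pt").input_ids
    seqs = model.generate(bos.repeat(batch_size, 1), max_length=128,
                          do_sample=True, top_p=0.95, temperature=0.7,
                          pad_token_id=tok.eos_token_id)
    rewards = torch.tensor([reward_fn(s) for s in
                            tok.batch_decode(seqs, skip_special_tokens=True)])

    # Reject low-quality trajectories (fixed threshold here, adaptive in the trainer).
    keep = rewards > threshold
    if not keep.any():
        return
    seqs, rewards = seqs[keep], rewards[keep]

    # REINFORCE: loss = -(R - baseline) * log p(sequence).
    logits = model(seqs).logits[:, :-1]
    logp = torch.log_softmax(logits, -1).gather(-1, seqs[:, 1:, None]).squeeze(-1)
    advantage = (rewards - rewards.mean()).unsqueeze(1)   # mean-reward baseline
    loss = -(advantage * logp).sum(dim=1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```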
|
|
|
|
|
## 2 · Intended uses & scope
|
|
|
|
|
| Stage | Example use-case | Not a good fit |
|-------|------------------|----------------|
| *Hit finding* | Rapidly scaffold-hop around a weak binder identified by docking. | Predicting absolute IC₅₀/Kᵢ values. |
| *Lead optimisation* | Generating analogues that respect Lipinski & BBB guidelines. | Ensuring synthetic accessibility without chemist review. |
| *Ideation / teaching* | Demonstrating language-model chemistry in the classroom. | Production-scale enumeration without downstream filtering. |
|
|
|
|
|
## 3 · Quick start
|
|
|
|
|
> Requires `transformers ≥ 4.42`, `torch ≥ 2.2`, `rdkit`, `accelerate`.
|
|
|
|
|
```python
from transformers import AutoTokenizer, GPT2LMHeadModel

model_id = "ProteinDance/ProteinSkier"
tok = AutoTokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)
model.eval()

# Generate 5 novel molecules from the <bos> prompt
prompt = tok("<bos>", return_tensors="pt").input_ids
gen = model.generate(
    prompt.repeat(5, 1),            # batch of 5 identical prompts
    max_length=128,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
    pad_token_id=tok.eos_token_id,  # avoids the missing-pad-token warning
)
smiles = tok.batch_decode(gen, skip_special_tokens=True)
print("\n".join(smiles))
```
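
Generated strings are not guaranteed to parse; a quick RDKit sanitisation pass over the `smiles` list above keeps only valid molecules:

```python
from rdkit import Chem

valid = [s for s in smiles if Chem.MolFromSmiles(s) is not None]
print(f"{len(valid)}/{len(smiles)} generated strings are valid molecules")
```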
|
|
## 4 · Limitations & caveats

- No guaranteed synthesizability – always perform retrosynthetic analysis.
- The property estimators used in RFT are fast surrogates; wet-lab assay results may differ.
- Output may include patented molecules – run IP checks before use.
- The ADMET focus biases chemistry toward oral small-molecule drugs; the model is unsuitable for agrochemicals or materials.
|
|
|