---
license: mit
tags:
- chemistry
- molecular-generation
- structure-based-design
- transformer
- reinforcement-learning
- gpt-2
datasets:
- HUBioDataLab/DrugGEN-chembl-smiles
- antoinebcx/smiles-molecules-moses
base_model:
- openai-community/gpt2
pipeline_tag: reinforcement-learning
library_name: transformers
---
# 🏂 ProteinSkier
**ProteinSkier** is a GPT-2–based language model that “carves fresh lines” through chemical space, producing drug-like SMILES strings with an explicit bias toward **ADMET quality, novelty, and synthesizability**.
## 1 · Why another generative model?
Traditional generative models often rediscover known scaffolds or output molecules that fail late-stage ADMET filters.
ProteinSkier addresses this by coupling large-scale pre-training on ~2 M curated molecules **with a second-stage Reinforcement Fine-Tuning (RFT)** that rewards:
| Component | Reward signal (λ) | Source |
|-----------|-------------------|--------|
| **Validity** | hard filter | RDKit sanitisation |
| **QED ↑** | 0.35 | RDKit |
| **Novelty ↑** | 0.25 | training-set hash table |
| **Lipinski pass ↑** | 0.20 | RDKit |
| **logP in [–1, 4]** | 0.10 | RDKit |
| **Predicted tox ↓** | 0.10 | internal classifier |
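The weighted sum above can be reproduced almost entirely with RDKit. The sketch below is illustrative only — the function name, the novelty check against canonical training SMILES, and the `tox_prob` stand-in for the internal toxicity classifier are assumptions, not the released training code:

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, QED

def composite_reward(smiles: str, train_canon: set, tox_prob) -> float:
    """Illustrative weighted reward; weights mirror the table above."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                           # validity: hard filter, not a weighted term
        return 0.0

    logp = Crippen.MolLogP(mol)
    qed = QED.qed(mol)                                          # drug-likeness in [0, 1]
    novelty = float(Chem.MolToSmiles(mol) not in train_canon)   # unseen in the training set
    lipinski = float(
        Descriptors.MolWt(mol) <= 500
        and Lipinski.NumHDonors(mol) <= 5
        and Lipinski.NumHAcceptors(mol) <= 10
        and logp <= 5
    )
    logp_ok = float(-1.0 <= logp <= 4.0)
    low_tox = 1.0 - tox_prob(smiles)          # stand-in for the internal toxicity classifier

    return (0.35 * qed + 0.25 * novelty + 0.20 * lipinski
            + 0.10 * logp_ok + 0.10 * low_tox)
```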
The policy is updated with *policy-gradient REINFORCE*; low-quality trajectories are rejected via an adaptive threshold (see `FullDatasetRFTTrainer` in the code).
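For orientation, a minimal sketch of such a REINFORCE step with trajectory rejection is shown below. Everything here is illustrative — the function, the threshold schedule, and the unmasked padding positions are simplifications and do not reflect the internals of `FullDatasetRFTTrainer`:

```python
import torch

def reinforce_step(model, tok, optimizer, reward_fn, batch_size=64, threshold=0.0):
    """One illustrative REINFORCE update with rejection of low-reward samples."""
    prompt = tok("<bos>", return_tensors="pt").input_ids.repeat(batch_size, 1)
    samples = model.generate(prompt, max_length=128, do_sample=True,
                             top_p=0.95, pad_token_id=tok.eos_token_id)

    rewards = torch.tensor([reward_fn(s) for s in
                            tok.batch_decode(samples, skip_special_tokens=True)])
    keep = rewards > threshold                  # reject low-quality trajectories
    if keep.sum() == 0:
        return threshold

    # Re-run the kept samples with gradients to get per-token log-probabilities
    logits = model(samples[keep]).logits[:, :-1]          # predict token t+1 from tokens <= t
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, samples[keep][:, 1:].unsqueeze(-1)).squeeze(-1)

    # Policy gradient: maximise reward-weighted log-likelihood of sampled SMILES
    # (padding positions are not masked here for brevity)
    loss = -(rewards[keep].unsqueeze(1) * token_logp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Simple adaptive threshold: drift toward the running mean reward
    return 0.9 * threshold + 0.1 * rewards.mean().item()
```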
## 2 · Intended uses & scope
| Stage | Example use-case | Not a good fit |
|-------|------------------|----------------|
| *Hit finding* | Rapidly scaffold-hop around a weak binder identified by docking. | Predicting absolute IC₅₀/Kᵢ values. |
| *Lead optimisation* | Generating analogues that respect Lipinski & BBB guidelines. | Ensuring synthetic accessibility without chemist review. |
| *Ideation / teaching* | Demonstrating language-model chemistry in the classroom. | Production-scale enumeration without downstream filtering. |
## 3 · Quick start
> Requires `transformers ≥ 4.42`, `torch ≥ 2.2`, `rdkit`, `accelerate`.
```python
from transformers import AutoTokenizer, GPT2LMHeadModel

model_id = "ProteinDance/ProteinSkier"
tok = AutoTokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

# Generate 5 novel molecules by sampling from the <bos> prompt
prompt = tok("<bos>", return_tensors="pt").input_ids
gen = model.generate(
    prompt.repeat(5, 1),            # batch of 5 identical prompts
    max_length=128,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
    pad_token_id=tok.eos_token_id,  # avoids the missing-pad-token warning
)

smiles = tok.batch_decode(gen, skip_special_tokens=True)
print("\n".join(smiles))
```
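Generated strings are not guaranteed to parse; a minimal post-processing sketch using RDKit sanitisation (the same validity check used in the RFT reward) might look like this, assuming `smiles` comes from the snippet above:

```python
from rdkit import Chem
from rdkit.Chem import Crippen, QED

# Keep only molecules that survive RDKit sanitisation, then score them
valid = []
for smi in smiles:
    mol = Chem.MolFromSmiles(smi)   # returns None if parsing/sanitisation fails
    if mol is None:
        continue
    valid.append({
        "smiles": Chem.MolToSmiles(mol),  # canonical form
        "qed": QED.qed(mol),
        "logp": Crippen.MolLogP(mol),
    })

for entry in sorted(valid, key=lambda e: e["qed"], reverse=True):
    print(f'{entry["smiles"]}\tQED={entry["qed"]:.2f}\tlogP={entry["logp"]:.2f}')
```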
## 4 · Limitations & caveats
- No guaranteed synthesizability – always perform retrosynthetic analysis.
- Property estimators used in RFT are fast in-silico surrogates; wet-lab assay results may differ substantially.
- Output may include patented molecules – run IP checks.
- ADMET focus biases chemistry toward oral drugs; unsuitable for agrochemicals or materials.