---
license: mit
tags:
- chemistry
- molecular-generation
- structure-based-design
- transformer
- reinforcement-learning
- gpt-2
datasets:
- HUBioDataLab/DrugGEN-chembl-smiles
- antoinebcx/smiles-molecules-moses
base_model:
- openai-community/gpt2
pipeline_tag: reinforcement-learning
library_name: transformers
---

# 🏂 ProteinSkier

**ProteinSkier** is a GPT-2–based language model that “carves fresh lines” through chemical space, producing drug-like SMILES strings with an explicit bias toward **ADMET quality, novelty, and synthesizability**.

## 1 · Why another generative model?

Traditional generative models often rediscover known scaffolds or output molecules that fail late-stage ADMET filters. ProteinSkier addresses this by coupling large-scale pre-training on ~2 M curated molecules **with a second-stage Reinforcement Fine-Tuning (RFT)** that rewards:

| Component | Reward signal (λ) | Source |
|-----------|-------------------|--------|
| **Validity** | hard filter | RDKit sanitisation |
| **QED ↑** | 0.35 | RDKit |
| **Novelty ↑** | 0.25 | training-set hash table |
| **Lipinski pass ↑** | 0.20 | RDKit |
| **logP in [–1, 4]** | 0.10 | RDKit |
| **Predicted tox ↓** | 0.10 | internal classifier |

The policy is updated with *policy-gradient REINFORCE*; low-quality trajectories are rejected via an adaptive threshold (see `FullDatasetRFTTrainer` in the code).

## 2 · Intended uses & scope

| Stage | Example use-case | Not a good fit |
|-------|------------------|----------------|
| *Hit finding* | Rapidly scaffold-hop around a weak binder identified by docking. | Predicting absolute IC₅₀/Kᵢ values. |
| *Lead optimisation* | Generating analogues that respect Lipinski & BBB guidelines. | Ensuring synthetic accessibility without chemist review. |
| *Ideation / teaching* | Demonstrating language-model chemistry in the classroom. | Production-scale enumeration without downstream filtering. |
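The weighted reward above can be sketched as a simple composite function. This is a hypothetical, stdlib-only illustration — the property values (`qed`, `novelty`, `lipinski_pass`, `logp`, `pred_tox`) are assumed to be precomputed elsewhere (RDKit and the internal classifier in the real pipeline), and only the weights come from the table:

```python
def composite_reward(qed, novelty, lipinski_pass, logp, pred_tox, valid=True):
    """Hypothetical sketch of the composite RFT reward (weights from the table)."""
    if not valid:  # validity is a hard filter, not a weighted term
        return 0.0
    logp_ok = 1.0 if -1.0 <= logp <= 4.0 else 0.0
    return (0.35 * qed
            + 0.25 * novelty
            + 0.20 * (1.0 if lipinski_pass else 0.0)
            + 0.10 * logp_ok
            + 0.10 * (1.0 - pred_tox))  # lower predicted tox → higher reward

# Example: a reasonably drug-like molecule
print(round(composite_reward(qed=0.8, novelty=1.0, lipinski_pass=True,
                             logp=2.3, pred_tox=0.1), 3))  # → 0.92
```

Trajectories whose composite reward falls below the adaptive threshold would then be rejected before the REINFORCE update.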
## 3 · Quick start

> Requires `transformers ≥ 4.42`, `torch ≥ 2.2`, `rdkit`, `accelerate`.

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

model_id = "ProteinDance/ProteinSkier"
tok = AutoTokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

# Seed generation from the BOS token (an empty prompt yields no input ids)
prompt = tok(tok.bos_token, return_tensors="pt").input_ids

# Generate 5 novel molecules
gen = model.generate(
    prompt.repeat(5, 1),
    max_length=128,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
)
smiles = tok.batch_decode(gen, skip_special_tokens=True)
print("\n".join(smiles))
```

## 4 · Limitations & caveats

- No guaranteed synthesizability – always perform retrosynthetic analysis.
- Property estimators used in RFT are fast surrogates; wet-lab assay results will vary.
- Output may include patented molecules – run IP checks.
- The ADMET focus biases chemistry toward oral drugs; the model is unsuitable for agrochemicals or materials.
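Given the caveats above, generated SMILES should be screened before use. A minimal, stdlib-only sketch of a novelty screen mirroring the training-set hash table used for the novelty reward (canonicalisation via RDKit is omitted here, so this only catches exact string matches — the helper name is hypothetical):

```python
def screen_novel(generated, training_smiles):
    """Keep generated SMILES absent from the training-set hash table, deduplicated."""
    seen = set(training_smiles)  # training-set hash table
    novel, emitted = [], set()
    for smi in generated:
        if smi not in seen and smi not in emitted:
            novel.append(smi)
            emitted.add(smi)
    return novel

train = {"CCO", "c1ccccc1"}
gen = ["CCO", "CCN", "CCN", "c1ccccc1O"]
print(screen_novel(gen, train))  # → ['CCN', 'c1ccccc1O']
```

In practice, canonicalise both sets with RDKit first so that equivalent SMILES written differently are not counted as novel.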