ProteinDance/ProteinSkier

Reinforcement Learning

text-generation

molecular-generation

structure-based-design

text-generation-inference


MarvinCui committed on Aug 2, 2025

Commit 08e8a4c · verified · 1 Parent(s): 0bf201c

Update README.md

Files changed (1)

README.md +74 -3


@@ -1,3 +1,74 @@
----
-license: mit
----

+---
+license: mit
+tags:
+- chemistry
+- molecular-generation
+- structure-based-design
+- transformer
+- reinforcement-learning
+- gpt-2
+datasets:
+- HUBioDataLab/DrugGEN-chembl-smiles
+- antoinebcx/smiles-molecules-moses
+base_model:
+- openai-community/gpt2
+pipeline_tag: reinforcement-learning
+library_name: transformers
+---
+# 🏂 ProteinSkier
+**ProteinSkier** is a GPT-2–based language model that “carves fresh lines” through chemical space, producing drug-like SMILES strings with an explicit bias toward **ADMET quality, novelty, and synthesizability**.
+## 1 · Why another generative model?
+Traditional generative models often rediscover known scaffolds or output molecules that fail late-stage ADMET filters.
+ProteinSkier addresses this by coupling large-scale pre-training on ~2 M curated molecules **with a second-stage Reinforcement Fine-Tuning (RFT)** that rewards:
+| Component | Reward weight (λ) | Source |
+|-----------|-------------------|--------|
+| **Validity** | hard filter | RDKit sanitisation |
+| **QED ↑** | 0.35 | RDKit |
+| **Novelty ↑** | 0.25 | training-set hash table |
+| **Lipinski pass ↑** | 0.20 | RDKit |
+| **logP in [−1, 4]** | 0.10 | RDKit |
+| **Predicted tox ↓** | 0.10 | internal classifier |
+The policy is updated with the *REINFORCE* policy-gradient algorithm; low-quality trajectories are rejected via an adaptive threshold (see `FullDatasetRFTTrainer` in the code).
+## 2 · Intended uses & scope
+| Stage | Example use-case | Not a good fit |
+|-------|------------------|----------------|
+| *Hit finding* | Rapidly scaffold-hop around a weak binder identified by docking. | Predicting absolute IC₅₀/Kᵢ values. |
+| *Lead optimisation* | Generating analogues that respect Lipinski and blood-brain barrier (BBB) guidelines. | Ensuring synthetic accessibility without chemist review. |
+| *Ideation / teaching* | Demonstrating language-model chemistry in the classroom. | Production-scale enumeration without downstream filtering. |
+## 3 · Quick start
+> Requires `transformers ≥ 4.42`, `torch ≥ 2.2`, `rdkit`, `accelerate`.
+```python
+from transformers import AutoTokenizer, GPT2LMHeadModel
+model_id = "ProteinDance/ProteinSkier"  # this repository on the Hub
+tok = AutoTokenizer.from_pretrained(model_id)
+model = GPT2LMHeadModel.from_pretrained(model_id)
+# Sample 5 molecules from the <bos> prompt (novelty is encouraged by training, not guaranteed)
+prompt = tok("<bos>", return_tensors="pt").input_ids
+gen = model.generate(
+    prompt.repeat(5, 1),
+    max_length=128,
+    do_sample=True,
+    top_p=0.95,
+    temperature=0.7,
+)
+smiles = tok.batch_decode(gen, skip_special_tokens=True)
+print("\n".join(smiles))
+```
+## 4 · Limitations & caveats
+- No guaranteed synthesizability – always perform retrosynthetic analysis.
+- Property estimators used in RFT are fast approximations; wet-lab assay results may differ.
+- Output may include patented molecules – run IP checks.
+- ADMET focus biases chemistry toward oral drugs; unsuitable for agrochemicals or materials.
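
The reward table in section 1 can be read as a validity gate followed by a weighted sum of the remaining components. The snippet below is a minimal sketch of that reading, not the code behind `FullDatasetRFTTrainer`. It assumes every component score has already been computed and scaled to [0, 1], that the toxicity term is expressed as a "non-toxic" score, and that novelty is a set lookup on canonical SMILES. The names `WEIGHTS`, `composite_reward`, `is_novel`, and the score keys are hypothetical.

```python
# Hypothetical sketch of the reward table: validity acts as a hard gate,
# then the remaining components are combined as a weighted sum.
# Weights are copied from the table; score names are illustrative.
WEIGHTS = {
    "qed": 0.35,
    "novelty": 0.25,
    "lipinski": 0.20,
    "logp_in_range": 0.10,
    "non_toxic": 0.10,  # assumed to be 1 - predicted toxicity
}


def is_novel(smiles, training_set):
    """Novelty as a hash-table lookup (assumes canonical SMILES on both sides)."""
    return smiles not in training_set


def composite_reward(valid, scores):
    """Return 0.0 for invalid molecules, else the weighted sum of scores in [0, 1]."""
    if not valid:
        return 0.0
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


# Toy example with made-up scores for a single molecule.
training_set = {"CCO", "c1ccccc1"}
candidate = "CCN"
example_scores = {
    "qed": 0.8,
    "novelty": 1.0 if is_novel(candidate, training_set) else 0.0,
    "lipinski": 1.0,
    "logp_in_range": 1.0,
    "non_toxic": 0.9,
}
example_reward = composite_reward(True, example_scores)
print(round(example_reward, 2))
```

Because the weights sum to 1, the reward stays in [0, 1] whenever each component score does, which keeps a single rejection threshold meaningful across batches.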