MarvinCui committed on
Commit 08e8a4c · verified · 1 Parent(s): 0bf201c

Update README.md

  1. README.md +74 -3
README.md CHANGED
@@ -1,3 +1,74 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ tags:
+ - chemistry
+ - molecular-generation
+ - structure-based-design
+ - transformer
+ - reinforcement-learning
+ - gpt-2
+ datasets:
+ - HUBioDataLab/DrugGEN-chembl-smiles
+ - antoinebcx/smiles-molecules-moses
+ base_model:
+ - openai-community/gpt2
+ pipeline_tag: reinforcement-learning
+ library_name: transformers
+ ---
+
+ # 🏂 ProteinSkier
+
+ **ProteinSkier** is a GPT-2–based language model that “carves fresh lines” through chemical space, producing drug-like SMILES strings with an explicit bias toward **ADMET quality, novelty, and synthesizability**.
+
+ ## 1 · Why another generative model?
+
+ Traditional generative models often rediscover known scaffolds or propose molecules that fail late-stage ADMET filters.
+ ProteinSkier addresses this by coupling large-scale pre-training on ~2 M curated molecules with a second-stage **Reinforcement Fine-Tuning (RFT)** that rewards the following components:
+
+ | Component | Weight (λ) | Source |
+ |-----------|------------|--------|
+ | **Validity** | hard filter | RDKit sanitisation |
+ | **QED ↑** | 0.35 | RDKit |
+ | **Novelty ↑** | 0.25 | training-set hash table |
+ | **Lipinski pass ↑** | 0.20 | RDKit |
+ | **logP in [−1, 4]** | 0.10 | RDKit |
+ | **Predicted tox ↓** | 0.10 | internal classifier |
+
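As a rough illustration of how the weights in the table could combine into one scalar reward, here is a minimal sketch. It assumes a plain weighted sum, treats novelty, the Lipinski pass, and the logP window as 0/1 indicators, scores toxicity as one minus a predicted probability, and gives invalid molecules zero reward. None of these choices is confirmed by the model card. `MoleculeScores` and `composite_reward` are hypothetical names, and the property values are expected to be computed elsewhere (for example with RDKit).

```python
from dataclasses import dataclass

# Weights copied from the reward table; how they are combined is an assumption.
WEIGHTS = {"qed": 0.35, "novelty": 0.25, "lipinski": 0.20, "logp": 0.10, "tox": 0.10}


@dataclass
class MoleculeScores:
    """Hypothetical container for per-molecule properties computed upstream."""
    valid: bool          # passed RDKit sanitisation
    qed: float           # QED score in [0, 1]
    is_novel: bool       # not found in the training set
    lipinski_pass: bool  # satisfies Lipinski's rule of five
    logp: float          # estimated logP
    tox_prob: float      # predicted toxicity probability in [0, 1]


def composite_reward(s: MoleculeScores) -> float:
    """Weighted sum of the table's components (assumed form, not the author's code)."""
    if not s.valid:
        return 0.0  # hard filter: invalid SMILES earn nothing (assumption)
    logp_in_range = -1.0 <= s.logp <= 4.0
    return (
        WEIGHTS["qed"] * s.qed
        + WEIGHTS["novelty"] * float(s.is_novel)
        + WEIGHTS["lipinski"] * float(s.lipinski_pass)
        + WEIGHTS["logp"] * float(logp_in_range)
        + WEIGHTS["tox"] * (1.0 - s.tox_prob)
    )


example = MoleculeScores(valid=True, qed=0.8, is_novel=True,
                         lipinski_pass=True, logp=2.1, tox_prob=0.2)
print(round(composite_reward(example), 2))  # 0.91
```

Because the five weights sum to 1.0, a reward built this way stays in [0, 1], which makes thresholds easier to interpret.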
+ The policy is updated with the REINFORCE policy-gradient algorithm; low-quality trajectories are rejected via an adaptive reward threshold (see `FullDatasetRFTTrainer` in the code).
+
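The update described above can be sketched numerically. This is not the logic of `FullDatasetRFTTrainer`, whose details are not given here; it is a minimal REINFORCE surrogate loss with a mean-reward baseline, plus one assumed form of adaptive threshold (keep only the top fraction of each batch by reward). `reinforce_loss` and `keep_fraction` are hypothetical names. Plain floats are used so the arithmetic is easy to follow; a real trainer would use tensors so that gradients flow through the log-probabilities.

```python
from statistics import mean


def reinforce_loss(log_probs, rewards, keep_fraction=0.5):
    """Return (loss, threshold) for one batch of sampled sequences.

    log_probs: summed token log-probabilities, one float per sequence.
    rewards:   scalar reward per sequence.
    The threshold rule (top `keep_fraction` of the batch) is an assumption.
    """
    if len(log_probs) != len(rewards) or not rewards:
        raise ValueError("log_probs and rewards must be non-empty and equal length")

    # Adaptive threshold: the reward at the cut-off rank of this batch.
    ranked = sorted(rewards, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    threshold = ranked[n_keep - 1]

    # Reject trajectories whose reward falls below the threshold.
    kept = [(lp, r) for lp, r in zip(log_probs, rewards) if r >= threshold]

    # REINFORCE with a baseline: minimise -(R - b) * log pi(sequence).
    baseline = mean(rewards)
    loss = -mean([(r - baseline) * lp for lp, r in kept])
    return loss, threshold


loss, threshold = reinforce_loss(
    log_probs=[-1.0, -2.0, -3.0, -4.0],
    rewards=[0.9, 0.1, 0.7, 0.3],
)
print(threshold)  # 0.7
```

Subtracting the batch-mean baseline leaves the expected gradient unchanged while reducing its variance, which is the usual reason for including it.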
+ ## 2 · Intended uses & scope
+
+ | Stage | Example use-case | Not a good fit |
+ |-------|------------------|----------------|
+ | *Hit finding* | Rapidly scaffold-hop around a weak binder identified by docking. | Predicting absolute IC₅₀/Kᵢ values. |
+ | *Lead optimisation* | Generating analogues that respect Lipinski & BBB guidelines. | Ensuring synthetic accessibility without chemist review. |
+ | *Ideation / teaching* | Demonstrating language-model chemistry in the classroom. | Production-scale enumeration without downstream filtering. |
+
+ ## 3 · Quick start
+
+ > Requires `transformers ≥ 4.42`, `torch ≥ 2.2`, `rdkit`, `accelerate`.
+
+ ```python
+ from transformers import AutoTokenizer, GPT2LMHeadModel
+
+ model_id = "your-org/ProteinSkier"
+ tok = AutoTokenizer.from_pretrained(model_id)
+ model = GPT2LMHeadModel.from_pretrained(model_id)
+
+ # Sample 5 molecules, starting each sequence from the "<bos>" prompt
+ prompt = tok("<bos>", return_tensors="pt").input_ids
+ gen = model.generate(
+     prompt.repeat(5, 1),  # one copy of the prompt per sample
+     max_length=128,       # cap on total tokens, prompt included
+     do_sample=True,       # sample instead of greedy decoding
+     top_p=0.95,           # nucleus sampling
+     temperature=0.7,
+ )
+ smiles = tok.batch_decode(gen, skip_special_tokens=True)
+ print("\n".join(smiles))
+ ```
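Sampled strings are not guaranteed to be valid, unique, or new, so some downstream filtering is usually needed. The sketch below shows one way to do it under stated assumptions: `filter_generated` is a hypothetical helper, the validity check is passed in as a callable, and novelty is tested by exact string match against a set of known SMILES rather than the training-set hash table mentioned earlier.

```python
def filter_generated(smiles, known, is_valid):
    """Keep valid, previously unseen SMILES, preserving first-seen order.

    smiles:   iterable of generated strings.
    known:    set of SMILES to exclude (e.g. training-set molecules).
    is_valid: callable returning True for an acceptable SMILES string.
    """
    seen = set()
    kept = []
    for raw in smiles:
        s = raw.strip()
        if not s or s in seen or s in known:
            continue  # skip empties, duplicates, and known molecules
        if not is_valid(s):
            continue  # skip strings that fail the validity check
        seen.add(s)
        kept.append(s)
    return kept


# Toy validator for demonstration only: balanced parentheses.
# With RDKit installed, a more realistic check would be
#   lambda s: Chem.MolFromSmiles(s) is not None   (after `from rdkit import Chem`)
def balanced(s):
    return s.count("(") == s.count(")")


print(filter_generated(["CCO", "CCO", " c1ccccc1 ", "", "C(", "CC"],
                       known={"CC"}, is_valid=balanced))
# ['CCO', 'c1ccccc1']
```

Exact string comparison misses molecules written with different but equivalent SMILES; canonicalising both sides first (for instance with RDKit's `Chem.MolToSmiles`) would make the duplicate and novelty checks more reliable.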
+
+ ## 4 · Limitations & caveats
+
+ - No guaranteed synthesizability – always perform retrosynthetic analysis.
+ - The property estimators used in RFT are fast approximations; wet-lab assay results may differ.
+ - Output may include patented molecules – run IP checks.
+ - The ADMET focus biases chemistry toward oral drugs; the model is unsuitable for agrochemicals or materials.