---
license: mit
tags:
- chemistry
- molecular-generation
- structure-based-design
- transformer
- reinforcement-learning
- gpt-2
datasets:
- HUBioDataLab/DrugGEN-chembl-smiles
- antoinebcx/smiles-molecules-moses
base_model:
- openai-community/gpt2
pipeline_tag: reinforcement-learning
library_name: transformers
---
|
|
|
|
|
# 🏂 ProteinSkier
|
|
|
|
|
**ProteinSkier** is a GPT-2–based language model that “carves fresh lines” through chemical space, producing drug-like SMILES strings with an explicit bias toward **ADMET quality, novelty, and synthesizability**.
|
|
|
|
|
## 1 · Why another generative model?
|
|
|
|
|
Traditional generative models often rediscover known scaffolds or output molecules that fail late-stage ADMET filters. ProteinSkier addresses this by coupling large-scale pre-training on ~2 M curated molecules **with a second-stage Reinforcement Fine-Tuning (RFT)** that rewards:
|
|
|
|
|
| Component | Reward weight (λ) | Source |
|-----------|-------------------|--------|
| **Validity** | hard filter | RDKit sanitisation |
| **QED ↑** | 0.35 | RDKit |
| **Novelty ↑** | 0.25 | training-set hash table |
| **Lipinski pass ↑** | 0.20 | RDKit |
| **logP in [–1, 4]** | 0.10 | RDKit |
| **Predicted tox ↓** | 0.10 | internal classifier |
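
These weights combine into a single scalar reward per molecule. The sketch below shows one way such a composite reward could be assembled with RDKit; the novelty hash table and the internal toxicity classifier are not released, so they appear here as stand-in arguments, and `composite_reward` is an illustrative name, not the repository's API.

```python
from rdkit import Chem
from rdkit.Chem import QED, Crippen, Descriptors, Lipinski

WEIGHTS = {"qed": 0.35, "novelty": 0.25, "lipinski": 0.20, "logp": 0.10, "tox": 0.10}

def composite_reward(smiles, train_hashes, tox_prob=lambda mol: 0.0):
    """Weighted reward for one SMILES string; 0.0 for invalid molecules."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                                    # validity is a hard filter
        return 0.0
    logp = Crippen.MolLogP(mol)
    lipinski_ok = (Lipinski.NumHDonors(mol) <= 5
                   and Lipinski.NumHAcceptors(mol) <= 10
                   and Descriptors.MolWt(mol) <= 500
                   and logp <= 5)
    return (WEIGHTS["qed"] * QED.qed(mol)
            + WEIGHTS["novelty"] * (Chem.MolToSmiles(mol) not in train_hashes)
            + WEIGHTS["lipinski"] * lipinski_ok
            + WEIGHTS["logp"] * (-1.0 <= logp <= 4.0)
            + WEIGHTS["tox"] * (1.0 - tox_prob(mol)))  # lower predicted tox is better
```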
|
|
|
|
|
The policy is updated with *policy-gradient REINFORCE*; low-quality trajectories are rejected via an adaptive threshold (see `FullDatasetRFTTrainer` in the code).
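
For intuition, a minimal REINFORCE step consistent with this description might look like the sketch below. It uses a mean-reward baseline and a fixed rejection threshold in place of the adaptive machinery in `FullDatasetRFTTrainer`, and omits padding-token masking for brevity:

```python
import torch

def reinforce_step(model, tok, optimizer, reward_fn, batch_size=64, threshold=0.3):
    # Sample a batch of trajectories from the current policy.
    bos = tok("<bos>", return_tensors="pt").input_ids
    seqs = model.generate(bos.repeat(batch_size, 1), max_length=128,
                          do_sample=True, top_p=0.95, temperature=0.7,
                          pad_token_id=tok.eos_token_id)
    rewards = torch.tensor([reward_fn(s) for s in
                            tok.batch_decode(seqs, skip_special_tokens=True)])

    # Reject low-quality trajectories (fixed threshold here, adaptive in the trainer).
    keep = rewards > threshold
    if not keep.any():
        return
    seqs, rewards = seqs[keep], rewards[keep]

    # REINFORCE: loss = -(R - baseline) * log p(sequence).
    logits = model(seqs).logits[:, :-1]
    logp = torch.log_softmax(logits, -1).gather(-1, seqs[:, 1:, None]).squeeze(-1)
    advantage = (rewards - rewards.mean()).unsqueeze(1)   # mean-reward baseline
    loss = -(advantage * logp).sum(dim=1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```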
|
|
|
|
|
## 2 · Intended uses & scope
|
|
|
|
|
| Stage | Example use-case | Not a good fit |
|-------|------------------|----------------|
| *Hit finding* | Rapidly scaffold-hop around a weak binder identified by docking. | Predicting absolute IC₅₀/Kᵢ values. |
| *Lead optimisation* | Generating analogues that respect Lipinski & BBB guidelines. | Ensuring synthetic accessibility without chemist review. |
| *Ideation / teaching* | Demonstrating language-model chemistry in the classroom. | Production-scale enumeration without downstream filtering. |
|
|
|
|
|
## 3 · Quick start
|
|
|
|
|
> Requires `transformers ≥ 4.42`, `torch ≥ 2.2`, `rdkit`, `accelerate`.
|
|
|
|
|
```python
from transformers import AutoTokenizer, GPT2LMHeadModel

model_id = "ProteinDance/ProteinSkier"
tok = AutoTokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)
model.eval()

# Generate 5 novel molecules from the <bos> prompt
prompt = tok("<bos>", return_tensors="pt").input_ids
gen = model.generate(
    prompt.repeat(5, 1),            # batch of 5 identical prompts
    max_length=128,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
    pad_token_id=tok.eos_token_id,  # avoids the missing-pad-token warning
)
smiles = tok.batch_decode(gen, skip_special_tokens=True)
print("\n".join(smiles))
```
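
Generated strings are not guaranteed to parse; a quick RDKit sanitisation pass over the `smiles` list above keeps only valid molecules:

```python
from rdkit import Chem

valid = [s for s in smiles if Chem.MolFromSmiles(s) is not None]
print(f"{len(valid)}/{len(smiles)} generated strings are valid molecules")
```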
|
|
## 4 · Limitations & caveats

- No guaranteed synthesizability – always perform retrosynthetic analysis.
- The property estimators used in RFT are fast surrogates; wet-lab assay results may differ.
- Output may include patented molecules – run IP checks before use.
- The ADMET focus biases chemistry toward oral small-molecule drugs; the model is unsuitable for agrochemicals or materials.
|
|
|