McClain committed
Commit 2fd6e3e · verified · 1 parent: 2ed2e2a

Upload 9 files

README.md ADDED
@@ -0,0 +1,258 @@
+ # PlasmidGPT-GRPO: Reinforcement Learning Fine-tuned Plasmid Generator
+
+ [![W&B Run](https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg)](https://wandb.ai/ucl-cssb/PlasmidRL/runs/u3wt9c50)
+
+ **A biologically constrained plasmid design model trained with reinforcement learning to generate functional DNA sequences.**
+
+ This model is a fine-tuned version of [McClain/plasmidgpt-addgene-gpt2](https://huggingface.co/McClain/plasmidgpt-addgene-gpt2) (itself based on the original [PlasmidGPT](https://github.com/lingxusb/PlasmidGPT) by Bin Shao), optimized with **Group Relative Policy Optimization (GRPO)** to generate plasmids that satisfy biological constraints.
+
+ ## 🎯 Key Improvements Over Base Model
+
+ This RL-fine-tuned model has been trained to generate plasmids that:
+
+ - ✅ Contain the **correct numbers** of essential genetic elements (ori, promoters, terminators, markers, CDS)
+ - ✅ Avoid **repeat regions** (repeats ≥50 bp are penalized)
+ - ✅ Are **shorter and more efficient** (compactness is rewarded)
+ - ✅ Maintain **proper gene cassette organization** (promoter → CDS → terminator)
+ - ✅ Achieve up to the maximum **reward score of 1.0** for optimal plasmid design
+
+ ### Reward Structure
+
+ The model was trained with a custom bioinformatics reward function that scores sequences on the following components:
+
+ | Component | Min | Max | Weight | Description |
+ |-----------|-----|-----|--------|-------------|
+ | **Origin of Replication (ori)** | 1 | 1 | 1.5× | Essential for plasmid replication |
+ | **Promoters** | 1 | 1 | 1.0× | Drive gene expression |
+ | **Terminators** | 0 | 2 | 0.5× | Stop transcription |
+ | **Selectable Markers** | 1 | 2 | 1.0× | Antibiotic resistance |
+ | **Coding Sequences (CDS)** | 1 | 5 | 1.0× | Functional genes |
+
+ **Additional Scoring:**
+ - **Repeat Penalty**: −0.1 per repeat region ≥50 bp (including reverse complements)
+ - **Length Bonus**: Rewards shorter, more compact sequences (up to +0.5)
+ - **Location Awareness**: Bonuses for correct gene cassette ordering and proximity
+
+ **Maximum reward:** 1.0 (a perfect plasmid with all constraints satisfied)
+
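+ For intuition, here is a minimal sketch of how a clamped, weighted component score of this shape can be combined and normalized. The function and variable names are illustrative, not the actual `plasmidrl` implementation:
+
+ ```python
+ # Illustrative sketch only: clamp each feature count into its allowed range,
+ # weight it, and normalize so a fully satisfied plasmid scores 1.0.
+
+ def component_score(count: int, lo: int, hi: int) -> float:
+     """1.0 if the feature count falls within [lo, hi], else 0.0."""
+     return 1.0 if lo <= count <= hi else 0.0
+
+ def combined_reward(counts: dict) -> float:
+     # (min, max, weight) per feature class, matching the table above
+     spec = {
+         "ori":        (1, 1, 1.5),
+         "promoter":   (1, 1, 1.0),
+         "terminator": (0, 2, 0.5),
+         "marker":     (1, 2, 1.0),
+         "cds":        (1, 5, 1.0),
+     }
+     total_weight = sum(w for _, _, w in spec.values())
+     raw = sum(w * component_score(counts.get(name, 0), lo, hi)
+               for name, (lo, hi, w) in spec.items())
+     return raw / total_weight
+
+ # Example: one ori, one promoter, one terminator, one marker, two CDS
+ print(combined_reward({"ori": 1, "promoter": 1, "terminator": 1,
+                        "marker": 1, "cds": 2}))  # 1.0
+ ```
+
+ Repeat penalties and the length bonus then adjust this base score, so only a sequence that also avoids long repeats reaches the 1.0 maximum.
+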
+ ## 🚀 Quick Start
+
+ ### Basic Sequence Generation
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "McClain/plasmidgpt-grpo-rl",
+     trust_remote_code=True
+ ).to(device)
+ model.eval()
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     "McClain/plasmidgpt-grpo-rl",
+     trust_remote_code=True
+ )
+
+ # Generate optimized plasmid sequences from a seed sequence
+ start_sequence = 'ATGGCTAGCGAATTC'
+ input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device)
+
+ outputs = model.generate(
+     input_ids,
+     max_length=400,
+     num_return_sequences=5,
+     temperature=0.8,
+     do_sample=True,
+     top_k=50,
+     top_p=0.95,
+     pad_token_id=tokenizer.pad_token_id,
+     eos_token_id=tokenizer.eos_token_id
+ )
+
+ for i, output in enumerate(outputs):
+     sequence = tokenizer.decode(output, skip_special_tokens=True)
+     print(f"Plasmid {i+1}: {len(sequence)} bp")
+ ```
+
+ ### Scoring Generated Plasmids
+
+ To evaluate plasmids with the same reward function used during training:
+
+ ```python
+ # plasmidkit is required for feature annotation:
+ # pip install plasmidkit
+ # plasmidrl ships with the training repo: github.com/McClain-Thiel/PlasmidRL
+
+ from plasmidrl.rewards import Scorer, RewardConfig
+
+ # Use the same config as training
+ reward_config = RewardConfig(
+     punish_mode=True,
+     length_reward_mode=False,
+     repeat_penalty_enabled=True,
+     repeat_min_length=50,
+     repeat_penalty_per_region=0.1,
+     ori_min=1, ori_max=1, ori_weight=1.5,
+     promoter_min=1, promoter_max=1, promoter_weight=1.0,
+     terminator_min=0, terminator_max=2, terminator_weight=0.5,
+     marker_min=1, marker_max=2, marker_weight=1.0,
+     cds_min=1, cds_max=5, cds_weight=1.0,
+     location_aware=True
+ )
+
+ scorer = Scorer(reward_config)
+
+ # `generated_sequence` is a DNA string, e.g. one produced in the example above
+ score, components = scorer.score(generated_sequence)
+
+ print(f"Reward Score: {score:.3f}")
+ print(f"Components: {components}")
+ ```
+
+ ## 📊 Training Details
+
+ ### Training Configuration
+
+ - **Base Model**: [McClain/plasmidgpt-addgene-gpt2](https://huggingface.co/McClain/plasmidgpt-addgene-gpt2)
+ - **RL Algorithm**: GRPO (Group Relative Policy Optimization); see the sketch after this list
+ - **Training Steps**: 2,500
+ - **Training Repository**: [PlasmidRL](https://github.com/McClain-Thiel/PlasmidRL)
+ - **W&B Run**: [u3wt9c50](https://wandb.ai/ucl-cssb/PlasmidRL/runs/u3wt9c50)
+
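+ In TRL terms, this amounts to plugging the bioinformatics scorer into `GRPOTrainer` as a reward function. The following is a hypothetical sketch, not the actual training script: the dataset handling, `prompt_dataset`, and hyperparameters are assumptions (see the W&B run for the real configuration):
+
+ ```python
+ # Hypothetical sketch: wiring a sequence-level reward into TRL's GRPOTrainer.
+ # `scorer` is configured as in "Scoring Generated Plasmids" above;
+ # `prompt_dataset` (rows with a "prompt" column of seed sequences) is assumed.
+ from trl import GRPOConfig, GRPOTrainer
+
+ def plasmid_reward(completions, **kwargs):
+     # One scalar reward per generated sequence in the sampled group
+     return [scorer.score(seq)[0] for seq in completions]
+
+ trainer = GRPOTrainer(
+     model="McClain/plasmidgpt-addgene-gpt2",
+     reward_funcs=plasmid_reward,
+     args=GRPOConfig(output_dir="plasmidgpt-grpo", max_steps=2500),
+     train_dataset=prompt_dataset,
+ )
+ trainer.train()
+ ```
+
+ GRPO samples a group of completions per prompt and uses within-group reward differences as the advantage signal, which is why a single scalar score per sequence suffices.
+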
+ ### Model Architecture
+
+ | Parameter | Value |
+ |-----------|-------|
+ | **Architecture** | GPT-2 (decoder-only Transformer) |
+ | **Parameters** | 110 million |
+ | **Layers** | 12 |
+ | **Hidden Size** | 768 |
+ | **Attention Heads** | 12 |
+ | **Context Length** | 2048 tokens |
+ | **Vocabulary Size** | 30,002 |
+
+ ### Framework Versions
+
+ - **TRL**: 0.23.1
+ - **Transformers**: 4.57.0
+ - **PyTorch**: 2.8.0
+ - **Datasets**: 4.1.1
+ - **Tokenizers**: 0.22.1
+
+ ## 🧬 Use Cases
+
+ 1. **Optimized Plasmid Design**: Generate plasmids that satisfy specific biological constraints
+ 2. **Synthetic Biology**: Create novel genetic constructs for molecular cloning
+ 3. **Gene Cassette Engineering**: Design properly organized promoter-CDS-terminator cassettes
+ 4. **Compact Plasmid Construction**: Generate shorter plasmids while maintaining functionality
+ 5. **Repeat-Free Sequences**: Avoid problematic repeat regions in plasmid design
+
+ ## 🔗 Related Resources
+
+ ### Original PlasmidGPT
+
+ This model builds upon the original PlasmidGPT work:
+
+ - **Paper**: [PlasmidGPT: a generative framework for plasmid design and annotation](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1) (bioRxiv 2024.09.30.615762)
+ - **Author**: Bin Shao (lingxusb)
+ - **Original Repository**: [github.com/lingxusb/PlasmidGPT](https://github.com/lingxusb/PlasmidGPT)
+ - **Original Model**: [huggingface.co/lingxusb/PlasmidGPT](https://huggingface.co/lingxusb/PlasmidGPT)
+
+ ### Training Infrastructure
+
+ - **Training Code**: [github.com/McClain-Thiel/PlasmidRL](https://github.com/McClain-Thiel/PlasmidRL)
+ - **W&B Project**: [ucl-cssb/PlasmidRL](https://wandb.ai/ucl-cssb/PlasmidRL)
+ - **Base Model**: [McClain/plasmidgpt-addgene-gpt2](https://huggingface.co/McClain/plasmidgpt-addgene-gpt2)
+
+ ## 📚 Citations
+
+ If you use this model, please cite:
+
+ ### This RL Model
+
+ ```bibtex
+ @misc{thiel2024plasmidgpt_grpo,
+   title={PlasmidGPT-GRPO: Reinforcement Learning for Functional Plasmid Design},
+   author={Thiel, McClain},
+   year={2024},
+   howpublished={\url{https://github.com/McClain-Thiel/PlasmidRL}},
+   note={Training run: https://wandb.ai/ucl-cssb/PlasmidRL/runs/u3wt9c50}
+ }
+ ```
+
+ ### Original PlasmidGPT
+
+ ```bibtex
+ @article{shao2024plasmidgpt,
+   title={PlasmidGPT: a generative framework for plasmid design and annotation},
+   author={Shao, Bin and others},
+   journal={bioRxiv},
+   year={2024},
+   doi={10.1101/2024.09.30.615762},
+   url={https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1}
+ }
+ ```
+
+ ### GRPO Algorithm
+
+ ```bibtex
+ @article{shao2024deepseekmath,
+   title={{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
+   author={Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
+   journal={arXiv preprint arXiv:2402.03300},
+   year={2024}
+ }
+ ```
+
+ ### TRL Library
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+   title={{TRL: Transformer Reinforcement Learning}},
+   author={Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
+   year={2020},
+   publisher={GitHub},
+   howpublished={\url{https://github.com/huggingface/trl}}
+ }
+ ```
+
+ ## ⚙️ Technical Details
+
+ ### Reward Function Components
+
+ The bioinformatics reward function (`src/rewards/bioinformatics/scorer.py`) includes:
+
+ 1. **Feature Counting**: Uses [PlasmidKit](https://github.com/jbloomlab/plasmidkit) for automated annotation
+ 2. **Overlap Merging**: Merges overlapping features (80% overlap threshold)
+ 3. **CDS Filtering**: Removes CDS annotations that overlap with ori/promoter/terminator/marker features
+ 4. **Strand Awareness**: Considers strand orientation for gene cassette scoring
+ 5. **Repeat Detection**: Finds direct and reverse-complement repeats using k-mer indexing (see the sketch after this list)
+ 6. **Proximity Scoring**: Rewards features within 300 bp of each other for proper cassette formation
+
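+ A minimal sketch of the k-mer-indexed repeat search described in item 5, under the ≥50 bp constraint from the reward table. The helper names are illustrative, not the `plasmidrl` implementation:
+
+ ```python
+ # Illustrative sketch: index every 50-bp window of the sequence, then report
+ # position pairs that match directly or as reverse complements.
+
+ def revcomp(seq: str) -> str:
+     return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
+
+ def repeat_seed_pairs(seq: str, min_len: int = 50) -> set:
+     index = {}
+     for i in range(len(seq) - min_len + 1):
+         index.setdefault(seq[i:i + min_len], []).append(i)
+     hits = set()
+     for i in range(len(seq) - min_len + 1):
+         window = seq[i:i + min_len]
+         for j in index.get(window, []):           # direct repeats
+             if j > i:
+                 hits.add((i, j))
+         for j in index.get(revcomp(window), []):  # inverted repeats
+             if j != i:
+                 hits.add((min(i, j), max(i, j)))
+     return hits
+ ```
+
+ A full implementation would merge overlapping seed pairs into maximal repeat regions before applying the −0.1 per-region penalty.
+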
+ ### Training Hyperparameters
+
+ View the complete hyperparameters and metrics on [W&B](https://wandb.ai/ucl-cssb/PlasmidRL/runs/u3wt9c50).
+
+ ## ⚠️ Important Notes
+
+ - **Research Use Only**: Generated plasmids should be validated before experimental use
+ - **Annotation Dependency**: Scoring requires `plasmidkit` for feature annotation
+ - **Compute Requirements**: A GPU is recommended for generation (a CPU fallback is available)
+ - **Sequence Validation**: Always verify that generated sequences contain the expected features
+
+ ## 📄 License
+
+ This model inherits its license from the original PlasmidGPT repository. Please refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for details.
+
+ ## 🙏 Acknowledgments
+
+ - **Bin Shao (lingxusb)** for the original PlasmidGPT model and architecture
+ - **Addgene** for providing the training data (153k plasmid sequences)
+ - **HuggingFace TRL team** for the GRPO implementation
+ - **UCL CSSB** for computational resources
+
+ ---
+
+ **Model Version**: grpo-production-20251110_132247
+ **Training Date**: November 10, 2025
+ **Last Updated**: November 13, 2025
config.json ADDED
@@ -0,0 +1,39 @@
+ {
+   "activation_function": "gelu_new",
+   "architectures": [
+     "GPT2LMHeadModel"
+   ],
+   "attn_pdrop": 0.1,
+   "bos_token_id": 30000,
+   "dtype": "float32",
+   "embd_pdrop": 0.1,
+   "eos_token_id": 30001,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-05,
+   "model_type": "gpt2",
+   "n_ctx": 2048,
+   "n_embd": 768,
+   "n_head": 12,
+   "n_inner": null,
+   "n_layer": 12,
+   "n_positions": 2048,
+   "pad_token_id": 3,
+   "reorder_and_upcast_attn": false,
+   "resid_pdrop": 0.1,
+   "scale_attn_by_inverse_layer_idx": false,
+   "scale_attn_weights": true,
+   "summary_activation": null,
+   "summary_first_dropout": 0.1,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "task_specific_params": {
+     "text-generation": {
+       "do_sample": true,
+       "max_length": 50
+     }
+   },
+   "transformers_version": "4.57.0",
+   "use_cache": true,
+   "vocab_size": 30002
+ }
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 30000,
+   "eos_token_id": [
+     30001
+   ],
+   "pad_token_id": 3,
+   "transformers_version": "4.57.0"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5e290ca1ff16f34af23f74de1d660398209b66fad9fea9ba6065f5b1426ce1eb
+ size 235269120
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "pad_token": "[PAD]"
+ }
test_generation.py ADDED
@@ -0,0 +1,51 @@
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+ print(f"Using device: {device}\n")
+
+ # Smoke test: load the checkpoint from the current directory (this repo's root)
+ print("Loading RL-optimized PlasmidGPT-GRPO model...")
+ model = AutoModelForCausalLM.from_pretrained(
+     ".",
+     trust_remote_code=True
+ ).to(device)
+ model.eval()
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     ".",
+     trust_remote_code=True
+ )
+
+ print("Generating optimized plasmid sequences...\n")
+
+ start_sequence = 'ATGGCTAGCGAATTCGGCGCGCCT'
+ print(f"Start sequence: {start_sequence}\n")
+
+ input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device)
+
+ outputs = model.generate(
+     input_ids,
+     max_length=400,
+     num_return_sequences=3,
+     temperature=0.8,
+     do_sample=True,
+     top_k=50,
+     top_p=0.95,
+     pad_token_id=tokenizer.pad_token_id,
+     eos_token_id=tokenizer.eos_token_id
+ )
+
+ print("=" * 80)
+ for i, output in enumerate(outputs, 1):
+     sequence = tokenizer.decode(output, skip_special_tokens=True)
+     print(f"\nPlasmid {i}:")
+     print(f"  Length: {len(sequence)} bp")
+     print(f"  First 100 bp: {sequence[:100]}")
+     print(f"  Last 100 bp: {sequence[-100:]}")
+     print("\n" + "=" * 80)
+
+ print("\nNote: These sequences are generated by an RL-optimized model trained to:")
+ print("  ✓ Include proper genetic elements (ori, promoters, CDS, markers)")
+ print("  ✓ Avoid repeat regions ≥ 50 bp")
+ print("  ✓ Generate compact, functional plasmids")
+ print("  ✓ Organize genes in proper cassettes (promoter → CDS → terminator)")
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,71 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "30000": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "30001": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "max_length": null,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "left",
+   "tokenizer_class": "PreTrainedTokenizerFast"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e38dae2f73a0f51976b1a463bd135d46624f945c6fd07f96a168b9f33e315d7
+ size 7377