# PlasmidRL — ICML Revision Data
Experimental data for: **Effects of Structural Reward Shaping on Biophysical Properties in RL-Trained Plasmid Generators**
## Best Model Configuration
**UCL-CSSB/PlasmidGPT-GRPO at temperature=1.0** achieves the best quality-diversity tradeoff:
- **71.6% QC pass rate** (ORI + AMR + no repeats)
- **0.573 diversity** (1 − mean pairwise Jaccard of 21-mer MinHash)
- **−0.149 kcal/mol/nt MFE density** (DNA parameters)
- Mean sequence length: 6,517 bp
A separately trained `McClain/PlasmidGPT-RL` run at temperature=0.95 underperforms
(53.7% pass rate, 0.132 diversity) and is retained only as an ablation baseline.
It is **not** the model used for any headline result in the paper.
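
The diversity figure quoted above is defined as 1 − mean pairwise Jaccard over 21-mer MinHash sketches. A minimal sketch of that computation follows, using only the Python standard library. The sketch size (`num_perm=128`), the salted BLAKE2b hash, and all function names are illustrative assumptions, not the pipeline's actual implementation.

```python
import hashlib
from itertools import combinations


def kmer_set(seq, k=21):
    """Return the set of k-mers in seq (upper-cased; empty if len(seq) < k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq, k=21, num_perm=128):
    """Keep the minimum hash per salted hash function (num_perm is assumed)."""
    kmers = kmer_set(seq, k)
    if not kmers:
        raise ValueError("sequence is shorter than k")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salt per hash function
        signature.append(min(
            hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest()
            for km in kmers
        ))
    return signature


def jaccard_estimate(sig_a, sig_b):
    """Fraction of signature slots that agree; estimates the Jaccard index."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def diversity(seqs, k=21, num_perm=128):
    """1 - mean pairwise estimated Jaccard over all unordered pairs."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    sigs = [minhash_signature(s, k, num_perm) for s in seqs]
    pairs = list(combinations(sigs, 2))
    mean_jaccard = sum(jaccard_estimate(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_jaccard
```

Under this definition, a value of 1.000 means that no pair of sequences shares any sampled 21-mer minimum, and a value near 0 means the generations are near-duplicates of one another.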
## Bucket Structure
```
analysis/ Summary metric CSVs
├── full_ablation_metrics.csv 8-model ablation comparison
├── baselines_qc_metrics.csv rejection sampling QC results
├── rl_per_prompt_metrics.csv RL per-prompt breakdown
├── rl_temp_sweep_final.csv temperature sweep (RL)
└── mfe_summary_all.csv MFE across all models
baselines/ Rejection sampling raw sequences
├── rejection_sampling/{Base,SFT,GRPO}/ 10K samples each + metadata
└── best_of_16/{Base,SFT,GRPO}/ 16K samples each + metadata
eval_8prompt/ Main 8-prompt evaluation, T=1.0
└── {Base,SFT}/
├── *_metrics.csv per-sequence metrics (4,000 rows)
└── *_summary.json aggregate stats
generations/ Multi-temperature generation
├── temp_0.8/{Base,RL,RL_cds_only,RL_length_only,
│ RL_no_cassette,RL_no_length,RL_no_repeat}/
├── temp_0.95/{Base,SFT,RL,RL_cds_only,RL_length_only,
│ RL_no_cassette,RL_no_length,RL_no_repeat}/
└── temp_1.1/{Base,RL,RL_cds_only,RL_length_only,
RL_no_cassette,RL_no_length,RL_no_repeat}/
NOTE: SFT generations exist only at temp_0.95
(resampled 2026-04-21). Earlier uploads at temp_0.8
and temp_1.1 were byte-identical duplicates of the
corresponding Base outputs and have been deleted.
generations_sweep/ Temperature sweep (RL + GRPO)
├── temp_{0.3,0.5,0.7}/{RL,GRPO}/
├── temp_{0.9}/{GRPO}/
└── temp_{1.0}/{RL,GRPO}/
mfe/ ViennaRNA MFE density (DNA + RNA params)
└── {Base,SFT,RL,RL_cds_only,RL_length_only,RL_no_cassette,
RL_no_length,RL_no_repeat,GRPO_temp1.0,GRPO_temp0.9}/
├── mfe_results.csv per-sequence MFE (both RNA and DNA params)
└── mfe_summary.json mean ± std
qc_results/ QC for ablation study (temp=0.95, 8 models)
└── {Base,SFT,RL,...}/passed.csv, failed.csv, aggregate_*.csv, repeats.csv
qc_baselines/ QC for rejection sampling
└── {rejection_sampling,best_of_16}/{Base,SFT,GRPO}/
qc_sweep/ QC for temperature sweep
└── temp_{0.3,...,1.0}/{RL_vllm,GRPO}/
original_paper/ Frozen snapshots from the pre-revision draft
```
## Key Results
### MFE Density (DNA parameters, kcal/mol/nt)
All MFE values were computed with ViennaRNA 2.7.2 using the Mathews 2004 DNA energy
parameters. Short sequences (≤3 kb) were folded as circular molecules; longer sequences
were folded in 500-bp sliding windows (stride 250 bp).
| Model | DNA MFE density | n | Notes |
|---|---:|---:|---|
| Base | −0.1055 ± 0.0756 | 4000 | baseline, T=0.95 |
| SFT | −0.1492 ± 0.0284 | 4000 | T=0.95 (resampled 2026-04-21; previous upload was a Base duplicate) |
| **GRPO (temp=1.0)** | **−0.1491 ± 0.0322** | 4000 | main paper model |
| RL (full, McClain/PlasmidGPT-RL) | −0.1546 ± 0.0233 | 4000 | ablation-control run, T=0.95 |
| RL (no repeat) | −0.141 ± 0.031 | 4000 | T=0.95 |
| RL (no cassette) | −0.134 ± 0.048 | 4000 | T=0.95 |
| RL (no length) | −0.131 ± 0.025 | 4000 | T=0.95 |
| RL (length only) | −0.126 ± 0.021 | 4000 | T=0.95 |
| RL (CDS only) | −0.103 ± 0.022 | 4000 | T=0.95 |
Addgene reference panel (n=500): **−0.151 ± 0.027**.
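
The folding scheme described at the top of this section (circular folding up to 3 kb, 500-bp windows with a 250-bp stride beyond that) can be sketched as below. This README does not state how per-window energies are aggregated or how the tail of a sequence is handled, so the sketch makes two assumptions: the density is the summed window MFE divided by the summed window length, and a final window is added flush with the 3' end when the stride does not land there. The `fold` callable is a hypothetical stand-in for whatever wrapper calls ViennaRNA; it takes a sequence and a circular flag and returns the MFE in kcal/mol.

```python
from typing import Callable, List, Tuple


def windows(length: int, size: int = 500, stride: int = 250) -> List[Tuple[int, int]]:
    """Half-open (start, end) windows covering a sequence of the given length."""
    if length <= size:
        return [(0, length)]
    spans = [(s, s + size) for s in range(0, length - size + 1, stride)]
    if spans[-1][1] < length:
        # Assumption: add one last window so the tail is not dropped.
        spans.append((length - size, length))
    return spans


def mfe_density(
    seq: str,
    fold: Callable[[str, bool], float],
    circular_max: int = 3000,
    size: int = 500,
    stride: int = 250,
) -> float:
    """MFE per nucleotide (kcal/mol/nt) under the assumed aggregation."""
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    if n <= circular_max:
        # Short sequences: one circular fold over the whole molecule.
        return fold(seq, True) / n
    spans = windows(n, size, stride)
    total_energy = sum(fold(seq[s:e], False) for s, e in spans)
    total_nt = sum(e - s for s, e in spans)
    return total_energy / total_nt
```

Keeping the folding backend behind a callable makes the windowing logic checkable without ViennaRNA installed; a dummy `fold` that returns a fixed energy per nucleotide should reproduce that same density for any sequence length.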
### Ablation Study (QC pass rate, temp=0.95)
| Model | Pass rate | Diversity |
|-------|----------:|----------:|
| Base | 3.6 % | 1.000 |
| SFT | 19.7 % | — |
| RL (full) | 53.7 % | 0.132 |
| RL (no repeat) | 72.2 % | 0.446 |
| RL (no length) | 71.4 % | 0.446 |
| RL (length only) | 34.7 % | 0.837 |
| RL (no cassette) | 19.8 % | 0.183 |
| RL (CDS only) | 2.4 % | 1.000 |
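
A pass rate like those in the table above could be re-derived from the per-model folders under `qc_results/`. The sketch below assumes that `passed.csv` and `failed.csv` each contain one header row followed by one row per generated sequence; that layout is an assumption and should be checked against the actual files before relying on the result.

```python
import csv
from pathlib import Path


def count_data_rows(path):
    """Count rows after the header (assumes exactly one header line)."""
    with open(path, newline="") as handle:
        n_rows = sum(1 for _ in csv.reader(handle))
    return max(n_rows - 1, 0)


def pass_rate(model_dir):
    """Percentage of sequences in passed.csv out of passed + failed."""
    model_dir = Path(model_dir)
    n_pass = count_data_rows(model_dir / "passed.csv")
    n_fail = count_data_rows(model_dir / "failed.csv")
    total = n_pass + n_fail
    if total == 0:
        raise ValueError(f"no sequences found in {model_dir}")
    return 100.0 * n_pass / total
```

For example, `pass_rate("qc_results/RL")` would return a percentage comparable to the table entry, provided the one-row-per-sequence assumption holds.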
### Main 8-prompt evaluation (temp=1.0) — paper Table 1
| Model | Pass rate |
|-------|----------:|
| Base (UCL-CSSB/PlasmidGPT) | 27.0 % |
| SFT (UCL-CSSB/PlasmidGPT-SFT) | 27.2 % |
| RL (UCL-CSSB/PlasmidGPT-GRPO) | 71.6 % |
### Rejection sampling baselines
| Model | Rejection 10K | Best-of-16 | Diversity |
|-------|--------------:|-----------:|----------:|
| Base | 2.8 % | 2.9 % | 1.000 |
| SFT | 2.5 % | 2.8 % | 1.000 |
| GRPO | 64.6 % | 64.6 % | 0.549–0.581 |
### Sampling parameters
| Experiment | Temperature |
|---|---:|
| Training rollouts (GRPO) | 1.229 |
| Main 8-prompt evaluation (`eval_8prompt/`) | 1.0 |
| Ablation evaluation (`generations/temp_0.95/`) | 0.95 |
| Rejection sampling / best-of-16 (`baselines/`) | 0.95 |
Other decoder settings are constant: top-p 0.90, repetition penalty 1.0,
max 256 BPE tokens (≈ 5–15 kb of DNA), stop token id 2.
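
The settings above can be collected into a single configuration helper, sketched below. The keyword names follow the Hugging Face `transformers` `generate()` convention, but this README does not say which inference engine was used, so treat the mapping as an assumption. The experiment keys are hypothetical labels, and `max_new_tokens` assumes that the 256-token cap counts generated tokens only.

```python
# Temperatures from the table above; the keys are hypothetical labels.
TEMPERATURE = {
    "training_rollouts": 1.229,
    "eval_8prompt": 1.0,
    "ablation": 0.95,
    "baselines": 0.95,
}


def generation_kwargs(experiment):
    """Sampling settings for one experiment; only the temperature varies."""
    return {
        "do_sample": True,
        "temperature": TEMPERATURE[experiment],
        "top_p": 0.90,
        "repetition_penalty": 1.0,
        "max_new_tokens": 256,  # assumption: cap applies to new tokens only
        "eos_token_id": 2,      # stop token id from this README
    }
```

With a `transformers` model this would be used as `model.generate(**inputs, **generation_kwargs("eval_8prompt"))`; other engines use different parameter names for the same settings.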
## Models
| Label | HF repo |
|---|---|
| Base | UCL-CSSB/PlasmidGPT |
| SFT | UCL-CSSB/PlasmidGPT-SFT |
| RL (main paper model) | UCL-CSSB/PlasmidGPT-GRPO |
| RL (ablation control) | McClain/PlasmidGPT-RL |
| Ablations | McClain/plasmidgpt-rl-{cds_only,no_repeat_penalty,no_length_prior,no_cassette_bonus,length_only} |
## QC Pipeline
- BLAST (dc-megablast) against the OriDB reference for ORI detection
- AMRFinderPlus 4.2.7 for ARG detection
- Prodigal 2.6.3 for gene prediction
- Suffix-array repeat detection (≥50 bp)
- Two-stage filter: ORI ≥99 % identity and coverage; AMR 100 % identity and coverage
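
The thresholds listed above can be combined into a single pass/fail decision, sketched below. The sketch assumes that a sequence passes when it has at least one qualifying ORI hit, at least one qualifying AMR hit, and no repeat of 50 bp or longer, matching the "ORI + AMR + no repeats" criterion in the summary. The `Hit` record and function names are hypothetical, the repeat finder is a naive suffix-array stand-in that considers only direct (same-strand) repeats, and parsing of BLAST and AMRFinderPlus output is left out.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Hit:
    identity: float  # percent identity of the alignment
    coverage: float  # percent of the reference covered


def longest_repeat(seq: str) -> int:
    """Length of the longest substring occurring at least twice.

    Naive suffix array: sort the suffix start positions, then take the
    longest common prefix between neighbouring suffixes.
    """
    n = len(seq)
    order = sorted(range(n), key=lambda i: seq[i:])
    best = 0
    for a, b in zip(order, order[1:]):
        length = 0
        while a + length < n and b + length < n and seq[a + length] == seq[b + length]:
            length += 1
        best = max(best, length)
    return best


def passes_qc(
    seq: str,
    ori_hits: Iterable[Hit],
    amr_hits: Iterable[Hit],
    ori_min: float = 99.0,
    amr_min: float = 100.0,
    repeat_min: int = 50,
) -> bool:
    """Apply the ORI, AMR and repeat thresholds from the list above."""
    has_ori = any(h.identity >= ori_min and h.coverage >= ori_min for h in ori_hits)
    has_amr = any(h.identity >= amr_min and h.coverage >= amr_min for h in amr_hits)
    no_long_repeat = longest_repeat(seq) < repeat_min
    return has_ori and has_amr and no_long_repeat
```

The naive suffix sort is quadratic in memory for long inputs, which is acceptable for illustration at plasmid scale; a production pipeline would use a linear-time suffix-array construction.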
## Data history
- **2026-04-21** SFT resampled at T=0.95 (4000 seqs) and MFE recomputed.
The previous uploads at `generations/temp_{0.8,0.95,1.1}/SFT/` were
byte-identical to the corresponding Base outputs (a pipeline bug during
the March upload). The T=0.95 SFT data has been replaced and is now
correct; the 0.8 and 1.1 copies were deleted.
- **2026-04-20** `eval_8prompt/SFT/SFT_metrics.csv` briefly held T=0.95
data during the SFT resample; restored to the original T=1.0 values
(SHA-verified against an older local copy).
