PlasmidRL — ICML Revision Data

Experimental data for: Effects of Structural Reward Shaping on Biophysical Properties in RL-Trained Plasmid Generators

Best Model Configuration

UCL-CSSB/PlasmidGPT-GRPO at temperature=1.0 achieves the best quality-diversity tradeoff:

  • 71.6% QC pass rate (ORI + AMR + no repeats)
  • 0.573 diversity (1 − mean pairwise Jaccard of 21-mer MinHash)
  • −0.149 kcal/mol/nt MFE density (DNA parameters)
  • Mean sequence length: 6,517 bp
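The diversity figure above is defined as 1 − mean pairwise Jaccard over 21-mer MinHash sketches. The following is a minimal sketch of that definition, not the pipeline's actual code: the number of hash functions (128), the use of salted BLAKE2b as the hash family, and the lack of reverse-complement canonicalization are all assumptions made for illustration.

```python
import hashlib
from itertools import combinations


def kmers(seq, k=21):
    """Return the set of k-mers in seq (forward strand only; an assumption)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq, k=21, num_hashes=128):
    """One minimum hash value per salted hash function."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for km in ks
        ))
    return signature


def jaccard_estimate(sig_a, sig_b):
    """Fraction of signature positions that agree, which estimates Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def diversity(seqs, k=21, num_hashes=128):
    """1 - mean pairwise estimated Jaccard similarity over all sequence pairs."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    sigs = [minhash_signature(s, k, num_hashes) for s in seqs]
    pairs = list(combinations(sigs, 2))
    mean_jaccard = sum(jaccard_estimate(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_jaccard
```

Under this definition a set of identical sequences scores 0 and a set with no shared 21-mers scores 1, which is consistent with the 1.000 values reported for models whose outputs share essentially no k-mers.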

A separately trained McClain/PlasmidGPT-RL run at temperature=0.95 underperforms (53.7% pass rate, 0.132 diversity) and is retained only as an ablation baseline. It is not the model used for any headline result in the paper.

Bucket Structure

analysis/                              Summary metric CSVs
├── full_ablation_metrics.csv            8-model ablation comparison
├── baselines_qc_metrics.csv             rejection sampling QC results
├── rl_per_prompt_metrics.csv            RL per-prompt breakdown
├── rl_temp_sweep_final.csv              temperature sweep (RL)
└── mfe_summary_all.csv                  MFE across all models

baselines/                             Rejection sampling raw sequences
├── rejection_sampling/{Base,SFT,GRPO}/  10K samples each + metadata
└── best_of_16/{Base,SFT,GRPO}/          16K samples each + metadata

eval_8prompt/                          Main 8-prompt evaluation, T=1.0
└── {Base,SFT}/
    ├── *_metrics.csv                    per-sequence metrics (4,000 rows)
    └── *_summary.json                   aggregate stats

generations/                           Multi-temperature generation
├── temp_0.8/{Base,RL,RL_cds_only,RL_length_only,
│              RL_no_cassette,RL_no_length,RL_no_repeat}/
├── temp_0.95/{Base,SFT,RL,RL_cds_only,RL_length_only,
│              RL_no_cassette,RL_no_length,RL_no_repeat}/
└── temp_1.1/{Base,RL,RL_cds_only,RL_length_only,
              RL_no_cassette,RL_no_length,RL_no_repeat}/

                  NOTE: SFT generations exist only at temp_0.95
                  (resampled 2026-04-21). Earlier uploads at temp_0.8
                  and temp_1.1 were byte-identical duplicates of the
                  corresponding Base outputs and have been deleted.

generations_sweep/                     Temperature sweep (RL + GRPO)
├── temp_{0.3,0.5,0.7}/{RL,GRPO}/
├── temp_{0.9}/{GRPO}/
└── temp_{1.0}/{RL,GRPO}/

mfe/                                   ViennaRNA MFE density (DNA + RNA params)
└── {Base,SFT,RL,RL_cds_only,RL_length_only,RL_no_cassette,
     RL_no_length,RL_no_repeat,GRPO_temp1.0,GRPO_temp0.9}/
    ├── mfe_results.csv                  per-sequence MFE (both RNA and DNA params)
    └── mfe_summary.json                 mean ± std

qc_results/                            QC for ablation study (temp=0.95, 8 models)
└── {Base,SFT,RL,...}/passed.csv, failed.csv, aggregate_*.csv, repeats.csv

qc_baselines/                          QC for rejection sampling
└── {rejection_sampling,best_of_16}/{Base,SFT,GRPO}/

qc_sweep/                              QC for temperature sweep
└── temp_{0.3,...,1.0}/{RL_vllm,GRPO}/

original_paper/                        Frozen snapshots from the pre-revision draft
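The brace notation in the tree expands to one directory per temperature and model. As a rough illustration, the sketch below enumerates the expected `generations/` directories, including the rule that SFT exists only at temp_0.95. The function name and the ordering of models are hypothetical; only the directory names come from the listing above.

```python
# Models present at every temperature under generations/ (from the tree above).
ABLATION_MODELS = [
    "Base", "RL", "RL_cds_only", "RL_length_only",
    "RL_no_cassette", "RL_no_length", "RL_no_repeat",
]


def generation_dirs():
    """List the expected generations/ subdirectories (hypothetical helper)."""
    dirs = []
    for temp in ("0.8", "0.95", "1.1"):
        models = list(ABLATION_MODELS)
        if temp == "0.95":
            # SFT generations exist only at temp_0.95.
            models.insert(1, "SFT")
        dirs.extend(f"generations/temp_{temp}/{model}" for model in models)
    return dirs


if __name__ == "__main__":
    for path in generation_dirs():
        print(path)
```

A listing like this can be compared against the actual bucket contents to catch missing or unexpected directories.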

Key Results

MFE Density (DNA parameters, kcal/mol/nt)

All MFE numbers were computed with ViennaRNA 2.7.2 using the Mathews 2004 DNA energy parameters. Short sequences (≤3 kb) were folded as circular molecules; longer sequences were folded in 500-bp sliding windows (stride 250 bp).
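The folding scheme can be sketched as follows. This is an illustration under stated assumptions, not the code that produced the tables: `fold` is a hypothetical callable standing in for the ViennaRNA call (it takes a sequence and a circular flag and returns an MFE in kcal/mol), the final window is placed flush with the sequence end, and window energies are combined as total energy over total folded nucleotides. The README does not state how windows are aggregated.

```python
def window_spans(length, window=500, stride=250):
    """Return (start, end) spans covering a linear sequence of the given length."""
    if length <= window:
        return [(0, length)]
    starts = list(range(0, length - window + 1, stride))
    # Assumption: add one last window flush with the end so no tail is skipped.
    if starts[-1] + window < length:
        starts.append(length - window)
    return [(s, s + window) for s in starts]


def mfe_density(seq, fold, window=500, stride=250, circular_max=3000):
    """MFE per nucleotide (kcal/mol/nt) using the short/long split described above.

    fold(subseq, circular) is a hypothetical callable returning MFE in kcal/mol.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    if n <= circular_max:
        # Short sequences: one circular fold of the whole molecule.
        return fold(seq, True) / n
    spans = window_spans(n, window, stride)
    total_energy = sum(fold(seq[s:e], False) for s, e in spans)
    total_nt = sum(e - s for s, e in spans)
    # Assumption: density is total window energy over total folded nucleotides.
    return total_energy / total_nt
```

Passing the folding routine in as a parameter keeps the windowing logic testable without ViennaRNA installed.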

Model                              DNA MFE density     n      Notes
Base                               −0.1055 ± 0.0756    4000   baseline, T=0.95
SFT                                −0.1492 ± 0.0284    4000   T=0.95 (resampled 2026-04-21; previous upload was a Base duplicate)
GRPO (temp=1.0)                    −0.1491 ± 0.0322    4000   main paper model
RL (full, McClain/PlasmidGPT-RL)   −0.1546 ± 0.0233    4000   ablation-control run, T=0.95
RL (no repeat)                     −0.141 ± 0.031      4000   T=0.95
RL (no cassette)                   −0.134 ± 0.048      4000   T=0.95
RL (no length)                     −0.131 ± 0.025      4000   T=0.95
RL (length only)                   −0.126 ± 0.021      4000   T=0.95
RL (CDS only)                      −0.103 ± 0.022      4000   T=0.95

Addgene reference panel (n=500): −0.151 ± 0.027.

Ablation Study (QC pass rate, temp=0.95)

Model               Pass rate   Diversity
Base                3.6 %       1.000
SFT                 19.7 %
RL (full)           53.7 %      0.132
RL (no repeat)      72.2 %      0.446
RL (no length)      71.4 %      0.446
RL (length only)    34.7 %      0.837
RL (no cassette)    19.8 %      0.183
RL (CDS only)       2.4 %       1.000

Main 8-prompt evaluation (temp=1.0) — paper Table 1

Model                            Pass rate
Base (UCL-CSSB/PlasmidGPT)       27.0 %
SFT (UCL-CSSB/PlasmidGPT-SFT)    27.2 %
RL (UCL-CSSB/PlasmidGPT-GRPO)    71.6 %

Rejection sampling baselines

Model   Rejection 10K   Best-of-16   Diversity
Base    2.8 %           2.9 %        1.000
SFT     2.5 %           2.8 %        1.000
GRPO    64.6 %          64.6 %       0.549–0.581

Sampling parameters

Experiment                                       Temperature
Training rollouts (GRPO)                         1.229
Main 8-prompt evaluation (eval_8prompt/)         1.0
Ablation evaluation (generations/temp_0.95/)     0.95
Rejection sampling / best-of-16 (baselines/)     0.95

Other decoder settings are constant: top-p 0.90, repetition penalty 1.0, max 256 BPE tokens (≈ 5–15 kb of DNA), stop token id 2.
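For reproduction, the settings above could be collected into one mapping per experiment. The sketch below uses the keyword names accepted by Hugging Face `generate` (`temperature`, `top_p`, `repetition_penalty`, `max_new_tokens`, `eos_token_id`, `do_sample`). Mapping "max 256 BPE tokens" to `max_new_tokens` rather than `max_length`, mapping the stop token to `eos_token_id`, and the experiment keys themselves are assumptions; the README does not say which inference library or argument names were used.

```python
# Settings shared by every experiment (values from the README).
SHARED_SETTINGS = {
    "do_sample": True,          # assumption: sampling must be on for temperature/top-p
    "top_p": 0.90,
    "repetition_penalty": 1.0,
    "max_new_tokens": 256,      # assumption: could instead be max_length
    "eos_token_id": 2,          # assumption: "stop token id 2" used as EOS
}

# Per-experiment temperatures (keys are hypothetical labels).
TEMPERATURES = {
    "training_rollouts_grpo": 1.229,
    "eval_8prompt": 1.0,
    "ablation": 0.95,
    "baselines": 0.95,
}


def generation_kwargs(experiment):
    """Return sampling keyword arguments for one experiment."""
    return {**SHARED_SETTINGS, "temperature": TEMPERATURES[experiment]}


if __name__ == "__main__":
    print(generation_kwargs("eval_8prompt"))
```

Keeping the shared settings in a single place makes it harder for one experiment to drift from the others by accident.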

Models

Label                    HF repo
Base                     UCL-CSSB/PlasmidGPT
SFT                      UCL-CSSB/PlasmidGPT-SFT
RL (main paper model)    UCL-CSSB/PlasmidGPT-GRPO
RL (ablation control)    McClain/PlasmidGPT-RL
Ablations                McClain/plasmidgpt-rl-{cds_only,no_repeat_penalty,no_length_prior,no_cassette_bonus,length_only}

QC Pipeline

  • BLAST (dc-megablast) against the OriDB reference for ORI detection
  • AMRFinderPlus 4.2.7 for ARG detection
  • Prodigal 2.6.3 for gene prediction
  • Suffix-array repeat detection (≥50 bp)
  • Two-stage filter: ORI ≥99 % identity and coverage; AMR 100 % identity and coverage
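The pass criterion (ORI + AMR + no repeats) with the thresholds listed above can be sketched as a single predicate. The `Hit` record and the function signature are hypothetical; in the real pipeline the identity and coverage values would come from BLAST and AMRFinderPlus output and the repeat length from the suffix-array step. How Prodigal gene predictions feed into the decision is not stated here, so they are left out.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One ORI or AMR hit (hypothetical record; values are percentages)."""
    identity: float
    coverage: float


def passes_qc(ori_hits, amr_hits, longest_repeat_bp,
              ori_min=99.0, amr_min=100.0, repeat_min_bp=50):
    """Return True if a sequence has an ORI hit, an AMR hit, and no long repeat."""
    has_ori = any(
        h.identity >= ori_min and h.coverage >= ori_min for h in ori_hits
    )
    has_amr = any(
        h.identity >= amr_min and h.coverage >= amr_min for h in amr_hits
    )
    # A repeat of repeat_min_bp or longer fails the sequence.
    no_repeats = longest_repeat_bp < repeat_min_bp
    return has_ori and has_amr and no_repeats


if __name__ == "__main__":
    ok = passes_qc([Hit(99.5, 100.0)], [Hit(100.0, 100.0)], longest_repeat_bp=20)
    print(ok)
```

Expressing the thresholds as default arguments makes it straightforward to rerun the filter with relaxed cutoffs.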

Data history

  • 2026-04-21 SFT resampled at T=0.95 (4000 seqs) and MFE recomputed. Previous uploads at generations/temp_{0.8,0.95,1.1}/SFT/ were byte-identical to the corresponding Base outputs (pipeline bug during the March upload). The resampled T=0.95 SFT data replaces the duplicate at that temperature; the 0.8 and 1.1 copies were deleted.
  • 2026-04-20 eval_8prompt/SFT/SFT_metrics.csv briefly held T=0.95 data during the SFT resample; restored to the original T=1.0 values (SHA-verified against an older local copy).
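Both incidents above come down to comparing file contents by hash. A minimal sketch of such a check follows; it assumes SHA-256 (the README says only "SHA-verified"), and the helper names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group paths whose contents are byte-identical (same digest)."""
    by_hash = {}
    for path in paths:
        by_hash.setdefault(sha256_of(path), []).append(str(path))
    return [group for group in by_hash.values() if len(group) > 1]
```

Running `find_duplicates` over the per-model output files for one temperature would flag a case where an SFT file is identical to its Base counterpart.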
