UCL-CSSB/PlasmidRL-ICML / deprecated /README_v1.md

McClain

16 days ago

	# PlasmidRL — ICML Revision Data

	Experimental data for: Effects of Structural Reward Shaping on Biophysical Properties in RL-Trained Plasmid Generators

	## Best Model Configuration

	UCL-CSSB/PlasmidGPT-GRPO at temperature=1.0 achieves the best quality-diversity tradeoff:
	- 71.6% QC pass rate (ORI + AMR + no repeats)
	- 0.573 diversity (1 − mean pairwise Jaccard of 21-mer MinHash)
	- −0.149 kcal/mol/nt MFE density (DNA parameters)
	- Mean sequence length: 6,517 bp

	A separately trained `McClain/PlasmidGPT-RL` run at temperature=0.95 underperforms
	(53.7% pass rate, 0.132 diversity) and is retained only as an ablation baseline.
	It is not the model used for any headline result in the paper.
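
The diversity figure above (1 − mean pairwise Jaccard of 21-mer MinHash) can be illustrated with a minimal, stdlib-only sketch. The number of hash functions, the use of salted BLAKE2b, and the absence of reverse-complement canonicalisation are assumptions made for illustration; the pipeline's actual MinHash settings are not stated in this README.

```python
import hashlib
from itertools import combinations


def kmers(seq: str, k: int = 21) -> set[str]:
    """Return the set of distinct k-mers in a sequence (no reverse complements)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq: str, k: int = 21, num_hashes: int = 64) -> list[int]:
    """One minimum hash value per salted hash function (num_hashes is an assumption)."""
    kms = kmers(seq, k)
    if not kms:
        raise ValueError(f"sequence shorter than k={k}")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for km in kms
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature slots that agree, an estimate of k-mer set Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def diversity(seqs: list[str], k: int = 21, num_hashes: int = 64) -> float:
    """1 minus the mean pairwise estimated Jaccard similarity."""
    sigs = [minhash_signature(s, k, num_hashes) for s in seqs]
    pairs = list(combinations(sigs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return 1.0 - sum(estimated_jaccard(a, b) for a, b in pairs) / len(pairs)
```

Under this definition a set of identical sequences scores 0 and a set with no shared 21-mers scores 1, which matches the 1.000 entries reported for Base in the tables below.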

	## Bucket Structure

	```
	analysis/                            Summary metric CSVs
	├── full_ablation_metrics.csv        8-model ablation comparison
	├── baselines_qc_metrics.csv         rejection sampling QC results
	├── rl_per_prompt_metrics.csv        RL per-prompt breakdown
	├── rl_temp_sweep_final.csv          temperature sweep (RL)
	└── mfe_summary_all.csv              MFE across all models

	baselines/                           Rejection sampling raw sequences
	├── rejection_sampling/{Base,SFT,GRPO}/   10K samples each + metadata
	└── best_of_16/{Base,SFT,GRPO}/           16K samples each + metadata

	eval_8prompt/                        Main 8-prompt evaluation, T=1.0
	└── {Base,SFT}/
	    ├── *_metrics.csv                per-sequence metrics (4,000 rows)
	    └── *_summary.json               aggregate stats

	generations/                         Multi-temperature generation
	├── temp_0.8/{Base,RL,RL_cds_only,RL_length_only,
	│             RL_no_cassette,RL_no_length,RL_no_repeat}/
	├── temp_0.95/{Base,SFT,RL,RL_cds_only,RL_length_only,
	│              RL_no_cassette,RL_no_length,RL_no_repeat}/
	└── temp_1.1/{Base,RL,RL_cds_only,RL_length_only,
	              RL_no_cassette,RL_no_length,RL_no_repeat}/

	    NOTE: SFT generations exist only at temp_0.95
	    (resampled 2026-04-21). Earlier uploads at temp_0.8
	    and temp_1.1 were byte-identical duplicates of the
	    corresponding Base outputs and have been deleted.

	generations_sweep/                   Temperature sweep (RL + GRPO)
	├── temp_{0.3,0.5,0.7}/{RL,GRPO}/
	├── temp_{0.9}/{GRPO}/
	└── temp_{1.0}/{RL,GRPO}/

	mfe/                                 ViennaRNA MFE density (DNA + RNA params)
	└── {Base,SFT,RL,RL_cds_only,RL_length_only,RL_no_cassette,
	     RL_no_length,RL_no_repeat,GRPO_temp1.0,GRPO_temp0.9}/
	    ├── mfe_results.csv              per-sequence MFE (both RNA and DNA params)
	    └── mfe_summary.json             mean ± std

	qc_results/                          QC for ablation study (temp=0.95, 8 models)
	└── {Base,SFT,RL,...}/passed.csv, failed.csv, aggregate_*.csv, repeats.csv

	qc_baselines/                        QC for rejection sampling
	└── {rejection_sampling,best_of_16}/{Base,SFT,GRPO}/

	qc_sweep/                            QC for temperature sweep
	└── temp_{0.3,...,1.0}/{RL_vllm,GRPO}/

	original_paper/                      Frozen snapshots from the pre-revision draft
	```
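
Because not every model exists at every temperature (SFT only at `temp_0.95`, only GRPO at sweep temperature 0.9), a small lookup can guard against requesting a directory that the layout above does not list. The following is a sketch only: `generation_dir` and the two tables are hypothetical helpers transcribed from the tree, not part of the bucket.

```python
from pathlib import PurePosixPath

# Model folders listed under generations/ in the tree above.
ABLATION_MODELS = [
    "Base", "RL", "RL_cds_only", "RL_length_only",
    "RL_no_cassette", "RL_no_length", "RL_no_repeat",
]

GENERATION_MODELS = {
    "0.8": ABLATION_MODELS,
    "0.95": ["Base", "SFT"] + ABLATION_MODELS[1:],  # SFT exists only here
    "1.1": ABLATION_MODELS,
}

# Model folders listed under generations_sweep/ in the tree above.
SWEEP_MODELS = {
    "0.3": ["RL", "GRPO"],
    "0.5": ["RL", "GRPO"],
    "0.7": ["RL", "GRPO"],
    "0.9": ["GRPO"],
    "1.0": ["RL", "GRPO"],
}


def generation_dir(temp: str, model: str, sweep: bool = False) -> PurePosixPath:
    """Return the bucket-relative directory, or raise KeyError if it is not listed."""
    table = SWEEP_MODELS if sweep else GENERATION_MODELS
    root = "generations_sweep" if sweep else "generations"
    if model not in table.get(temp, []):
        raise KeyError(f"{root}/temp_{temp}/{model} is not listed in the bucket layout")
    return PurePosixPath(root) / f"temp_{temp}" / model
```

For example, `generation_dir("0.95", "SFT")` yields `generations/temp_0.95/SFT`, while `generation_dir("0.8", "SFT")` raises `KeyError`.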

	## Key Results

	### MFE Density (DNA parameters, kcal/mol/nt)

	All MFE values were computed with ViennaRNA 2.7.2 using the Mathews 2004 DNA energy
	parameters. Sequences of ≤3 kb are folded as circular molecules; longer sequences
	are folded in 500-bp sliding windows (stride 250 bp).
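
The folding protocol described above can be sketched as follows. The folding call itself is passed in as `fold`, a hypothetical callable that would wrap ViennaRNA with DNA parameters loaded. Two details are assumptions because the README does not specify them: the final window is anchored at the sequence end, and per-window densities are averaged with equal weight.

```python
from typing import Callable

# (sequence, circular) -> minimum free energy in kcal/mol; hypothetical interface.
FoldFn = Callable[[str, bool], float]


def windows(seq: str, size: int = 500, stride: int = 250) -> list[str]:
    """Split a sequence into overlapping linear windows."""
    if len(seq) <= size:
        return [seq]
    starts = list(range(0, len(seq) - size + 1, stride))
    if starts[-1] != len(seq) - size:
        # Assumption: cover the tail with one extra window anchored at the end.
        starts.append(len(seq) - size)
    return [seq[s:s + size] for s in starts]


def mfe_density(
    seq: str,
    fold: FoldFn,
    circular_max_len: int = 3000,
    size: int = 500,
    stride: int = 250,
) -> float:
    """MFE per nucleotide (kcal/mol/nt) following the protocol described above."""
    if len(seq) <= circular_max_len:
        # Short plasmids: fold the whole sequence as a circular molecule.
        return fold(seq, True) / len(seq)
    # Long plasmids: average the densities of linear sliding windows (assumption).
    per_window = [fold(w, False) / len(w) for w in windows(seq, size, stride)]
    return sum(per_window) / len(per_window)
```

Keeping the folding backend behind a callable makes the windowing logic easy to check with a toy energy function before plugging in the real folding library.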

	| Model | DNA MFE density | n | Notes |
	|---|---:|---:|---|
	| Base | −0.1055 ± 0.0756 | 4000 | baseline, T=0.95 |
	| SFT | −0.1492 ± 0.0284 | 4000 | T=0.95 (resampled 2026-04-21; previous upload was a Base duplicate) |
	| GRPO (temp=1.0) | −0.1491 ± 0.0322 | 4000 | main paper model |
	| RL (full, McClain/PlasmidGPT-RL) | −0.1546 ± 0.0233 | 4000 | ablation-control run, T=0.95 |
	| RL (no repeat) | −0.141 ± 0.031 | 4000 | T=0.95 |
	| RL (no cassette) | −0.134 ± 0.048 | 4000 | T=0.95 |
	| RL (no length) | −0.131 ± 0.025 | 4000 | T=0.95 |
	| RL (length only) | −0.126 ± 0.021 | 4000 | T=0.95 |
	| RL (CDS only) | −0.103 ± 0.022 | 4000 | T=0.95 |

	Addgene reference panel (n=500): −0.151 ± 0.027.

	### Ablation Study (QC pass rate, temp=0.95)

	| Model | Pass rate | Diversity |
	|-------|----------:|----------:|
	| Base | 3.6 % | 1.000 |
	| SFT | 19.7 % | n/a |
	| RL (full) | 53.7 % | 0.132 |
	| RL (no repeat) | 72.2 % | 0.446 |
	| RL (no length) | 71.4 % | 0.446 |
	| RL (length only) | 34.7 % | 0.837 |
	| RL (no cassette) | 19.8 % | 0.183 |
	| RL (CDS only) | 2.4 % | 1.000 |

	### Main 8-prompt evaluation (temp=1.0) — paper Table 1

	| Model | Pass rate |
	|-------|----------:|
	| Base (UCL-CSSB/PlasmidGPT) | 27.0 % |
	| SFT (UCL-CSSB/PlasmidGPT-SFT) | 27.2 % |
	| RL (UCL-CSSB/PlasmidGPT-GRPO) | 71.6 % |

	### Rejection sampling baselines

	| Model | Rejection 10K | Best-of-16 | Diversity |
	|-------|--------------:|-----------:|----------:|
	| Base | 2.8 % | 2.9 % | 1.000 |
	| SFT | 2.5 % | 2.8 % | 1.000 |
	| GRPO | 64.6 % | 64.6 % | 0.549–0.581 |

	### Sampling parameters

	| Experiment | Temperature |
	|---|---:|
	| Training rollouts (GRPO) | 1.229 |
	| Main 8-prompt evaluation (`eval_8prompt/`) | 1.0 |
	| Ablation evaluation (`generations/temp_0.95/`) | 0.95 |
	| Rejection sampling / best-of-16 (`baselines/`) | 0.95 |

	Other decoder settings are constant: top-p 0.90, repetition penalty 1.0,
	max 256 BPE tokens (≈ 5–15 kb of DNA), stop token id 2.
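
As an illustration, the settings above could be expressed as keyword arguments in the style of Hugging Face `generate()`. This is a sketch under assumptions: the README does not say which inference stack was used, `do_sample=True` is added because temperature and top-p only take effect when sampling, and mapping "max 256 BPE tokens" to `max_new_tokens` (rather than a total-length limit) is a guess. The experiment keys are hypothetical labels.

```python
# Decoder settings that the README states are constant across experiments.
SHARED_SETTINGS = {
    "do_sample": True,          # assumption: required for temperature/top-p sampling
    "top_p": 0.90,
    "repetition_penalty": 1.0,
    "max_new_tokens": 256,      # assumption: "max 256 BPE tokens" counts new tokens
    "eos_token_id": 2,          # stop token id from the README
}

# Temperatures from the sampling-parameters table; keys are hypothetical labels.
TEMPERATURES = {
    "training_rollouts": 1.229,
    "eval_8prompt": 1.0,
    "ablation": 0.95,
    "baselines": 0.95,
}


def sampling_kwargs(experiment: str) -> dict:
    """Return generate()-style keyword arguments for one experiment."""
    return {**SHARED_SETTINGS, "temperature": TEMPERATURES[experiment]}
```

With a loaded model, the result would be passed as `model.generate(**inputs, **sampling_kwargs("eval_8prompt"))`.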

	## Models

	| Label | HF repo |
	|---|---|
	| Base | UCL-CSSB/PlasmidGPT |
	| SFT | UCL-CSSB/PlasmidGPT-SFT |
	| RL (main paper model) | UCL-CSSB/PlasmidGPT-GRPO |
	| RL (ablation control) | McClain/PlasmidGPT-RL |
	| Ablations | McClain/plasmidgpt-rl-{cds_only,no_repeat_penalty,no_length_prior,no_cassette_bonus,length_only} |

	## QC Pipeline

	- BLAST (dc-megablast) against the OriDB reference for ORI detection
	- AMRFinderPlus 4.2.7 for ARG detection
	- Prodigal 2.6.3 for gene prediction
	- Suffix-array repeat detection (≥50 bp)
	- Two-stage filter: ORI hits require ≥99 % identity and ≥99 % coverage; AMR hits require 100 % identity and 100 % coverage
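
The pass criterion implied by the steps above (an ORI hit, an AMR hit, and no long repeat) can be sketched as a single predicate. The `Hit` record and the function signature are hypothetical; parsing the actual BLAST and AMRFinderPlus outputs is out of scope here, and treating a repeat of exactly 50 bp as a failure follows the "≥50 bp" wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment hit; both fields are percentages (hypothetical record)."""
    identity: float
    coverage: float


def passes_qc(ori_hits: list[Hit], amr_hits: list[Hit], longest_repeat_bp: int) -> bool:
    """True if the sequence has a qualifying ORI, a qualifying AMR gene, and no repeat >= 50 bp."""
    has_ori = any(h.identity >= 99.0 and h.coverage >= 99.0 for h in ori_hits)
    has_amr = any(h.identity >= 100.0 and h.coverage >= 100.0 for h in amr_hits)
    no_long_repeat = longest_repeat_bp < 50
    return has_ori and has_amr and no_long_repeat
```

For example, `passes_qc([Hit(99.5, 100.0)], [Hit(100.0, 100.0)], 0)` is `True`, while the same sequence with a 50-bp repeat fails.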

	## Data history

	- 2026-04-21 SFT resampled at T=0.95 (4000 seqs) and MFE recomputed.
	The previous uploads at `generations/temp_{0.8,0.95,1.1}/SFT/` were
	byte-identical to the corresponding Base outputs because of a pipeline bug
	during the March upload. The resampled T=0.95 SFT data is correct; the 0.8
	and 1.1 copies were deleted.
	- 2026-04-20 `eval_8prompt/SFT/SFT_metrics.csv` briefly held T=0.95
	data during the SFT resample; it was restored to the original T=1.0 values
	(SHA-verified against an older local copy).
