
McClain

4 days ago


UCL-CSSB/PlasmidRL-ICML — index of canonical artifacts

ICML camera-ready artifacts for Effects of Structural Reward Shaping on Biophysical Properties in RL-Trained Plasmid Generators.

Lineage (parallel post-training paths from the same Base):

Base = UCL-CSSB/PlasmidGPT
├─→ SFT next-token loss      → UCL-CSSB/PlasmidGPT-SFT  (sha daeaabf, post-2026-05-02 cleanup)
└─→ GRPO reward shaping       → UCL-CSSB/PlasmidGPT-GRPO  (sha db2462a)

The 5 reward-ablation models each branch separately from SFT. McClain/PlasmidGPT-RL also branches from SFT; it is kept in deprecated/early_rl_lineage/ as appendix material.

Headline numbers (analysis2 strict QC, 8-prompt eval, T-matched)

| Model | T=1.0 pass rate (n=4000) | T=0.95 pass rate (n=4000) | rejection_v3 pass rate (n=10K, 8-prompt @ sweep-optimal T) |
|---|---|---|---|
| Base | 4.275% | - | 3.99% (T=1.0) |
| SFT | 10.975% | - | 10.87% (T=1.0) |
| GRPO | 71.575% | 66.875% | 78.66% (T=1.15) |

Headline lift (GRPO/Base @ T=1.0): ~16.7×.
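
The lift is the ratio of the two T=1.0 pass rates listed above. A minimal sketch of that arithmetic, with the values copied from the table; the dictionary and function names are illustrative and do not come from the repository:

```python
# Strict-QC pass rates (%) at T=1.0, n=4000, copied from the headline numbers.
pass_rate_t1 = {"Base": 4.275, "SFT": 10.975, "GRPO": 71.575}


def lift(model: str, baseline: str = "Base") -> float:
    """Ratio of a model's pass rate to the baseline's pass rate."""
    return pass_rate_t1[model] / pass_rate_t1[baseline]


print(f"GRPO/Base lift: {lift('GRPO'):.1f}x")
print(f"SFT/Base lift:  {lift('SFT'):.1f}x")
```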

Layout

README.md                             interim
INDEX.md                              this file
SFT_STALE.md                          status flags

analysis/                             truth-set CSVs + distribution metrics + plannotate
├── distribution_metrics.csv          per-cell length/GC/ORF/JSD/Jaccard
├── distribution/per_seq_{Base,SFT,RL}.csv
├── distribution_report.html
└── (table1_*, table4_*, ... pending manifest build)

continuation_benchmark/               held-out continuation + surprisal benchmarks
├── eval_set_656/                     656 plasmids × 5 splits — primary eval
│   ├── summary.json
│   ├── per_split_{completion,surprisal}.csv
│   ├── per_plasmid_{completion,surprisal}.csv
│   ├── all_{completion,surprisal}.csv  (window-level)
│   ├── full_set.fasta
│   ├── metadata.tsv
│   └── report.html
├── heldout_eng_r3/                   PLSDB-style (F1-F6 NCBI queries) engineered held-out
│   ├── summary.json
│   ├── all_{completion,surprisal}.csv
│   └── report.html
├── both_metric_eval/                 47 archetype-matched (5 archetypes)
│   ├── summary.json
│   ├── joint_per_plasmid.csv
│   ├── per_plasmid_{completion,surprisal}.csv
│   ├── metadata.tsv
│   └── both_metric_candidates.fasta
├── validation_eval/                  80-plasmid regression test (3 strata)
│   ├── summary.json
│   ├── per_plasmid_{completion,surprisal}.csv
│   ├── metadata.tsv
│   └── validation_set.fasta
└── holdout30_non_addgene/            29 curated non-Addgene
    ├── holdout30_non_addgene.csv
    └── holdout30_non_addgene.fasta

evaluation/                           generation outputs + QC
├── eight_prompt/{Base,SFT,RL}/       Table 1 sources at T=1.0
└── eight_prompt/ablations/{full_reward,5×reward_ablations}/   Table 7

mfe/                                  MFE under DNA Mathews 2004 params
├── Base/                              n=4000 SFT-fixed Base, T=0.95 (paper original)
├── RL/                               n=4000 GRPO @ T=1.0 (= old GRPO_temp1.0)
├── ablations/{...}/                  5 reward-ablation models
├── SFT_real/                         n=4000 SFT @ T=1.0 (replaces stale mfe/SFT/) — mean −0.148
├── SFT_circ10k_subset/               96 stratified, circular-folding for short seqs — mean −0.172
├── SFT_temp_sweep/                   200/T at T={0.5, 0.8, 0.95, 1.0, 1.15, 1.3}
├── RL_t1.15_8prompt/                 GRPO @ T=1.15 (sweep-optimal) — mean −0.155
└── RL_temp_sweep_2prompt/            GRPO across T={0.5, ..., 1.3}, 2-prompt protocol

rejection_sampling_v2/                ORIGINAL paper Table 4 source (2-prompt, plasmidkit-loose QC)
├── direct/{Base,SFT,GRPO}/           SFT cell still uses pre-fix checkpoint — see SFT_STALE.md
└── best_of_16/{Base,SFT,GRPO}/

rejection_v3/                         NEW — 8-prompt × 1250 = 10K, analysis2 strict QC
├── Base/    metadata.json (3.99%)
├── SFT/     metadata.json (10.87%)
└── GRPO/    metadata.json (78.66% @ T=1.15)

rejection_topK/                       NEW — top-K-of-K sampling success rate (M=50 trials, 8 prompts)
├── summary.json                       success rate per (model, K∈{1,4,16,64})
├── success_per_model_K.csv
├── success_summary.csv               per-prompt breakdown
├── diversity.csv                     Jaccard similarity of kept samples
├── ori_usage.csv                     ORI breakdown of kept samples
├── amr_usage.csv                     AMR breakdown of kept samples
├── per_attempt.csv                   trial-level data
└── kept_samples.csv + .fasta         all selected samples

plannotate/                           Table 8 — pLannotate-detected ORI breakdown
├── RL/                               GRPO @ T=1.0
└── {Base_t0.95, SFT_t0.95}/          supplementary (T=0.95 versions)

novelty_blastn/summary.csv            Table 2 (n=22/28/30 BLAST against Addgene)

reference/addgene_500/                Reference panel: plasmids.csv + metrics.csv + 3mer_freqs.json

models/pinned_shas.csv                8 model commit SHAs (SFT updated to daeaabf)

code_snapshots/                       git SHAs (PlasmidRL, analysis2, plasmid-rl-paper-2)

manifests/                            pending — paper_v2_camera_ready.json + deprecated.json

deprecated/                           audit trail (v1 baselines, early RL lineage, old figures)

original_paper/                       frozen pre-revision data

Key new findings vs paper draft

Headline lift is ~16.7×, not 2.7×. The change comes from QC-pipeline tightening (analysis2 strict QC): the 71.6% RL number is unchanged, while Base/SFT drop because the loose QC was overly permissive.
Alignment tax on continuation logprob is real and replicated: across the 656/47/80-plasmid evals, RL's continuation logprob is 2 to 3 nats per window lower than SFT's, and RL beats SFT in only 0–12% of plasmids. The paper's "evidence too thin to claim alignment tax" should flip to "alignment tax is measurable; RL trades next-token prediction for QC pass rate".
Lineage is parallel, not serial: GRPO was trained from Base, not from SFT. The Abstract / §3.2 framing of "RL preserves the SFT-induced thermodynamic manifold" needs rewording, because RL cannot inherit what it never saw. Both SFT and GRPO independently land near real-plasmid MFE / JSD / ORF length via different mechanisms.
SFT generates plasmid-scale sequences (mean length 5,441 bp, ORF 272 aa). The old SFT data showed a Base-like 1,970 bp due to the model.safetensors checkpoint dispatch issue. SFT's MFE of −0.148 closely matches the paper's reported −0.149.
Diversity convention: the paper uses Jaccard distance (~1.0 = diverse), whereas the new distribution_metrics.csv reports Jaccard similarity (low = diverse). Convert via distance = 1 − similarity when reading. The paper's "RL diversity 0.573" corresponds to the new "RL Jaccard similarity 0.426" (they agree up to rounding in the third decimal). Same signal.
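
The similarity-to-distance conversion can be sketched as follows. This is an illustration under stated assumptions, not the analysis2 implementation: it uses k-mers on the linear forward strand only (no reverse complement, no circular wrap-around), takes k=21 from the 21-mer metric described in the ablation notes, and all function names are hypothetical.

```python
from itertools import combinations


def kmer_set(seq: str, k: int = 21) -> set[str]:
    """All distinct k-mers of a sequence (linear, forward strand only: an assumption)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a: set[str], b: set[str]) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_similarity(seqs: list[str], k: int = 21) -> float:
    """Mean Jaccard similarity over all unordered pairs of sequences."""
    sets = [kmer_set(s, k) for s in seqs]
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)


def similarity_to_distance(similarity: float) -> float:
    """Paper convention: distance = 1 - similarity (higher = more diverse)."""
    return 1.0 - similarity


# Toy example with k=3 so the short sequences still share k-mers.
toy = ["ACGTAC", "ACGTTT", "GGGCCC"]
sim = mean_pairwise_similarity(toy, k=3)
print(f"similarity={sim:.3f}, distance={similarity_to_distance(sim):.3f}")
```

Reading a similarity of 0.426 through `similarity_to_distance` gives 0.574, which is the paper-style number up to rounding.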

Ablation table (T=1.15, post-2026-05-05)

evaluation/eight_prompt/ablations/manifest.json is the source of truth. Pass rates at sweep-optimal T=1.15 (matching rejection-sampling protocol):

| Ablation | Pass% | MFE (kcal/mol/bp) | Diversity (1 − Jaccard similarity) | T=0.95 pass% (deprecated) |
|---|---|---|---|---|
| full_reward | 78.35% | −0.165 | 0.585 | 66.88% |
| no_repeat_penalty | 75.15% | −0.151 | 0.419 | 72.17% |
| no_length_prior | 72.15% | −0.140 | 0.459 | 71.38% |
| no_cassette_bonus | 44.52% | −0.170 | 0.369 | 19.80% |
| length_only | 37.90% | −0.131 | 0.772 | 34.73% |
| cds_only | 1.73% | −0.130 | 0.861 | 2.40% |
| Addgene baseline | - | - | 0.925 | - |

Pass rate from evaluation/eight_prompt/ablations/{cell}/qc/qc_summary.csv (n=4000 generations per cell, analysis2 strict QC).
MFE from evaluation/eight_prompt/ablations/{cell}/mfe/mfe_summary.json (n=200 random subset per cell at T=1.15; cds_only n=69 because only 69 sequences pass strict QC; computed with ViennaRNA 2.7.2 / Mathews 2004 DNA params, 1000 bp window).
Diversity: 1 − mean pairwise 21-mer Jaccard similarity (n=200 sampled from passing sequences, except cds_only n=69). Addgene baseline (n=200 from reference) = 0.9245.
Per-cell metadata.json carries seed, sampling params, model SHA, and sha256 of every output file.
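
Because each cell records a sha256 for every output file, a downloaded cell can be checked for integrity. The sketch below is a guess at how that might look: the schema of metadata.json is not documented here, so the `"files"` key mapping relative paths to hex digests is a hypothetical layout, and `verify_cell` is an illustrative helper rather than a script shipped with the artifacts.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large outputs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_cell(cell_dir: Path, files_key: str = "files") -> list[str]:
    """Return relative paths that are missing or whose hash differs from metadata.json.

    ASSUMPTION: metadata.json holds {files_key: {"relative/path": "<sha256 hex>"}}.
    Adjust files_key (or the loop) to the real schema before relying on this.
    """
    metadata = json.loads((cell_dir / "metadata.json").read_text())
    mismatches = []
    for rel_path, expected in metadata[files_key].items():
        target = cell_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(rel_path)
    return mismatches
```

Calling `verify_cell(Path("evaluation/eight_prompt/ablations/full_reward"))` on a local copy would return an empty list when every recorded file is present and unmodified, provided the assumed schema holds.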

Cassette-bonus removal is still the largest single-component drop (78.35 → 44.52 = 33.8 pp). T=0.95 data preserved at deprecated/ablations_t0.95/ (bucket-state archive) and deprecated/ablations_t0.95_source/ (strict-QC source files at T=0.95).
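
The per-component drops can be recomputed from the T=1.15 pass rates in the ablation table. A small sketch with the values copied from that table; it assumes, from the cell names alone, that the `no_*` cells each remove exactly one reward component, and it leaves out `length_only` and `cds_only` because their names suggest single-component rewards rather than single removals:

```python
# Strict-QC pass rates (%) at T=1.15, copied from the ablation table.
pass_rate = {
    "full_reward": 78.35,
    "no_repeat_penalty": 75.15,
    "no_length_prior": 72.15,
    "no_cassette_bonus": 44.52,
}


def drops_vs_full(rates: dict[str, float]) -> dict[str, float]:
    """Percentage-point drop of each ablation relative to full_reward."""
    full = rates["full_reward"]
    return {
        name: round(full - rate, 2)
        for name, rate in rates.items()
        if name != "full_reward"
    }


drops = drops_vs_full(pass_rate)
largest = max(drops, key=drops.get)
for name, drop in sorted(drops.items(), key=lambda item: -item[1]):
    print(f"{name:<20} {drop:5.2f} pp")
print(f"largest single-component drop: {largest}")
```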

Pending decisions

Table 4 protocol: keep paper's 2-prompt v2 (plasmidkit-loose QC) or switch to 8-prompt rejection_v3 (analysis2 strict QC)?
Table 5 source: smaller 11-plasmid (paper-original numbers reproduce) or larger 656-plasmid eval_set?
MFE protocol: 8-prompt full or 2-prompt protocol for the camera-ready Table 6 row?
Table 7 row 1 number: 66.875% (GRPO @ T=0.95) recommended; awaiting final paper-text confirmation.
