# Eval Suite Log

## 2026-04-08

**11:30** — Phase 0 started. Installed sourmash 4.3, minimap2 2.30, lightgbm 4.6, shap 0.49, sklearn 1.7 into plannotate conda env. Fixed sourmash pkg_resources issue (setuptools<70).

**11:45** — Downloaded Addgene-500 reference panel from HF bucket `McClain/PlasmidRL/reference/`. 500 plasmids, 418 not in training set. Pre-computed: length, GC, longest ORF, MFE density, 3-mer freqs.

**11:50** — Pinned tool versions in `eval/config/environment.yml`. pLannotate 1.2.2 (snapgene DB 2021-11), prodigal 2.6.3, BLAST+ 2.17.0.

**11:55** — Built `eval/config/feature_categories.yaml` mapping pLannotate feature types to Tier 3 booleans (origin, selection_marker, promoter, terminator, cds).

**12:00** — Built training-set sourmash signature: 20,644 unique plasmids, k=31 scaled=100. Cached at `eval/reference/training_sigs.zip`.

**12:05** — Building negative controls (random GC-matched, dinucleotide-shuffled, real held-out). Writing generation + annotation pipeline scripts.

**12:30** — Negative controls built (500 each: random, shuffled, real). Wrote pipeline scripts: generate.py, annotate.py, compute_metrics.py, firstlight_sweep.py.

**13:00** — Tested generation: 5 seqs in 21s, mean 8.1kb, 54% GC. Fixed tokenizer padding issue (custom PlasmidKmerTokenizer doesn't expose pad/eos tokens properly).

**14:00** — Tested annotation pipeline end-to-end: pLannotate 3.1s/seq, prodigal 0.3s total, dustmasker 0.1s total. Results look good — 3/5 test sequences show >60% pLannotate coverage.

**14:15** — Launched Phase 1 first-light sweep: 12 cells (temp×top_p×rep_pen), 50 seqs/cell = 600 total. Model loaded once, reused across cells. ETA ~45-60 min.

**14:16** — huggingface-hub version conflict flagged: upgraded to 1.9.2 for bucket API, but transformers needs <1.0. Downgraded to 0.36.2. Bucket uploads will need separate handling.

**14:50** — First-light sweep complete. 12 cells × 50 seqs = 600 generations in 29.6 min. pLannotate annotation initially failed due to `--no-banner` flag — ran retroactively on all cells (~3 min total).

**14:55** — **Winner: t0.7_p0.95_r1.0** (temp=0.7, top_p=0.95, rep_pen=1.0). 70.8% median pLannotate coverage, 22.1 mean features, 7.6kb mean length. Clear winner — temp=0.7 dominates, higher temps drop coverage significantly. Repetition penalty doesn't help.

**15:05** — Extended sweep to test temp=0.3 and 0.5. Coverage plateaus at ~70% across 0.3-0.7, but temp=0.7 has longest sequences (7.6kb, closest to Addgene ref 7.5kb) and fewest <1kb failures. Keeping temp=0.7.

**15:15** — Uploaded all Phase 0+1 data to HF bucket `McClain/PlasmidLMEval` (81 files, 17.2 MB). Used `hf buckets sync` CLI — the Python `batch_bucket_files()` API silently fails for non-README files.

**15:20** — Phase 2 started: generating 1000 plasmids with winning config (temp=0.7, top_p=0.95, top_k=50, no rep penalty). Model: `McClain/PlasmidLM-kmer6-GRPO-plannotate`.

**16:06** — Phase 2 complete. 1000 sequences in 46.3 min (21.6 seq/min). Mean length 7.2kb, median 9.0kb, 50.6% GC, 48.7% EOS rate, 5.6% <1kb. Baselines also annotated: random 5.2% hit rate, shuffled 7.6%, real 100%.

**16:07** — Phase 3 started: running pLannotate (8 workers) + Prodigal + dustmasker on all 1000 generated sequences.

**16:40** — Phase 3 complete. pLannotate: ~2s/seq with 8 workers on 1000 seqs. Prodigal and dustmasker fast (<30s total).

**16:41** — Phase 4 metrics computed on all 1000 generated sequences:
### Tier 1: Distributional

- Length KS stat: 0.256 (p < 1e-19) — significant difference from reference (generated skews longer)
- GC KS stat: 0.121 (p = 0.0001) — slight difference, generated slightly lower GC
- Wasserstein distances: length=1338bp, GC=0.020
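
For readers unfamiliar with the two Tier 1 statistics, here is a minimal stdlib-only sketch of what they measure. The log does not say which implementation produced the numbers above, so this is an illustration rather than the pipeline's code; the function names and the toy data in the usage note are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance: the area between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a + b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both ECDFs are constant on [left, right), so the area is gap * width.
        gap = abs(bisect_right(a, left) / len(a) - bisect_right(b, left) / len(b))
        total += gap * (right - left)
    return total
```

Calling `ks_statistic(generated_lengths, reference_lengths)` on two lists of plasmid lengths gives a value in [0, 1], while `wasserstein_1d` returns a distance in the same unit as the inputs (bp for length, fraction for GC), which is why the two Wasserstein numbers above are on such different scales.
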
### Tier 3: Essentials

| Metric | Generated (n=1000) | Real Addgene (n=500) | Random (n=500) | Shuffled (n=500) |
|---|---|---|---|---|
| has_ori | 63.8% | — | — | — |
| has_selection_marker | 65.2% | — | — | — |
| has_promoter | 83.0% | — | — | — |
| has_terminator | 60.1% | — | — | — |
| has_cds | 85.0% | — | — | — |
| **plausibility_pass** | **59.6%** | **95.0%** | **0.0%** | **0.0%** |
| mean n_features | 19.2 | — | — | — |
| mean coverage_frac | 50.1% | — | — | — |
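
The paired-results table in the 2026-04-20 entry describes `plausibility_pass` as "has ori + marker + (prom OR cds)". Assuming the same rule was used for this table (the log does not state it here), the check can be sketched as below. The dictionary keys mirror the row names above; the function itself is hypothetical.

```python
def plausibility_pass(flags):
    """Structural pass: an origin, a selection marker, and a promoter or a CDS.

    `flags` is a dict of Tier 3 booleans for one plasmid. has_terminator is
    deliberately not part of the rule as quoted in the log.
    """
    return bool(
        flags["has_ori"]
        and flags["has_selection_marker"]
        and (flags["has_promoter"] or flags["has_cds"])
    )
```

Under this rule the pass rate can never exceed the smaller of the `has_ori` and `has_selection_marker` rates, which is consistent with 59.6% sitting just below 63.8% and 65.2%.
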
### Tier 5: Architecture

- CDS with promoter context (within 500bp): 43.0%
- CDS with terminator context: 8.6%
- Mean origins per plasmid: 1.6
- Mean selection markers: 1.8
- Mean overlapping features: 5.3
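
One way to read "CDS with promoter context (within 500bp)" is sketched below. Everything about the rule is an assumption: the log does not say whether the 500 bp window is upstream-only, whether strand is considered, or whether coordinates wrap around the circular plasmid. This sketch ignores strand and wraparound and accepts a promoter on either side; the feature dict layout is hypothetical.

```python
def interval_gap(a, b):
    """Distance in bp between two (start, end) intervals; 0 if they overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def frac_cds_with_context(features, kind="promoter", max_gap=500):
    """Fraction of CDS features with a `kind` feature within `max_gap` bp."""
    cds = [(f["start"], f["end"]) for f in features if f["type"] == "cds"]
    ctx = [(f["start"], f["end"]) for f in features if f["type"] == kind]
    if not cds:
        return 0.0
    hits = sum(any(interval_gap(c, x) <= max_gap for x in ctx) for c in cds)
    return hits / len(cds)
```

Passing `kind="terminator"` gives the analogous terminator-context number under the same assumptions.
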
### Memorization

- **Zero hits** at containment threshold 0.3 — no evidence of training set memorization
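
Containment here is the fraction of a generated sequence's k-mers that also occur in a training sequence. The run used sourmash signatures (k=31, scaled=100), which subsample hashed canonical k-mers; the sketch below is a simplified stand-in that uses exact forward-strand k-mer sets, so its values will not match sourmash exactly. Function names are hypothetical.

```python
def kmers(seq, k=31):
    """Set of all forward-strand k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query, reference, k=31):
    """Fraction of the query's k-mers that also appear in the reference."""
    q = kmers(query, k)
    if not q:
        return 0.0  # query shorter than k: nothing to compare
    return len(q & kmers(reference, k)) / len(q)


def is_memorized(query, training_seqs, threshold=0.3, k=31):
    """True if any training sequence contains at least `threshold` of the query."""
    return any(containment(query, ref, k) >= threshold for ref in training_seqs)
```

Containment is asymmetric by design: a short generated plasmid copied out of a long training plasmid scores near 1.0 even though the Jaccard similarity between the two would be low.
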
### Discriminator

- LightGBM AUC: **0.937** (real vs generated)
- Top features by importance: length (5627), ORF density (3533), 6-mer diversity (1735), GC (1419)
- SHAP summary plot saved

**16:45** — Baseline metrics computed: random plausibility 0%, shuffled 0%, real Addgene 95%. The generated model at 59.6% is well above the 0% floor but still 35.4 pp below the 95% real-plasmid ceiling.

**17:30** — Diversity metrics computed. Model produces comparable diversity to real Addgene plasmids — no mode collapse.

| Metric | Generated | Addgene-500 |
|---|---|---|
| 6-mer JSD (mean pairwise) | 0.352 | 0.334 |
| 6-mer cosine distance | 0.288 | 0.267 |
| 6-mer Jaccard similarity | 0.778 | 0.793 |
| Near-identical pairs (JSD<0.01) | 0.0% | 0.0% |
| Pairs with Jaccard >0.9 | 25.3% | 28.3% |

High Jaccard is expected: real plasmids share extensive backbone sequences (amp resistance, ColE1 ori, etc.). Generated plasmids are slightly more diverse than real ones on every k-mer metric except the near-identical-pair rate, which is 0.0% for both.
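
The pairwise 6-mer metrics in the table can be illustrated with the sketch below. The log does not record the log base used for JSD or whether the divergence or its square root was reported, so base 2 and the plain divergence are assumptions here, as are the function names.

```python
from collections import Counter
from math import log2


def kmer_counts(seq, k=6):
    """Count every overlapping k-mer in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jsd(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2, range [0, 1]) between two k-mer profiles."""
    p_total, q_total = sum(p_counts.values()), sum(q_counts.values())
    out = 0.0
    for kmer in set(p_counts) | set(q_counts):
        p = p_counts[kmer] / p_total  # Counter returns 0 for unseen k-mers
        q = q_counts[kmer] / q_total
        m = (p + q) / 2
        if p:
            out += 0.5 * p * log2(p / m)
        if q:
            out += 0.5 * q * log2(q / m)
    return out


def jaccard(p_counts, q_counts):
    """Jaccard similarity of the two sets of distinct k-mers."""
    a, b = set(p_counts), set(q_counts)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

The "mean pairwise" figures would then be the average of `jsd(kmer_counts(x), kmer_counts(y))` over pairs of plasmids within one set. Note that JSD is a distance-like quantity (higher means more diverse) while Jaccard is a similarity (lower means more diverse), which is why the two columns move in opposite directions in the table.
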
**17:45** — Extra metrics computed: prompt fidelity, codon usage, GC skew, ViennaRNA MFE (running).

### Prompt Fidelity (Conditional Accuracy)

Overall: **23.5%** of requested features detected by pLannotate in generated sequences.

| Category | Hit Rate | Found/Requested |
|---|---|---|
| AMR | 39.2% | 538/1373 |
| PROM | 38.5% | 1090/2828 |
| REPORTER | 27.1% | 95/350 |
| ORI | 19.9% | 339/1701 |
| ELEM | 15.9% | 634/3991 |
| TAG | 1.1% | 2/179 |
**Updated (sseqid-based via motif registry):** 41.6% overall — properly matching against the 660 sseqids in the motif registry that map tokens to pLannotate feature IDs. Previous keyword-based approach (23.5%) was undercounting due to brittle string matching.

| Category | Hit Rate | Found/Requested |
|---|---|---|
| ORI | 51.9% | 882/1701 |
| ELEM | 51.1% | 2040/3991 |
| PROM | 39.5% | 1118/2828 |
| AMR | 29.1% | 399/1373 |
| REPORTER | 21.1% | 74/350 |
| TAG | 2.8% | 5/179 |

Tags remain very low — these are tiny peptide sequences (6-18 bp coding) that BLAST/DIAMOND struggle to detect even in real plasmids.

**Detection ceiling:** pLannotate on real training plasmids detects only 71.8% of their own tokens, which is the maximum achievable fidelity with this detection method. The generated model achieves **57.9% of ceiling** (41.6% / 71.8%). Per-category ceiling ratios: ORI 73.7%, ELEM 72.9%, PROM 56.0%, AMR 40.9%, REPORTER 31.3%, TAG 4.2%.
### Codon Usage

- 3,376 predicted ORFs, 579,368 total codons
- JSD vs E. coli codon frequencies: **0.131** (low divergence = realistic codon usage)

### GC Skew

- Generated mean variation: **0.0566** (reference: 0.0593)
- Very close to real Addgene plasmids — model captures characteristic GC skew patterns
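
The log does not define "mean variation". One plausible reading, sketched below, is the standard deviation of windowed GC skew, (G − C) / (G + C), within each plasmid, averaged over plasmids. The window size, step, and the choice of standard deviation are all assumptions, and the function names are hypothetical.

```python
from statistics import pstdev


def gc_skew_windows(seq, window=1000, step=500):
    """GC skew (G - C) / (G + C) for each sliding window along the sequence."""
    seq = seq.upper()
    skews = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        if g + c:  # skip windows with no G or C at all
            skews.append((g - c) / (g + c))
    return skews


def gc_skew_variation(seq, window=1000, step=500):
    """Population standard deviation of windowed GC skew for one sequence."""
    skews = gc_skew_windows(seq, window, step)
    return pstdev(skews) if len(skews) > 1 else 0.0
```

Averaging `gc_skew_variation` over all generated plasmids and over the reference panel would yield two numbers comparable in kind to the pair reported above, though not necessarily in value.
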
### ViennaRNA MFE Density (DNA Mathews2004 params)

- Reference (Addgene-500): **-0.151** kcal/mol/nt
- Generated (n=200 subsample, 500bp windows): **-0.144** kcal/mol/nt
- Close to reference — generated plasmids have realistic thermodynamic stability
- Note: RNA.fold O(n³) makes full-length folding impractical for >5kb sequences; used 500bp windowed sampling
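
The windowed sampling mentioned in the note can be sketched as follows. The folding routine is passed in as a callable so the example runs without ViennaRNA installed; with ViennaRNA, `RNA.fold` (which returns a `(structure, mfe)` tuple) could be supplied after loading DNA parameters. The number of windows per sequence and the random placement are assumptions, since the log only gives the 500 bp window size.

```python
import random


def windowed_mfe_density(seq, fold, window=500, n_windows=5, seed=0):
    """Mean MFE per nucleotide over randomly placed windows of a non-empty sequence.

    `fold` is any callable taking a sequence and returning (structure, mfe).
    """
    rng = random.Random(seed)
    if len(seq) <= window:
        starts = [0]  # short sequence: fold it whole
    else:
        starts = [rng.randrange(len(seq) - window + 1) for _ in range(n_windows)]
    densities = []
    for start in starts:
        chunk = seq[start:start + window]
        _, mfe = fold(chunk)
        densities.append(mfe / len(chunk))
    return sum(densities) / len(densities)
```

Because each fold is cubic in the window length rather than the plasmid length, the cost per plasmid is bounded by `n_windows` folds of 500 nt regardless of how long the generated sequence is.
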
| ## 2026-04-20 — Base-model (pre-GRPO) eval + paired comparison | |
| **Context.** The original `runs/grpo_plannotate_full_20260408/` evaluated only the post-GRPO | |
| model (`McClain/PlasmidLM-kmer6-GRPO-plannotate`). For the NeurIPS submission we need a | |
| matched pre-GRPO baseline, paired per-prompt, so we can attribute the improvement in | |
| plausibility_pass and per-category hit rate specifically to the post-training step. | |
| **GRPO run that produced the uploaded model.** One-shot GRPO from the pretrained base | |
| (`McClain/PlasmidLM-kmer6`) with the pLannotate composite reward, no curriculum, no | |
| motif warm-start. Config: kl=0.5, clip=0.2, group size 8, prompt batch 8, lr 5e-6, warmup 50, | |
| no length penalty; rollouts at temp=0.3 / top_p=0.95 / max_new_tokens=2500. Scheduled 1000 | |
| steps, stopped at ~900, step_800 uploaded. The W&B logs for this run are not recoverable; | |
| the reconstructed config is committed to the HF repo as `training_config.py` (matches the | |
| S3 checkpoint byte-for-byte via SHA verification). | |
| **Paired prompt set.** `eval/data/paired_prompts_20260420.parquet` (1000 rows, 1000 unique) — | |
| extracted directly from `runs/grpo_plannotate_full_20260408/generations.parquet` so the two | |
| runs share identical prompts in identical order. | |
| ### Generation (g6-big, ~67 min total) | |
| Matched-config run (Phase A, paired with GRPO): | |
| - Model source: `McClain/PlasmidLM-kmer6` downloaded to | |
| `/opt/dlami/nvme/models/plasmid_lm_kmer6_local/`. | |
| - Modeling code swap: the base repo's `modeling_plasmid_lm.py` predates the | |
| `generate_simple(attention_mask, top_p, eos_token_id, pad_token_id)` signature used by | |
| the eval pipeline. To keep the two runs on an identical generation path, we replaced it | |
| with the copy from `McClain/PlasmidLM-kmer6-GRPO-plannotate`. Weights, tokenizer, config | |
| are unchanged. Forward pass is byte-for-byte equivalent; only the inference helper was | |
| upgraded. | |
| - Sampling: temp=0.7 top_p=0.95 top_k=50 max_new_tokens=3000 batch_size=8 seed=42. | |
| - 1000 sequences in 46.7 min, mean length 6.9 kb, median 7.8 kb, 60.7% EOS rate, 3.4% < 1 kb | |
| (GRPO for comparison: 7.2 kb / 9.0 kb / 48.7% / 5.6% — base emits EOS more reliably; | |
| GRPO shows mild length drift). | |
| - Output: `runs/base_kmer6_matched_20260420_1303/` in bucket. | |
| Temperature sweep (Phase B, base only, not paired): | |
| - Temps ∈ {0.3, 0.5, 1.0} × 100 seqs each from `training_pairs_v4.parquet`. t=0.7 already | |
| covered by Phase A. | |
| - Output: `runs/base_kmer6_sweep_20260420_1303/t{03,05,10}/`. | |
| ### Post-processing fixes | |
| - The `plannotate` conda env on g6-big had drifted: `joblib`, `lightgbm`, `shap`, `sklearn`, | |
| `statsmodels`, `scipy`, `pyyaml`, `sourmash` were all missing despite `environment.yml` | |
| listing them. Pip-installed. `sourmash` downgraded `cachetools` 7.0.1→5.5.2 (did not break | |
| pLannotate). `prodigal` CLI also missing — installed via `apt-get install prodigal`. | |
| - Bug in `eval/scripts/annotate.py`: `run_dustmasker` passes the raw FASTA to `dustmasker`, | |
| which aborts on any zero-residue record. The matched run had 1 empty record (base model | |
| emitting EOS immediately for one prompt). Fixed in-place by pre-filtering empty records | |
| and reporting the dropped count. | |
| - **Prompt fidelity method correction.** `compute_extra_metrics.compute_prompt_fidelity` | |
| was using a hardcoded keyword-matching dict that yielded ~23% on the GRPO run. The | |
| supposed "sseqid via motif registry" method (41.6% overall from 2026-04-09) was never | |
| committed — only the eval_log note. Today we implemented it properly: load | |
| `motif_registry_combined.parquet`, build `token -> set[sseqid]`, per-generation score is | |
| set intersection with the sseqids pLannotate emits. Matches the 41.6% number on the GRPO | |
| run (ORI 51.9%, ELEM 51.1%, PROM 39.5%, AMR 29.1%, REPORTER 21.1%, TAG 2.8% — all | |
| identical to the prior note). Also now writes `metrics/prompt_fidelity_per_seq.parquet` | |
| (one row per prompt) so paired analyses are possible without recomputing. | |
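
The sseqid-intersection scoring described in the last bullet can be sketched as below. This is an illustration of the stated rule (token → set of sseqids, hit if the set intersects what pLannotate emits), not the committed `compute_prompt_fidelity` code. How tokens missing from the registry are handled is an assumption (they are skipped here), and the token and sseqid strings in the usage note are made up.

```python
def prompt_fidelity(prompt_tokens, detected_sseqids, registry):
    """Score one generation; returns (n_found, n_requested).

    registry: dict mapping a prompt token to the set of pLannotate sseqids
    that count as a detection of that token.
    """
    detected = set(detected_sseqids)
    # Assumption: tokens without a registry entry cannot be scored, so skip them.
    scored = [tok for tok in prompt_tokens if tok in registry]
    found = sum(1 for tok in scored if registry[tok] & detected)
    return found, len(scored)


def fidelity_rate(n_found, n_requested):
    """Per-sequence hit rate; defined as 0.0 when nothing scorable was requested."""
    return n_found / n_requested if n_requested else 0.0
```

For example, with `registry = {"<ORI_X>": {"ori_a", "ori_b"}, "<AMR_Y>": {"amr_a"}}`, a prompt requesting both tokens and a pLannotate output containing only `ori_b` scores one of two. The log does not say whether the overall figure is pooled over all requested tokens or averaged per sequence, so that aggregation choice is left to the caller.
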
| ### Paired results (matched t=0.7, n=1000) | |
| | Metric | Base | GRPO | Δ | | |
| |---|---:|---:|---:| | |
| | plausibility_pass (structural: has ori + marker + (prom OR cds)) | 56.8% | 59.6% | +2.8 pp | | |
| | prompt_fidelity overall (sseqid intersection) | **35.5%** | **41.6%** | **+6.1 pp** | | |
| | fidelity — AMR | 26.1% | 29.1% | +3.0 pp | | |
| | fidelity — ORI | 44.1% | 51.9% | **+7.8 pp** | | |
| | fidelity — PROM | 35.3% | 39.5% | +4.2 pp | | |
| | fidelity — ELEM | 46.1% | 51.1% | +5.0 pp | | |
| | fidelity — REPORTER | 19.1% | 21.1% | +2.0 pp | | |
| | fidelity — TAG | 8.4% | 2.8% | **−5.6 pp** | | |
| | discriminator AUC vs Addgene-500 | 1.000 | 0.937 | −0.063 | | |
| | memorization hits (sourmash containment ≥ 0.3) | 0 | 0 | — | | |
| ### Base temperature sweep (n=100 per cell) | |
| | Temp | plausibility | prompt_fidelity | disc AUC | | |
| |---:|---:|---:|---:| | |
| | 0.3 | 60.0% | 37.8% | 1.000 | | |
| | 0.5 | 56.0% | 32.5% | 1.000 | | |
| | 0.7 (matched, n=1000) | 56.8% | 35.5% | 1.000 | | |
| | 1.0 | 42.0% | 32.3% | 1.000 | | |
| ### Takeaways | |
| - Headline "+8.8 pp plausibility" on the HF model card does not hold up under matched | |
| paired sampling; the matched delta is +2.8 pp on plausibility but **+6.1 pp on prompt | |
| fidelity**. Paper figure should lead with fidelity. | |
| - Base is generating plausible-but-generic plasmids: similar structural pass rate, but | |
| substantially lower on actually containing the requested features. RL mostly improves | |
| *conditioning*, not *plasmid-ness*. | |
- ORI and ELEM are the biggest fidelity gains (+7.8 / +5.0 pp). TAG regresses (−5.6 pp).
A plausible explanation is that short peptide tags are rarely detected by BLAST, so they
contribute little reward and RL learns to deprioritise them.
- Discriminator AUC goes from 1.000 (base, trivially separable from real Addgene) to
0.937 (GRPO, somewhat harder to separate), which suggests RL reduces generator-specific
signatures. Need to inspect SHAP top features on both to confirm this isn't a
length/GC artifact.
- Next: paired McNemar + bootstrap CIs on the 1000-prompt data (analysis is in
`notebooks/base_vs_grpo_analysis.py`).
| ## 2026-04-20 — Best-of-K rejection sampling, paired, base vs GRPO | |
**Motivation.** A reviewer will likely ask: "can inference-time compute (best-of-K sampling)
close the GRPO gap?" A strong negative answer ("GRPO still wins at matched K") is the
cleanest paper story.
| ### Design | |
| Three additional seeds per model (43, 44, 45) on the same 1000 paired prompts, matched | |
| sampling config (temp=0.7 top_p=0.95 top_k=50 max_new_tokens=3000). Combined with the | |
| existing seed-42 matched runs, this gives four independent samples per prompt per model. | |
For each K ∈ {1, 2, 4}, the best sample per prompt is the one with **max pLannotate n_found**
(an oracle ranker; real rejection sampling would use a cheap proxy and do worse). Ties are
resolved by plausibility, then by fidelity rate.
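
The selection rule above amounts to a lexicographic maximum, sketched below. The dict keys are hypothetical, and taking the first K draws in seed order is an assumption; the log does not say which subset of the four seeds is used at K=2.

```python
def best_of_k(samples, k):
    """Pick the oracle-best sample among the first k draws for one prompt.

    Ranking is lexicographic: most requested features found, then structural
    plausibility (True beats False), then per-sequence fidelity rate.
    """
    return max(
        samples[:k],
        key=lambda s: (s["n_found"], s["plausibility_pass"], s["fidelity"]),
    )
```

Here `samples` would hold one prompt's generations ordered by seed (42, 43, 44, 45), each annotated with its pLannotate-derived scores. On a complete tie `max` keeps the earliest draw, so the result is deterministic.
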
| Bucket paths: | |
| - Base seeds 43, 44, 45: `runs/bon_base_seed{43,44,45}_20260420_1303/` | |
| - GRPO seeds 43, 44, 45: `runs/bon_grpo_seed{43,44,45}_20260420_1303/` | |
| - Existing seed 42: the matched runs already in the bucket. | |
| Total cost: ~7.5 h (6 × 46 min generation + 6 × 30 min pLannotate + metrics). | |
| ### Per-seed sanity | |
| | seed | base plaus | GRPO plaus | base fid | GRPO fid | base AUC | GRPO AUC | | |
| |---:|---:|---:|---:|---:|---:|---:| | |
| | 42 | 56.8% | 59.6% | 35.5% | 41.6% | 1.000 | 0.937 | | |
| | 43 | 58.9% | 63.0% | 36.1% | 43.4% | 1.000 | 1.000 | | |
| | 44 | 57.1% | 61.6% | 34.1% | 42.6% | 1.000 | 1.000 | | |
| | 45 | 58.6% | 60.8% | 35.9% | 41.8% | 1.000 | 0.999 | | |
| Per-model run-to-run σ is about 1 pp on both metrics. The earlier "GRPO AUC drops to | |
| 0.937" was a single-seed artifact; GRPO AUC is ≈ 1.0 across three of four seeds. The | |
| original paper figure should drop the AUC delta claim. | |
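
The run-to-run spread can be checked directly from the per-seed table. The snippet below only re-enters the four plausibility and fidelity columns shown above and computes the sample standard deviation across seeds.

```python
from statistics import mean, stdev

# Values copied from the per-seed sanity table (seeds 42, 43, 44, 45), in percent.
per_seed = {
    "base plaus": [56.8, 58.9, 57.1, 58.6],
    "GRPO plaus": [59.6, 63.0, 61.6, 60.8],
    "base fid": [35.5, 36.1, 34.1, 35.9],
    "GRPO fid": [41.6, 43.4, 42.6, 41.8],
}

for name, values in per_seed.items():
    print(f"{name}: mean={mean(values):.2f} sd={stdev(values):.2f}")
```

The four sample standard deviations fall between roughly 0.8 and 1.4 pp, in line with the "about 1 pp" figure, and with only four seeds each estimate is itself noisy.
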
| ### Oracle best-of-K (paired n=1000) | |
| | K | metric | base | GRPO | Δ | paired bootstrap 95% CI | McNemar p | | |
| |---:|---|---:|---:|---:|---|---:| | |
| | 1 | plausibility | 56.8% | 59.6% | +2.8 pp | [−1.5, +7.0] | 0.21 | | |
| | 2 | plausibility | 79.1% | 84.9% | +5.8 pp | [+2.5, +9.2] | 8×10⁻⁴ | | |
| | 4 | plausibility | 92.1% | 95.8% | +3.7 pp | [+1.7, +5.7] | 4×10⁻⁴ | | |
| | 1 | **useful (plaus ∧ fid ≥ 0.5)** | **38.7%** | **48.5%** | **+9.8 pp** | **[+5.7, +13.9]** | **6×10⁻⁶** | | |
| | 2 | useful | 60.4% | 73.6% | +13.2 pp | [+9.4, +17.0] | 9×10⁻¹¹ | | |
| | 4 | useful | 78.9% | 89.7% | +10.8 pp | [+7.9, +13.7] | 2×10⁻¹² | | |
| | 1 | fidelity mean | 35.5% | 41.6% | +6.1 pp | [+3.4, +8.8] | — | | |
| | 2 | fidelity mean | 51.6% | 59.7% | +8.0 pp | [+5.7, +10.4] | — | | |
| | 4 | fidelity mean | 64.6% | 72.2% | +7.6 pp | [+6.0, +9.2] | — | | |
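
For reference, the two paired statistics in the table can be sketched with the standard library alone. This is not `analyze_best_of_k.py`: whether that script uses the exact binomial or the chi-square form of McNemar's test, and how many bootstrap resamples it draws, is not recorded in the log, so both are assumptions here.

```python
import random
from math import comb


def mcnemar_exact(base, grpo):
    """Two-sided exact McNemar p-value from paired 0/1 outcomes."""
    b = sum(1 for x, y in zip(base, grpo) if x and not y)  # base passes, GRPO fails
    c = sum(1 for x, y in zip(base, grpo) if y and not x)  # GRPO passes, base fails
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(base, grpo, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(grpo - base), resampling prompts with replacement."""
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(base, grpo)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

Both functions take two equal-length lists aligned by prompt. McNemar applies only to the binary metrics (plausibility, useful), which is why the fidelity-mean rows carry a bootstrap CI but no p-value; the bootstrap works for either because it resamples per-prompt differences.
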
| ### Interpretation | |
- **Rejection sampling does not close the GRPO gap.** At K=4, GRPO still leads by +10.8 pp
on the composite useful metric (p ≈ 2e-12). Base@K=4 (78.9% useful) is ~11 pp behind
GRPO@K=4 (89.7%) and only about 5 pp ahead of GRPO@K=2 (73.6%), i.e. RL is worth
roughly a factor of two in inference-time compute on this metric.
- Plausibility separates only at K ≥ 2 (p=0.21 at K=1, p<0.001 at K=2 and K=4). The K=1
plausibility delta is not significant, which is why the K=1 matched comparison looked
weaker than the multi-sample picture.
| - Fidelity gain is stable across K (+6 to +8 pp). | |
| - Because the ranker is ground-truth pLannotate, real-world rejection sampling (which | |
| needs a cheap proxy scorer) would reproduce a smaller fraction of these gains for base, | |
| so the "RS doesn't close the gap" claim holds a fortiori. | |
| ### Script | |
| `eval/scripts/analyze_best_of_k.py` (committed, PEP 723 script). Reads | |
| `/tmp/plasmidlm_eval_cache/…` populated via `hf buckets sync`. Recomputes this table. | |