McClain/PlasmidLMEval / eval_log.md

Eval Suite Log

2026-04-08

11:30 — Phase 0 started. Installed sourmash 4.3, minimap2 2.30, lightgbm 4.6, shap 0.49, sklearn 1.7 into plannotate conda env. Fixed sourmash pkg_resources issue (setuptools<70).

11:45 — Downloaded Addgene-500 reference panel from HF bucket McClain/PlasmidRL/reference/. 500 plasmids, 418 not in training set. Pre-computed: length, GC, longest ORF, MFE density, 3-mer freqs.

11:50 — Pinned tool versions in eval/config/environment.yml. pLannotate 1.2.2 (snapgene DB 2021-11), prodigal 2.6.3, BLAST+ 2.17.0.

11:55 — Built eval/config/feature_categories.yaml mapping pLannotate feature types to Tier 3 booleans (origin, selection_marker, promoter, terminator, cds).

12:00 — Built training-set sourmash signature: 20,644 unique plasmids, k=31 scaled=100. Cached at eval/reference/training_sigs.zip.

12:05 — Building negative controls (random GC-matched, dinucleotide-shuffled, real held-out). Writing generation + annotation pipeline scripts.

12:30 — Negative controls built (500 each: random, shuffled, real). Wrote pipeline scripts: generate.py, annotate.py, compute_metrics.py, firstlight_sweep.py.
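A minimal sketch of the random GC-matched control, assuming each control copies the length of one reference plasmid and draws bases independently with that plasmid's GC fraction. The function names and the toy template are hypothetical; the log does not show the actual script, and the dinucleotide-shuffled control is not covered here.

```python
import random


def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in a DNA string (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def random_gc_matched(template: str, rng: random.Random) -> str:
    """Draw an i.i.d. random sequence with the template's length and expected GC."""
    gc = gc_fraction(template)
    # Each position is G or C with probability gc, otherwise A or T.
    return "".join(
        rng.choice("GC") if rng.random() < gc else rng.choice("AT")
        for _ in range(len(template))
    )


rng = random.Random(0)
template = "ATGCGC" * 2000  # hypothetical 12 kb reference with GC = 4/6
control = random_gc_matched(template, rng)
```

Because bases are drawn independently, the control matches GC only in expectation; an exact match would require shuffling a fixed base composition instead.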

13:00 — Tested generation: 5 seqs in 21s, mean 8.1kb, 54% GC. Fixed tokenizer padding issue (custom PlasmidKmerTokenizer doesn't expose pad/eos tokens properly).

14:00 — Tested annotation pipeline end-to-end: pLannotate 3.1s/seq, prodigal 0.3s total, dustmasker 0.1s total. Results look good — 3/5 test sequences show >60% pLannotate coverage.

14:15 — Launched Phase 1 first-light sweep: 12 cells (temp×top_p×rep_pen), 50 seqs/cell = 600 total. Model loaded once, reused across cells. ETA ~45-60 min.
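The sweep grid can be sketched as a Cartesian product. The specific grid values below are hypothetical (the log only states 12 cells over temp × top_p × rep_pen and names the winning cell); the cell-naming pattern follows the `t0.7_p0.95_r1.0` label used in this log.

```python
from itertools import product

# Hypothetical grid values: only the 12-cell count and the axes come from the log.
TEMPS = [0.7, 1.0, 1.3]
TOP_PS = [0.9, 0.95]
REP_PENS = [1.0, 1.1]
SEQS_PER_CELL = 50


def cell_name(temp: float, top_p: float, rep_pen: float) -> str:
    """Name a cell in the t{temp}_p{top_p}_r{rep_pen} style."""
    return f"t{temp}_p{top_p}_r{rep_pen}"


cells = [
    {"name": cell_name(t, p, r), "temperature": t, "top_p": p, "repetition_penalty": r}
    for t, p, r in product(TEMPS, TOP_PS, REP_PENS)
]
total_sequences = len(cells) * SEQS_PER_CELL
```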

14:16 — huggingface-hub version conflict flagged: upgraded to 1.9.2 for bucket API, but transformers needs <1.0. Downgraded to 0.36.2. Bucket uploads will need separate handling.

14:50 — First-light sweep complete. 12 cells × 50 seqs = 600 generations in 29.6 min. pLannotate annotation initially failed due to --no-banner flag — ran retroactively on all cells (~3 min total).

14:55 — Winner: t0.7_p0.95_r1.0 (temp=0.7, top_p=0.95, rep_pen=1.0). 70.8% median pLannotate coverage, 22.1 mean features, 7.6kb mean length. Clear winner — temp=0.7 dominates, higher temps drop coverage significantly. Repetition penalty doesn't help.

15:05 — Extended sweep to test temp=0.3 and 0.5. Coverage plateaus at ~70% across 0.3-0.7, but temp=0.7 has longest sequences (7.6kb, closest to Addgene ref 7.5kb) and fewest <1kb failures. Keeping temp=0.7.

15:15 — Uploaded all Phase 0+1 data to HF bucket McClain/PlasmidLMEval (81 files, 17.2 MB). Used hf buckets sync CLI — the Python batch_bucket_files() API silently fails for non-README files.

15:20 — Phase 2 started: generating 1000 plasmids with winning config (temp=0.7, top_p=0.95, top_k=50, no rep penalty). Model: McClain/PlasmidLM-kmer6-GRPO-plannotate.

16:06 — Phase 2 complete. 1000 sequences in 46.3 min (21.6 seq/min). Mean length 7.2kb, median 9.0kb, 50.6% GC, 48.7% EOS rate, 5.6% <1kb. Baselines also annotated: random 5.2% hit rate, shuffled 7.6%, real 100%.

16:07 — Phase 3 started: running pLannotate (8 workers) + Prodigal + dustmasker on all 1000 generated sequences.

16:40 — Phase 3 complete. pLannotate: ~2s/seq with 8 workers on 1000 seqs. Prodigal and dustmasker fast (<30s total).

16:41 — Phase 4 metrics computed on all 1000 generated sequences:

Tier 1: Distributional

Length KS stat: 0.256 (p < 1e-19) — significant difference from reference (generated skews longer)
GC KS stat: 0.121 (p = 0.0001) — slight difference, generated slightly lower GC
Wasserstein distances: length=1338bp, GC=0.020
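For reference, the two Tier 1 statistics can be written in a few lines of standard-library Python. This is an illustrative sketch, not the pipeline's code: the log does not say which implementation was used, and p-values are omitted here (`scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance` are common choices for the full computation).

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def wasserstein_1(a, b):
    """1-Wasserstein distance: area between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a + b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both CDFs are constant on [left, right), so the gap is evaluated at left.
        gap = abs(bisect_right(a, left) / len(a) - bisect_right(b, left) / len(b))
        total += gap * (right - left)
    return total
```

The Wasserstein distance is in the units of the input, which is why the length distance above is reported in bp and the GC distance as a fraction.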

Tier 3: Essentials

| Metric | Generated (n=1000) | Real Addgene (n=500) | Random (n=500) | Shuffled (n=500) |
|---|---|---|---|---|
| has_ori | 63.8% | — | — | — |
| has_selection_marker | 65.2% | — | — | — |
| has_promoter | 83.0% | — | — | — |
| has_terminator | 60.1% | — | — | — |
| has_cds | 85.0% | — | — | — |
| plausibility_pass | 59.6% | 95.0% | 0.0% | 0.0% |
| mean n_features | 19.2 | — | — | — |
| mean coverage_frac | 50.1% | — | — | — |

Tier 5: Architecture

CDS with promoter context (within 500bp): 43.0%
CDS with terminator context: 8.6%
Mean origins per plasmid: 1.6
Mean selection markers: 1.8
Mean overlapping features: 5.3

Memorization

Zero hits at containment threshold 0.3 — no evidence of training set memorization
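The containment check can be illustrated with exact k-mer sets. This is a simplified stand-in and an assumption about the idea, not the pipeline's code: sourmash works on canonical (strand-agnostic) k-mers subsampled with FracMinHash (here scaled=100), whereas the sketch below uses every forward-strand k-mer.

```python
def kmers(seq: str, k: int = 31) -> set:
    """All overlapping forward-strand k-mers of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query: str, reference_kmers: set, k: int = 31) -> float:
    """Fraction of the query's distinct k-mers that also occur in the reference."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    return len(query_kmers & reference_kmers) / len(query_kmers)


def is_memorized(query: str, reference_kmers: set, k: int = 31,
                 threshold: float = 0.3) -> bool:
    """Flag a generation whose containment in the training set meets the threshold."""
    return containment(query, reference_kmers, k) >= threshold
```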

Discriminator

LightGBM AUC: 0.937 (real vs generated)
Top features by importance: length (5627), ORF density (3533), 6-mer diversity (1735), GC (1419)
SHAP summary plot saved

16:45 — Baseline metrics computed: random plausibility 0%, shuffled 0%, real Addgene 95%. The generated model at 59.6% is well above the 0% floor but still 35.4 pp below the 95% real-Addgene ceiling.

17:30 — Diversity metrics computed. Model produces comparable diversity to real Addgene plasmids — no mode collapse.

| Metric | Generated | Addgene-500 |
|---|---|---|
| 6-mer JSD (mean pairwise) | 0.352 | 0.334 |
| 6-mer cosine distance | 0.288 | 0.267 |
| 6-mer Jaccard similarity | 0.778 | 0.793 |
| Near-identical pairs (JSD<0.01) | 0.0% | 0.0% |
| Pairs with Jaccard >0.9 | 25.3% | 28.3% |

High Jaccard is expected — real plasmids share extensive backbone sequences (amp resistance, ColE1 ori, etc.). Generated plasmids are slightly more diverse than real ones on all k-mer metrics.
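A sketch of how the pairwise 6-mer metrics might be computed. Two assumptions are not stated in the log: JSD is taken as the base-2 divergence (range 0 to 1) rather than its square root, and Jaccard is computed on the sets of 6-mers present in each sequence.

```python
import math
from collections import Counter
from itertools import combinations


def kmer_counts(seq: str, k: int = 6) -> Counter:
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jsd(p_counts: Counter, q_counts: Counter) -> float:
    """Jensen-Shannon divergence (base 2) between two k-mer count profiles."""
    p_total, q_total = sum(p_counts.values()), sum(q_counts.values())
    divergence = 0.0
    for key in set(p_counts) | set(q_counts):
        p = p_counts.get(key, 0) / p_total
        q = q_counts.get(key, 0) / q_total
        m = (p + q) / 2
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


def jaccard(p_counts: Counter, q_counts: Counter) -> float:
    """Jaccard similarity of the sets of k-mers present in each profile."""
    a, b = set(p_counts), set(q_counts)
    return len(a & b) / len(a | b) if a | b else 1.0


def mean_pairwise(seqs, metric, k: int = 6) -> float:
    """Average a pairwise metric over all unordered pairs of sequences."""
    profiles = [kmer_counts(s, k) for s in seqs]
    pairs = list(combinations(profiles, 2))
    return sum(metric(p, q) for p, q in pairs) / len(pairs)
```

Higher JSD and lower Jaccard both indicate more diverse sets, which is the direction of the Generated column relative to Addgene-500 in the table above.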

17:45 — Extra metrics computed: prompt fidelity, codon usage, GC skew, ViennaRNA MFE (running).

Prompt Fidelity (Conditional Accuracy)

Overall: 23.5% of requested features detected by pLannotate in generated sequences.

| Category | Hit Rate | Found/Requested |
|---|---|---|
| AMR | 39.2% | 538/1373 |
| PROM | 38.5% | 1090/2828 |
| REPORTER | 27.1% | 95/350 |
| ORI | 19.9% | 339/1701 |
| ELEM | 15.9% | 634/3991 |
| TAG | 1.1% | 2/179 |

Updated (sseqid-based via motif registry): 41.6% overall — properly matching against the 660 sseqids in the motif registry that map tokens to pLannotate feature IDs. Previous keyword-based approach (23.5%) was undercounting due to brittle string matching.

| Category | Hit Rate | Found/Requested |
|---|---|---|
| ORI | 51.9% | 882/1701 |
| ELEM | 51.1% | 2040/3991 |
| PROM | 39.5% | 1118/2828 |
| AMR | 29.1% | 399/1373 |
| REPORTER | 21.1% | 74/350 |
| TAG | 2.8% | 5/179 |

Tags remain very low — these are tiny peptide sequences (6-18 bp coding) that BLAST/DIAMOND struggle to detect even in real plasmids.

Detection ceiling: pLannotate on real training plasmids detects only 71.8% of their own tokens, which is the maximum achievable fidelity with this detection method. The generated model reaches 57.9% of that ceiling (41.6% / 71.8%). Per-category ceiling ratios: ORI 73.7%, ELEM 72.9%, PROM 56.0%, AMR 40.9%, REPORTER 31.3%, TAG 4.2%.
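The sseqid-based matching and the ceiling normalization can be sketched as follows. The registry contents and token names are hypothetical placeholders, and treating a token with no registry entry as "not found" is an assumption; only the idea (a requested token is found if any of its registered sseqids appears among the pLannotate hits) comes from the log.

```python
def prompt_fidelity(requested_tokens, detected_sseqids, registry):
    """Return (found, requested) counts for one generated sequence."""
    found = 0
    for token in requested_tokens:
        # A token counts as found if any of its registered sseqids was annotated.
        if registry.get(token, set()) & detected_sseqids:
            found += 1
    return found, len(requested_tokens)


# Hypothetical registry: prompt token -> set of pLannotate sseqids.
registry = {
    "<AMR_A>": {"AmpR", "AmpR_(2)"},
    "<ORI_A>": {"ori"},
    "<TAG_A>": {"6xHis"},
}
found, requested = prompt_fidelity(
    ["<AMR_A>", "<ORI_A>", "<TAG_A>"],
    detected_sseqids={"AmpR", "ori", "lac_promoter"},
    registry=registry,
)

# Fraction of the detection ceiling reached, using the figures reported above.
ceiling_ratio = 100 * 41.6 / 71.8
```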

Codon Usage

3,376 predicted ORFs, 579,368 total codons
JSD vs E. coli codon frequencies: 0.131 (low divergence = realistic codon usage)

GC Skew

Generated mean variation: 0.0566 (reference: 0.0593)
Very close to real Addgene plasmids — model captures characteristic GC skew patterns

ViennaRNA MFE Density (DNA Mathews2004 params)

Reference (Addgene-500): -0.151 kcal/mol/nt
Generated (n=200 subsample, 500bp windows): -0.144 kcal/mol/nt
Close to reference — generated plasmids have realistic thermodynamic stability
Note: RNA.fold O(n³) makes full-length folding impractical for >5kb sequences; used 500bp windowed sampling
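A sketch of the windowed sampling, with the folding routine passed in as a callable so the example stays self-contained. With ViennaRNA the callable could wrap `RNA.fold(window)[1]` (the MFE in kcal/mol) after loading DNA parameters. How windows were actually chosen (random versus tiled, and how many per sequence) is not stated in the log, so the uniform random starts and `n_windows` default below are assumptions.

```python
import random
from typing import Callable, Optional


def windowed_mfe_density(
    seq: str,
    fold_energy: Callable[[str], float],
    window: int = 500,
    n_windows: int = 10,
    rng: Optional[random.Random] = None,
) -> float:
    """Estimate MFE per nucleotide by folding fixed-size windows instead of the full sequence."""
    if not seq:
        raise ValueError("cannot fold an empty sequence")
    rng = rng or random.Random(0)
    if len(seq) <= window:
        # Short sequences are folded whole.
        return fold_energy(seq) / len(seq)
    starts = [rng.randrange(len(seq) - window + 1) for _ in range(n_windows)]
    energies = [fold_energy(seq[s:s + window]) for s in starts]
    return sum(energies) / (n_windows * window)


def toy_energy(subseq: str) -> float:
    """Stand-in for a real folding call: a constant -0.15 kcal/mol per nucleotide."""
    return -0.15 * len(subseq)
```

Folding cost grows roughly cubically with length, so ten 500 bp windows are far cheaper than one fold of a plasmid longer than 5 kb, at the price of ignoring long-range base pairing.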

2026-04-20 — Base-model (pre-GRPO) eval + paired comparison

Context. The original runs/grpo_plannotate_full_20260408/ evaluated only the post-GRPO model (McClain/PlasmidLM-kmer6-GRPO-plannotate). For the NeurIPS submission we need a matched pre-GRPO baseline, paired per-prompt, so we can attribute the improvement in plausibility_pass and per-category hit rate specifically to the post-training step.

GRPO run that produced the uploaded model. One-shot GRPO from the pretrained base (McClain/PlasmidLM-kmer6) with the pLannotate composite reward, no curriculum, no motif warm-start. Config: kl=0.5, clip=0.2, group size 8, prompt batch 8, lr 5e-6, warmup 50, no length penalty; rollouts at temp=0.3 / top_p=0.95 / max_new_tokens=2500. Scheduled 1000 steps, stopped at ~900, step_800 uploaded. The W&B logs for this run are not recoverable; the reconstructed config is committed to the HF repo as training_config.py (matches the S3 checkpoint byte-for-byte via SHA verification).

Paired prompt set. eval/data/paired_prompts_20260420.parquet (1000 rows, 1000 unique) — extracted directly from runs/grpo_plannotate_full_20260408/generations.parquet so the two runs share identical prompts in identical order.

Generation (g6-big, ~67 min total)

Matched-config run (Phase A, paired with GRPO):

Model source: McClain/PlasmidLM-kmer6 downloaded to /opt/dlami/nvme/models/plasmid_lm_kmer6_local/.
Modeling code swap: the base repo's modeling_plasmid_lm.py predates the generate_simple(attention_mask, top_p, eos_token_id, pad_token_id) signature used by the eval pipeline. To keep the two runs on an identical generation path, we replaced it with the copy from McClain/PlasmidLM-kmer6-GRPO-plannotate. Weights, tokenizer, config are unchanged. Forward pass is byte-for-byte equivalent; only the inference helper was upgraded.
Sampling: temp=0.7 top_p=0.95 top_k=50 max_new_tokens=3000 batch_size=8 seed=42.
1000 sequences in 46.7 min, mean length 6.9 kb, median 7.8 kb, 60.7% EOS rate, 3.4% < 1 kb (GRPO for comparison: 7.2 kb / 9.0 kb / 48.7% / 5.6% — base emits EOS more reliably; GRPO shows mild length drift).
Output: runs/base_kmer6_matched_20260420_1303/ in bucket.

Temperature sweep (Phase B, base only, not paired):

Temps ∈ {0.3, 0.5, 1.0} × 100 seqs each from training_pairs_v4.parquet. t=0.7 already covered by Phase A.
Output: runs/base_kmer6_sweep_20260420_1303/t{03,05,10}/.

Post-processing fixes

The plannotate conda env on g6-big had drifted: joblib, lightgbm, shap, sklearn, statsmodels, scipy, pyyaml, sourmash were all missing despite environment.yml listing them. Pip-installed. sourmash downgraded cachetools 7.0.1→5.5.2 (did not break pLannotate). prodigal CLI also missing — installed via apt-get install prodigal.
Bug in eval/scripts/annotate.py: run_dustmasker passes the raw FASTA to dustmasker, which aborts on any zero-residue record. The matched run had 1 empty record (base model emitting EOS immediately for one prompt). Fixed in-place by pre-filtering empty records and reporting the dropped count.
Prompt fidelity method correction. compute_extra_metrics.compute_prompt_fidelity was using a hardcoded keyword-matching dict that yielded ~23% on the GRPO run. The supposed "sseqid via motif registry" method (41.6% overall from 2026-04-09) was never committed — only the eval_log note. Today we implemented it properly: load motif_registry_combined.parquet, build token -> set[sseqid], per-generation score is set intersection with the sseqids pLannotate emits. Matches the 41.6% number on the GRPO run (ORI 51.9%, ELEM 51.1%, PROM 39.5%, AMR 29.1%, REPORTER 21.1%, TAG 2.8% — all identical to the prior note). Also now writes metrics/prompt_fidelity_per_seq.parquet (one row per prompt) so paired analyses are possible without recomputing.
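The empty-record pre-filter described for annotate.py above might look like the following. This is a sketch under assumptions, not the committed fix: the function names are hypothetical and the FASTA is handled as an in-memory string.

```python
def parse_fasta(text: str):
    """Parse FASTA text into [header, sequence] pairs, joining wrapped lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line, ""])
        elif line and records:
            records[-1][1] += line
    return records


def drop_empty_records(text: str):
    """Return (filtered_fasta_text, n_dropped) with zero-residue records removed."""
    records = parse_fasta(text)
    kept = [(header, seq) for header, seq in records if seq]
    filtered = "".join(f"{header}\n{seq}\n" for header, seq in kept)
    return filtered, len(records) - len(kept)
```

Returning the dropped count alongside the filtered text keeps the denominator auditable, so a run with an immediate-EOS generation is reported rather than silently shortened.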

Paired results (matched t=0.7, n=1000)

| Metric | Base | GRPO | Δ |
|---|---|---|---|
| plausibility_pass (structural: has ori + marker + (prom OR cds)) | 56.8% | 59.6% | +2.8 pp |
| prompt_fidelity overall (sseqid intersection) | 35.5% | 41.6% | +6.1 pp |
| fidelity — AMR | 26.1% | 29.1% | +3.0 pp |
| fidelity — ORI | 44.1% | 51.9% | +7.8 pp |
| fidelity — PROM | 35.3% | 39.5% | +4.2 pp |
| fidelity — ELEM | 46.1% | 51.1% | +5.0 pp |
| fidelity — REPORTER | 19.1% | 21.1% | +2.0 pp |
| fidelity — TAG | 8.4% | 2.8% | −5.6 pp |
| discriminator AUC vs Addgene-500 | 1.000 | 0.937 | −0.063 |
| memorization hits (sourmash containment ≥ 0.3) | 0 | 0 | — |
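The structural rule behind plausibility_pass, as stated in the table (ori, marker, and promoter or CDS), can be sketched as a boolean over the Tier 3 flags. The dict layout and the `paired_delta_pp` helper are assumptions for illustration; the flag names are the Tier 3 metric names from this log.

```python
def plausibility_pass(flags: dict) -> bool:
    """Structural check: origin AND selection marker AND (promoter OR CDS)."""
    return bool(
        flags["has_ori"]
        and flags["has_selection_marker"]
        and (flags["has_promoter"] or flags["has_cds"])
    )


def paired_delta_pp(base_pass, grpo_pass) -> float:
    """Difference in pass rate (GRPO minus base) in percentage points, same prompts in the same order."""
    if len(base_pass) != len(grpo_pass):
        raise ValueError("paired inputs must have the same length")
    n = len(base_pass)
    return 100.0 * (sum(grpo_pass) - sum(base_pass)) / n
```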

Base temperature sweep (n=100 per cell)

| Temp | plausibility | prompt_fidelity | disc AUC |
|---|---|---|---|
| 0.3 | 60.0% | 37.8% | 1.000 |
| 0.5 | 56.0% | 32.5% | 1.000 |
| 0.7 (matched, n=1000) | 56.8% | 35.5% | 1.000 |
| 1.0 | 42.0% | 32.3% | 1.000 |

Takeaways

Headline "+8.8 pp plausibility" on the HF model card does not hold up under matched paired sampling; the matched delta is +2.8 pp on plausibility but +6.1 pp on prompt fidelity. Paper figure should lead with fidelity.
Base is generating plausible-but-generic plasmids: similar structural pass rate, but substantially lower on actually containing the requested features. RL mostly improves conditioning, not plasmid-ness.
ORI and ELEM are the biggest fidelity gains (+7.8 / +5.0 pp). TAG regresses (−5.6 pp): short peptide tags rarely get scored by BLAST, so RL learns to deprioritise them.
Discriminator AUC drops from 1.000 for base (trivially separable from real Addgene) to 0.937 for GRPO, which suggests RL reduces generator-specific signatures. Need to inspect SHAP top features on both to confirm this isn't a length/GC artifact.
Need next: paired McNemar + bootstrap CIs on the 1000-prompt data (analysis is in notebooks/base_vs_grpo_analysis.py).

2026-04-20 — Best-of-K rejection sampling, paired, base vs GRPO

Motivation. Reviewer would ask: "can inference-time compute (best-of-K sampling) close the GRPO gap?" A strong negative answer — "GRPO still wins at matched K" — is the cleanest paper story.

Design

Three additional seeds per model (43, 44, 45) on the same 1000 paired prompts, matched sampling config (temp=0.7 top_p=0.95 top_k=50 max_new_tokens=3000). Combined with the existing seed-42 matched runs, this gives four independent samples per prompt per model.

For K ∈ {1, 2, 4} per prompt the best sample is the one with max pLannotate n_found (oracle ranker — real rejection sampling would use a cheap proxy and do worse). Ties resolved by plausibility then by fidelity rate.
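The oracle selection rule can be written as a lexicographic max. The field names are hypothetical, and taking the first K samples in a fixed seed order is an assumption; the log does not say how the K-subsets are drawn for K < 4.

```python
def best_of_k(samples, k: int) -> dict:
    """Pick the oracle-best of the first k samples for one prompt.

    Ranking is lexicographic: most pLannotate hits first, then plausibility,
    then fidelity rate, mirroring the tie-breaking order described above.
    """
    if k < 1 or k > len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    return max(
        samples[:k],
        key=lambda s: (s["n_found"], s["plausible"], s["fidelity"]),
    )


# Hypothetical per-prompt samples, ordered by seed (42, 43, 44, 45).
samples = [
    {"seed": 42, "n_found": 3, "plausible": False, "fidelity": 0.50},
    {"seed": 43, "n_found": 3, "plausible": True, "fidelity": 0.50},
    {"seed": 44, "n_found": 2, "plausible": True, "fidelity": 0.90},
    {"seed": 45, "n_found": 4, "plausible": False, "fidelity": 0.60},
]
```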

Bucket paths:

Base seeds 43, 44, 45: runs/bon_base_seed{43,44,45}_20260420_1303/
GRPO seeds 43, 44, 45: runs/bon_grpo_seed{43,44,45}_20260420_1303/
Existing seed 42: the matched runs already in the bucket.

Total cost: ~7.5 h (6 × 46 min generation + 6 × 30 min pLannotate + metrics).

Per-seed sanity

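| seed | base plaus | GRPO plaus | base fid | GRPO fid | base AUC | GRPO AUC |
|---|---|---|---|---|---|---|
| 42 | 56.8% | 59.6% | 35.5% | 41.6% | 1.000 | 0.937 |
| 43 | 58.9% | 63.0% | 36.1% | 43.4% | 1.000 | 1.000 |
| 44 | 57.1% | 61.6% | 34.1% | 42.6% | 1.000 | 1.000 |
| 45 | 58.6% | 60.8% | 35.9% | 41.8% | 1.000 | 0.999 |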

Per-model run-to-run σ is about 1 pp on both metrics. The earlier "GRPO AUC drops to 0.937" was a single-seed artifact; GRPO AUC is ≈ 1.0 across three of four seeds. The original paper figure should drop the AUC delta claim.

Oracle best-of-K (paired n=1000)

| K | metric | base | GRPO | Δ | paired bootstrap 95% CI | McNemar p |
|---|---|---|---|---|---|---|
| 1 | plausibility | 56.8% | 59.6% | +2.8 pp | [−1.5, +7.0] | 0.21 |
| 2 | plausibility | 79.1% | 84.9% | +5.8 pp | [+2.5, +9.2] | 8×10⁻⁴ |
| 4 | plausibility | 92.1% | 95.8% | +3.7 pp | [+1.7, +5.7] | 4×10⁻⁴ |
| 1 | useful (plaus ∧ fid ≥ 0.5) | 38.7% | 48.5% | +9.8 pp | [+5.7, +13.9] | 6×10⁻⁶ |
| 2 | useful | 60.4% | 73.6% | +13.2 pp | [+9.4, +17.0] | 9×10⁻¹¹ |
| 4 | useful | 78.9% | 89.7% | +10.8 pp | [+7.9, +13.7] | 2×10⁻¹² |
| 1 | fidelity mean | 35.5% | 41.6% | +6.1 pp | [+3.4, +8.8] | — |
| 2 | fidelity mean | 51.6% | 59.7% | +8.0 pp | [+5.7, +10.4] | — |
| 4 | fidelity mean | 64.6% | 72.2% | +7.6 pp | [+6.0, +9.2] | — |
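For readers reproducing the last two columns, here is a standard-library sketch of a paired percentile bootstrap and an exact two-sided McNemar test on per-prompt outcomes. Which McNemar variant (exact binomial or chi-square) and how many bootstrap resamples the analysis script uses are not stated in the log; the choices below are assumptions.

```python
import math
import random


def mcnemar_exact_p(base, grpo) -> float:
    """Two-sided exact McNemar p-value from paired boolean outcomes."""
    b = sum(1 for x, y in zip(base, grpo) if x and not y)  # base passes, GRPO fails
    c = sum(1 for x, y in zip(base, grpo) if y and not x)  # GRPO passes, base fails
    n = b + c
    if n == 0:
        return 1.0
    # Under H0 the discordant pairs split 50/50, so the tail is binomial(n, 0.5).
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(base, grpo, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI (in pp) for mean(grpo - base), resampling prompts with replacement."""
    rng = random.Random(seed)
    diffs = [float(y) - float(x) for x, y in zip(base, grpo)]
    n = len(diffs)
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return 100 * lo, 100 * hi
```

Resampling whole prompts keeps each base/GRPO pair together, which is what makes the interval a paired one. McNemar applies only to the binary metrics, consistent with the empty p-value cells for mean fidelity.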

Interpretation

Rejection sampling does not close the GRPO gap. At K=4, GRPO still leads by +10.8 pp on the composite useful metric (p ≈ 2e-12). Base@K=4 (78.9% useful) is ~11 pp behind GRPO@K=4 (89.7%) and only about 5 pp ahead of GRPO@K=2 (73.6%), i.e. on this metric RL is worth roughly a factor of two in inference-time compute.
Plausibility separates only at K ≥ 2 (p=0.21 at K=1, p<0.001 at K=2 and K=4). The K=1 plausibility delta is not significant, which is why the K=1 matched comparison looked weaker than it is.
Fidelity gain is stable across K (+6 to +8 pp).
Because the ranker is ground-truth pLannotate, real-world rejection sampling (which needs a cheap proxy scorer) would reproduce a smaller fraction of these gains for base, so the "RS doesn't close the gap" claim holds a fortiori.

Script

eval/scripts/analyze_best_of_k.py (committed, PEP 723 script). Reads /tmp/plasmidlm_eval_cache/… populated via hf buckets sync. Recomputes this table.
