language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - biology
  - genomics
  - dna
  - plasmid
  - synthetic-biology
  - causal-lm
  - reinforcement-learning
  - grpo
  - custom_code
base_model: McClain/PlasmidLM-kmer6
datasets:
  - custom
pipeline_tag: text-generation
model-index:
  - name: PlasmidLM-kmer6-GRPO-plannotate
    results:
      - task:
          type: text-generation
          name: Plasmid DNA Generation
        metrics:
          - name: pLannotate Hit Rate (t=0.3)
            type: accuracy
            value: 0.68
          - name: pLannotate Hit Rate (t=0.5)
            type: accuracy
            value: 0.688

PlasmidLM-kmer6-GRPO-plannotate

A 19.3M-parameter plasmid DNA generation model, post-trained with GRPO (Group Relative Policy Optimization) using pLannotate biological annotations as the reward signal. Fine-tuned from McClain/PlasmidLM-kmer6.

What's New vs Base Model

This model was post-trained with reinforcement learning to improve the biological accuracy of generated plasmid sequences. Instead of only learning sequence statistics, the model was optimized to produce sequences where requested functional elements (antibiotic resistance genes, origins of replication, promoters, etc.) are verifiably present when analyzed by the pLannotate annotation tool.

Metric	Base Model	GRPO-plannotate	Improvement
Overall Hit Rate	59.2%	68.0%	+8.8pp
AMR (Antibiotic Resistance)	63.8%	70.7%	+6.9pp
ORI (Origin of Replication)	73.6%	80.6%	+7.0pp
PROM (Promoters)	66.9%	72.4%	+5.5pp
ELEM (Other Elements)	52.9%	51.0%	-1.9pp
REPORTER	17.6%	17.6%	0.0pp

Evaluated on 50 held-out validation prompts with best-of-3 sampling at temperature 0.3.
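The card does not spell out how best-of-3 hit rate is computed. One plausible reading, sketched below as an assumption, scores each prompt by its best sample (fraction of requested elements found) and averages over prompts. The function names and toy data are hypothetical and are not taken from the evaluation code.

```python
from statistics import mean

def hit_fraction(requested, found):
    """Fraction of requested element names that appear in the found set."""
    requested = set(requested)
    if not requested:
        return 0.0
    return len(requested & set(found)) / len(requested)

def best_of_n_hit_rate(prompts):
    """prompts: list of (requested, samples); samples holds one collection of
    found elements per generated sequence. Each prompt is scored by its best
    sample, then the scores are averaged over prompts."""
    per_prompt = [
        max(hit_fraction(requested, found) for found in samples)
        for requested, samples in prompts
    ]
    return mean(per_prompt)

# Toy data (hypothetical, not from the held-out validation set)
toy = [
    (["AMR_KANAMYCIN", "ORI_COLE1"],
     [["ORI_COLE1"], ["AMR_KANAMYCIN", "ORI_COLE1"], []]),
    (["PROM_T7", "ORI_F1"],
     [["PROM_T7"], [], ["PROM_T7"]]),
]
rate = best_of_n_hit_rate(toy)
print(f"best-of-3 hit rate: {rate:.1%}")
```

If the actual evaluation counts a hit differently (for example per element across all prompts rather than per prompt), the numbers would not be directly comparable to this sketch.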

Model Details

Property	Value
Parameters	19.3M
Architecture	Transformer decoder (dense MLP), LLaMA-style
Hidden size	384
Layers	10
Attention heads	8
Intermediate size	1,536
Max sequence length	16,384 tokens
Tokenizer	k-mer (k=6, stride=3)
Vocab size	4,208
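As a rough cross-check of the table, the parameter count can be estimated from the listed dimensions. The sketch below assumes a two-matrix (non-gated) MLP, no bias terms, one weight vector per normalization layer, and an output head tied to the input embedding. None of these details are stated in this card; the custom model code in the repository is the authority.

```python
def count_params(vocab, hidden, layers, intermediate,
                 gated_mlp=False, tied_embeddings=True):
    """Estimate decoder-only transformer parameters under simple assumptions."""
    embed = vocab * hidden
    attn = 4 * hidden * hidden                 # Q, K, V, O projections, no biases
    mlp = (3 if gated_mlp else 2) * hidden * intermediate
    norms = 2 * hidden                         # two norm weight vectors per layer
    per_layer = attn + mlp + norms
    total = embed + layers * per_layer + hidden  # + final norm weights
    if not tied_embeddings:
        total += vocab * hidden                # separate output projection
    return total

total = count_params(vocab=4208, hidden=384, layers=10, intermediate=1536)
print(f"{total:,} parameters ({total / 1e6:.1f}M)")
```

Under these assumptions the estimate is 19,318,656, which rounds to the stated 19.3M. Switching to a gated three-matrix MLP (`gated_mlp=True`) gives about 25.2M instead, so the agreement is suggestive but not proof of the exact layer layout.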

Training

Pretraining

Data: ~100K plasmid sequences from Addgene, tokenized with k-mer (k=6, stride=3)
Base checkpoint: McClain/PlasmidLM-kmer6 (65K steps, eval loss 0.129)
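To illustrate what k-mer tokenization with k=6 and stride=3 means, here is a minimal sliding-window sketch. It is not the repository's tokenizer: the handling of trailing bases, ambiguous bases such as N, and special tokens is an assumption here and may differ in the real implementation.

```python
def kmer_tokenize(seq, k=6, stride=3):
    """Slide a window of width k over seq, advancing by stride bases.

    Trailing bases that do not fill a whole window are dropped in this sketch.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ATGCATGCATGC")
# 12 bases with k=6, stride=3: windows start at positions 0, 3, 6
print(tokens)

# Number of distinct 6-mers over the alphabet {A, C, G, T}
n_kmers = 4 ** 6
print(n_kmers)
```

With stride 3, consecutive tokens overlap by three bases. There are 4,096 possible 6-mers over A/C/G/T, so if all of them are in the vocabulary, the stated vocab size of 4,208 would leave 112 entries for special and component tokens; that split is an inference, not something the card states.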

GRPO Post-Training

Algorithm: GRPO (Group Relative Policy Optimization)
Reward: pLannotate biological annotation. Generated sequences are annotated with pLannotate; the reward reflects how many requested functional elements are found with ≥ 95% sequence identity.
Uploaded checkpoint: step 800 (of a scheduled 1000-step run that stopped at step 900)
Infrastructure: Anyscale (Ray-based), single GPU, ≈ 17 h wall time
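The reward description above can be sketched as a simple function. This is an illustrative assumption, not the training code: it does not call pLannotate, the `Hit` record is a hypothetical stand-in for parsed annotation rows, and the exact reward shape used in training (for example, partial credit or penalties) is not given in this card.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    element: str             # hypothetical naming, e.g. "AMR_KANAMYCIN"
    percent_identity: float  # sequence identity of the annotation match

def annotation_reward(requested, hits, min_identity=95.0):
    """Fraction of requested elements with a hit at or above min_identity."""
    if not requested:
        return 0.0
    confirmed = {h.element for h in hits if h.percent_identity >= min_identity}
    return sum(1 for name in requested if name in confirmed) / len(requested)

# Toy annotation result: the ORI match falls below the 95% threshold
hits = [
    Hit("AMR_KANAMYCIN", 99.1),
    Hit("ORI_COLE1", 91.0),
    Hit("PROM_T7", 100.0),
]
reward = annotation_reward(["AMR_KANAMYCIN", "ORI_COLE1", "PROM_T7"], hits)
print(f"reward = {reward:.3f}")
```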

The full configuration is included in this repository as training_config.py. Key hyperparameters:

Parameter	Value
Base model	`McClain/PlasmidLM-kmer6`
Precision	bfloat16
KL coefficient	0.5
Clip range (PPO-style)	0.2
Generations per prompt (group size)	8
Micro-batch size	8
Prompt batch size	8
Learning rate	5 × 10⁻⁶
Warmup steps	50
Max grad norm	1.0
Scheduled total steps	1000
Checkpoint every	100 steps
Rollout max new tokens	2500
Rollout temperature	0.3
Rollout top-p	0.95
Hard-token filter on prompts	enabled
Prompt source	`training_pairs_v4.parquet` (Addgene-derived multi-token prompts)
Scorer	`plannotate` (BLAST against `plannotate_db.parquet` + `motif_registry_combined.parquet`)
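For readers unfamiliar with GRPO, the sketch below shows how the group size, clip range, and KL coefficient from the table typically enter the objective: rewards are normalized within each prompt's group of generations, and a PPO-style clipped ratio is penalized by a KL term. This follows the commonly described form of GRPO and is an assumption about this run; the actual loss in training_config.py and the training code may differ (for example in how the standard deviation or KL estimate is computed).

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of generations."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_term(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def objective(ratio, advantage, kl, kl_coef=0.5, clip=0.2):
    """Clipped surrogate minus a KL penalty against the reference policy."""
    return clipped_term(ratio, advantage, clip) - kl_coef * kl

# One group of 8 generations for a single prompt (toy rewards)
rewards = [1.0, 0.5, 0.5, 0.0, 1.0, 0.5, 0.0, 0.5]
adv = group_advantages(rewards)
print([round(a, 2) for a in adv])

# Assumption: each step rolls out prompt batch size x group size sequences
rollouts_per_step = 8 * 8
print(rollouts_per_step)
```

Generations that score above their group's mean receive positive advantages and those below receive negative ones, so the update depends only on relative quality within the group and needs no learned value function.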

Note on reproducibility. W&B logs for this specific training run are not available. The configuration shown above and in training_config.py is the exact file committed to the training repository at commit a5af0da (the commit whose on-disk config names and checkpoint cadence match the uploaded weights byte-for-byte). A full reward-vs-step curve for this exact run cannot be reconstructed without re-running training.

Temperature Sensitivity

Temperature	Hit Rate
0.1	66.8%
0.3	68.0%
0.5	68.8%
0.7	50.7%*
1.0	30.1%*

Values marked * were evaluated on the step_100 checkpoint; step_800 performs better at all temperatures. Recommended range: 0.3 – 0.5.

Usage

import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "McClain/PlasmidLM-kmer6-GRPO-plannotate"
# trust_remote_code is needed because the repository ships custom model/tokenizer code
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Generate a plasmid with kanamycin resistance, ColE1 origin, and T7 promoter
prompt = "<BOS> <AMR_KANAMYCIN> <ORI_COLE1> <PROM_T7> <SEP>"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=3000,
        temperature=0.3,  # recommended range: 0.3-0.5
        do_sample=True,
        top_k=50,
    )

sequence = tokenizer.decode(outputs[0].tolist())
print(sequence)

# Extract just the DNA sequence: drop <...> special tokens, then any non-nucleotide characters
dna = re.sub(r"<[^>]+>", "", sequence.upper())
dna = re.sub(r"[^ATGCN]", "", dna)
print(f"Generated {len(dna)} bp plasmid sequence")

Input Format

<BOS> <TOKEN1> <TOKEN2> ... <SEP>

The model generates k-mer encoded DNA after <SEP> until <EOS> or max length. Spaces between tokens are optional but recommended.

Available Component Tokens

Category	Tokens
Antibiotic Resistance (AMR)	AMPICILLIN, KANAMYCIN, CHLORAMPHENICOL, SPECTINOMYCIN, GENTAMICIN, PUROMYCIN, HYGROMYCIN, BLASTICIDIN, NEOMYCIN, ZEOCIN, TETRACYCLINE
Origin of Replication (ORI)	COLE1, F1, P15A, PSC101, SV40, 2MU, RSF
Promoter (PROM)	CMV, T7, U6, EF1A, CAG, LAC, SV40, AMPR, RSV, SP6, T3
Reporter	EGFP, GFP, MCHERRY, YFP, NANOLUC, LUCIFERASE
Tags	HIS, FLAG, MYC, HA, GST, NLS
Elements (ELEM)	WPRE, POLYA_BGH, POLYA_SV40, CMV_ENHANCER, MCS, LTR_5, LTR_3, PSI, CPPT, AAV_ITR, GRNA_SCAFFOLD

Format: <CATEGORY_NAME>, e.g. <AMR_KANAMYCIN>, <ORI_COLE1>, <PROM_T7>
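A small helper can assemble prompts in this format. The function below is a hypothetical convenience, not part of the repository. It does not check names against the tokenizer vocabulary, and it only uses the AMR, ORI, and PROM prefixes shown in the examples above, since the prefixes for the Reporter and Tags categories are not spelled out here.

```python
def build_prompt(components):
    """Build a conditioning prompt from (category, name) pairs.

    Example pair: ("AMR", "KANAMYCIN") becomes the token <AMR_KANAMYCIN>.
    """
    tags = [f"<{category.upper()}_{name.upper()}>" for category, name in components]
    return " ".join(["<BOS>", *tags, "<SEP>"])

prompt = build_prompt([("AMR", "kanamycin"), ("ORI", "ColE1"), ("PROM", "T7")])
print(prompt)
```

This reproduces the prompt string used in the usage example, with single spaces between tokens as recommended.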

Limitations

Generated sequences are not experimentally validated. Always verify computationally (e.g., with pLannotate) and experimentally before synthesis.
The model was trained on Addgene plasmids, so its outputs are biased toward commonly deposited vectors.
Reporter and Tag categories have low hit rates and may need further RL training.
Maximum context is 16,384 tokens.
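Before running a full annotation pass, a few cheap checks can catch obviously malformed outputs. The sketch below is a hypothetical pre-filter on the extracted DNA string; it is no substitute for pLannotate or experimental validation, and which values count as acceptable is left to the user.

```python
def basic_checks(dna):
    """Return simple descriptive statistics for a generated DNA string."""
    dna = dna.upper()
    length = len(dna)
    invalid = sum(1 for base in dna if base not in "ACGTN")
    gc = sum(1 for base in dna if base in "GC")
    return {
        "length_bp": length,
        "invalid_chars": invalid,          # characters outside A/C/G/T/N
        "n_count": dna.count("N"),         # ambiguous bases
        "gc_fraction": gc / length if length else 0.0,
    }

report = basic_checks("ATGCGCNNAT")
print(report)
```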

Citation

@misc{thiel2026plasmidlm,
  title={PlasmidLM: Language Models for Conditional Plasmid DNA Generation with Reinforcement Learning},
  author={Thiel, McClain},
  year={2026},
  url={https://huggingface.co/McClain/PlasmidLM-kmer6-GRPO-plannotate}
}