metadata
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - biology
  - genomics
  - dna
  - plasmid
  - synthetic-biology
  - causal-lm
  - reinforcement-learning
  - grpo
  - custom_code
base_model: McClain/PlasmidLM-kmer6
datasets:
  - custom
pipeline_tag: text-generation
model-index:
  - name: PlasmidLM-kmer6-GRPO-plannotate
    results:
      - task:
          type: text-generation
          name: Plasmid DNA Generation
        metrics:
          - name: pLannotate Hit Rate (t=0.3)
            type: accuracy
            value: 0.68
          - name: pLannotate Hit Rate (t=0.5)
            type: accuracy
            value: 0.688

PlasmidLM-kmer6-GRPO-plannotate

A 19.3M-parameter plasmid DNA generation model, post-trained with GRPO (Group Relative Policy Optimization) using pLannotate biological annotations as the reward signal. Fine-tuned from McClain/PlasmidLM-kmer6.

What's New vs Base Model

This model was post-trained with reinforcement learning to improve the biological accuracy of generated plasmid sequences. Instead of only learning sequence statistics, the model was optimized to produce sequences where requested functional elements (antibiotic resistance genes, origins of replication, promoters, etc.) are verifiably present when analyzed by the pLannotate annotation tool.

| Metric | Base Model | GRPO-plannotate | Improvement |
|---|---|---|---|
| Overall Hit Rate | 59.2% | 68.0% | +8.8pp |
| AMR (Antibiotic Resistance) | 63.8% | 70.7% | +6.9pp |
| ORI (Origin of Replication) | 73.6% | 80.6% | +7.0pp |
| PROM (Promoters) | 66.9% | 72.4% | +5.5pp |
| ELEM (Other Elements) | 52.9% | 51.0% | -1.9pp |
| REPORTER | 17.6% | 17.6% | 0pp |

Evaluated on 50 held-out validation prompts with best-of-3 sampling at temperature 0.3.
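How best-of-3 scores are aggregated into a hit rate is not spelled out here. The sketch below shows one plausible reading, offered as an assumption: each sample is scored by the fraction of requested elements found, the best of the n samples is kept per prompt, and the hit rate is the mean over prompts. The function name and the toy scores are hypothetical.

```python
from statistics import mean


def best_of_n_hit_rate(scores_per_prompt):
    """Mean over prompts of the best per-sample score.

    scores_per_prompt: list of lists; each inner list holds one score in [0, 1]
    per sample (assumed to be the fraction of requested elements found).
    """
    if not scores_per_prompt:
        raise ValueError("need at least one prompt")
    return mean(max(samples) for samples in scores_per_prompt)


# Toy data (made up): 2 prompts, 3 samples each.
toy = [[0.0, 0.5, 1.0], [0.5, 0.5, 0.0]]
print(best_of_n_hit_rate(toy))  # 0.75
```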

Model Details

| Property | Value |
|---|---|
| Parameters | 19.3M |
| Architecture | Transformer decoder (dense MLP), LLaMA-style |
| Hidden size | 384 |
| Layers | 10 |
| Attention heads | 8 |
| Intermediate size | 1,536 |
| Max sequence length | 16,384 tokens |
| Tokenizer | k-mer (k=6, stride=3) |
| Vocab size | 4,208 |
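As a sanity check on the table, the parameter count can be estimated from the listed dimensions. The sketch below assumes bias-free Q/K/V/O projections, a two-matrix (ungated) MLP, weight-only norms, and input embeddings tied to the output head. These are assumptions chosen because they reproduce the stated figure; the actual layer layout is defined by the repository's custom code.

```python
def estimate_params(vocab=4208, hidden=384, layers=10, intermediate=1536):
    """Rough decoder parameter count under the assumptions stated above."""
    embed = vocab * hidden            # token embeddings (assumed tied with the output head)
    attn = 4 * hidden * hidden        # Q, K, V, O projections, no biases (assumed)
    mlp = 2 * hidden * intermediate   # up and down projections (ungated MLP assumed)
    norms = 2 * hidden                # two weight-only norms per layer (assumed)
    final_norm = hidden               # one norm before the output head (assumed)
    return embed + layers * (attn + mlp + norms) + final_norm


total = estimate_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")  # 19,318,656 parameters (~19.3M)
```

Under the same assumptions, a three-matrix gated MLP would come to about 25.2M. Separately, there are 4^6 = 4,096 possible 6-mers over A/C/G/T, so if every 6-mer is in the vocabulary, 112 of the 4,208 entries are other tokens.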

Training

Pretraining

  • Data: ~100K plasmid sequences from Addgene, tokenized with k-mer (k=6, stride=3)
  • Base checkpoint: McClain/PlasmidLM-kmer6 (65K steps, eval loss 0.129)
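To make "k-mer (k=6, stride=3)" concrete, here is a minimal sketch of sliding-window k-mer splitting. It is not the repository's tokenizer: the helper name is hypothetical, and how the real tokenizer handles overlaps, trailing bases, or non-ACGT characters is not stated here (this sketch drops a trailing window shorter than k).

```python
def kmer_tokenize(seq, k=6, stride=3):
    """Split a DNA string into k-mers taken every `stride` bases.

    Assumption: a trailing window shorter than k is dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


print(kmer_tokenize("ATGCATGCA"))  # ['ATGCAT', 'CATGCA']
```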

GRPO Post-Training

  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Reward: pLannotate biological annotation. Generated sequences are annotated with pLannotate; the reward reflects how many requested functional elements are found with ≥ 95% sequence identity.
  • Uploaded checkpoint: step 800 (of a scheduled 1000-step run that stopped at step 900)
  • Infrastructure: Anyscale (Ray-based), single GPU, ≈ 17 h wall time
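The reward description above can be sketched as a simple scoring function. The exact reward formula is not given, so the fraction-of-requested-elements form below is an assumption, as are the function name and the shape of the `hits` dictionary (element name mapped to percent identity); the real scorer parses pLannotate output.

```python
def plannotate_reward(requested, hits, min_identity=95.0):
    """Fraction of requested elements found at or above `min_identity` percent.

    requested: iterable of element names taken from the prompt (hypothetical labels).
    hits: dict mapping element name -> best percent identity reported for it.
    """
    requested = list(requested)
    if not requested:
        return 0.0
    found = sum(1 for name in requested if hits.get(name, 0.0) >= min_identity)
    return found / len(requested)


# Toy annotation result (made up): two of three requested elements pass the threshold.
hits = {"AMR_KANAMYCIN": 99.1, "ORI_COLE1": 96.4, "PROM_T7": 88.0}
print(plannotate_reward(["AMR_KANAMYCIN", "ORI_COLE1", "PROM_T7"], hits))
```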

The full configuration is included in this repository as training_config.py. Key hyperparameters:

| Parameter | Value |
|---|---|
| Base model | McClain/PlasmidLM-kmer6 |
| Precision | bfloat16 |
| KL coefficient | 0.5 |
| Clip range (PPO-style) | 0.2 |
| Generations per prompt (group size) | 8 |
| Micro-batch size | 8 |
| Prompt batch size | 8 |
| Learning rate | 5 × 10⁻⁶ |
| Warmup steps | 50 |
| Max grad norm | 1.0 |
| Scheduled total steps | 1000 |
| Checkpoint every | 100 steps |
| Rollout max new tokens | 2500 |
| Rollout temperature | 0.3 |
| Rollout top-p | 0.95 |
| Hard-token filter on prompts | enabled |
| Prompt source | training_pairs_v4.parquet (Addgene-derived multi-token prompts) |
| Scorer | plannotate (BLAST against plannotate_db.parquet + motif_registry_combined.parquet) |

Note on reproducibility. W&B logs for this specific training run are not available. The configuration shown above and in training_config.py is the exact file committed to the training repository at commit a5af0da (the commit whose on-disk config names and checkpoint cadence match the uploaded weights byte-for-byte). A full reward-vs-step curve for this exact run cannot be reconstructed without re-running training.

Temperature Sensitivity

| Temperature | Hit Rate |
|---|---|
| 0.1 | 66.8% |
| 0.3 | 68.0% |
| 0.5 | 68.8% |
| 0.7 | 50.7%* |
| 1.0 | 30.1%* |

Evaluated on the step_100 checkpoint; step_800 performs better at all temperatures. Recommended range: 0.3 – 0.5.
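For intuition on why sampling temperature matters, the generic sketch below shows how dividing logits by the temperature sharpens or flattens the next-token distribution. The logits are made up and the helper name is hypothetical; this illustrates the sampling mechanism only and says nothing about this model's actual logits.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up logits for three candidate tokens
for t in (0.3, 1.0):
    top = softmax_with_temperature(logits, t)[0]
    print(f"T={t}: top-token probability {top:.3f}")
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures spread it across alternatives, which makes long sequences more likely to drift.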

Usage

import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "McClain/PlasmidLM-kmer6-GRPO-plannotate"
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present

# trust_remote_code is needed because the model and tokenizer ship custom code
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Generate a plasmid with kanamycin resistance, ColE1 origin, and T7 promoter
prompt = "<BOS> <AMR_KANAMYCIN> <ORI_COLE1> <PROM_T7> <SEP>"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=3000,
        temperature=0.3,
        do_sample=True,
        top_k=50,
    )

sequence = tokenizer.decode(outputs[0].tolist())
print(sequence)

# Extract just the DNA sequence: drop <...> tags, then anything that is not a base
dna = re.sub(r"<[^>]+>", "", sequence.upper())
dna = re.sub(r"[^ATGCN]", "", dna)
print(f"Generated {len(dna)} bp plasmid sequence")

Input Format

<BOS> <TOKEN1> <TOKEN2> ... <SEP>

The model generates k-mer encoded DNA after <SEP> until <EOS> or max length. Spaces between tokens are optional but recommended.

Available Component Tokens

| Category | Tokens |
|---|---|
| Antibiotic Resistance (AMR) | AMPICILLIN, KANAMYCIN, CHLORAMPHENICOL, SPECTINOMYCIN, GENTAMICIN, PUROMYCIN, HYGROMYCIN, BLASTICIDIN, NEOMYCIN, ZEOCIN, TETRACYCLINE |
| Origin of Replication (ORI) | COLE1, F1, P15A, PSC101, SV40, 2MU, RSF |
| Promoter (PROM) | CMV, T7, U6, EF1A, CAG, LAC, SV40, AMPR, RSV, SP6, T3 |
| Reporter | EGFP, GFP, MCHERRY, YFP, NANOLUC, LUCIFERASE |
| Tags | HIS, FLAG, MYC, HA, GST, NLS |
| Elements (ELEM) | WPRE, POLYA_BGH, POLYA_SV40, CMV_ENHANCER, MCS, LTR_5, LTR_3, PSI, CPPT, AAV_ITR, GRNA_SCAFFOLD |

Format: <CATEGORY_NAME>, e.g. <AMR_KANAMYCIN>, <ORI_COLE1>, <PROM_T7>
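The prompt format can be assembled programmatically. The helper below is a hypothetical convenience function, not part of the model repository; it only formats the category and name it is given and does not check them against the table. The token prefixes for the Reporter and Tags categories are not spelled out above, so they are not assumed here.

```python
def build_prompt(components):
    """Format (category, name) pairs as '<BOS> <CAT_NAME> ... <SEP>'."""
    if not components:
        raise ValueError("at least one component is required")
    tokens = [f"<{cat.upper()}_{name.upper()}>" for cat, name in components]
    return " ".join(["<BOS>", *tokens, "<SEP>"])


prompt = build_prompt([("AMR", "KANAMYCIN"), ("ORI", "COLE1"), ("PROM", "T7")])
print(prompt)  # <BOS> <AMR_KANAMYCIN> <ORI_COLE1> <PROM_T7> <SEP>
```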

Limitations

  • Generated sequences are not experimentally validated. Always verify computationally (e.g., with pLannotate) and experimentally before synthesis.
  • The model was trained on Addgene plasmids, so its outputs are biased toward commonly deposited vectors.
  • Reporter and Tag categories have low hit rates and may need further RL training.
  • Maximum context of 16K tokens.

Citation

@misc{thiel2026plasmidlm,
  title={PlasmidLM: Language Models for Conditional Plasmid DNA Generation with Reinforcement Learning},
  author={Thiel, McClain},
  year={2026},
  url={https://huggingface.co/McClain/PlasmidLM-kmer6-GRPO-plannotate}
}