
# PlasmidLM-kmer6-GRPO-plannotate

A 19.3M-parameter plasmid DNA generation model, post-trained with GRPO (Group Relative Policy Optimization) using pLannotate biological annotations as the reward signal. Fine-tuned from McClain/PlasmidLM-kmer6.

## What's New vs Base Model

This model was post-trained with reinforcement learning to improve the biological accuracy of generated plasmid sequences. Instead of only learning sequence statistics, the model was optimized to produce sequences where requested functional elements (antibiotic resistance genes, origins of replication, promoters, etc.) are verifiably present when analyzed by the pLannotate annotation tool.

| Metric | Base Model | GRPO-plannotate | Improvement |
|---|---|---|---|
| Overall Hit Rate | 59.2% | 68.0% | +8.8pp |
| AMR (Antibiotic Resistance) | 63.8% | 70.7% | +6.9pp |
| ORI (Origin of Replication) | 73.6% | 80.6% | +7.0pp |
| PROM (Promoters) | 66.9% | 72.4% | +5.5pp |
| ELEM (Other Elements) | 52.9% | 51.0% | -1.9pp |
| REPORTER | 17.6% | 17.6% | 0pp |

Evaluated on 50 held-out validation prompts with best-of-3 sampling at temperature 0.3.
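The best-of-3 protocol above can be sketched as follows. This is a minimal illustration, not the actual evaluation harness: `generate_sequence` and `score_with_plannotate` are hypothetical placeholders standing in for model sampling and pLannotate-based scoring.

```python
import random

def generate_sequence(prompt: str, temperature: float) -> str:
    # Placeholder: in practice this would call model.generate(...)
    return "".join(random.choice("ATGC") for _ in range(60))

def score_with_plannotate(sequence: str, requested: list[str]) -> float:
    # Placeholder: fraction of requested elements found by annotation
    return random.random()

def best_of_n(prompt: str, requested: list[str], n: int = 3, temperature: float = 0.3):
    """Sample n candidates and keep the one with the highest hit rate."""
    candidates = [generate_sequence(prompt, temperature) for _ in range(n)]
    scored = [(score_with_plannotate(s, requested), s) for s in candidates]
    return max(scored)

score, seq = best_of_n("<BOS> <AMR_KANAMYCIN> <SEP>", ["AMR_KANAMYCIN"])
```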

## Model Details

| Property | Value |
|---|---|
| Parameters | 19.3M |
| Architecture | Transformer decoder (dense MLP), LLaMA-style |
| Hidden size | 384 |
| Layers | 10 |
| Attention heads | 8 |
| Intermediate size | 1,536 |
| Max sequence length | 16,384 tokens |
| Tokenizer | k-mer (k=6, stride=3) |
| Vocab size | 4,208 |
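The k-mer windowing implied by the tokenizer parameters above (k=6, stride=3) can be sketched in a few lines. This only illustrates how a DNA string is split into overlapping 6-mers; the real tokenizer's vocabulary mapping and special-token handling are not shown.

```python
def kmer_tokenize(seq: str, k: int = 6, stride: int = 3) -> list[str]:
    """Split a DNA sequence into overlapping k-mers, stepping `stride` bases."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ATGCGTACGT")
print(tokens)  # -> ['ATGCGT', 'CGTACG']
```

With stride 3 < k, consecutive tokens overlap by 3 bases, so each base (after the first window) is seen by two tokens.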

## Training

### Pretraining

- Data: ~100K plasmid sequences from Addgene, tokenized with k-mers (k=6, stride=3)
- Base checkpoint: McClain/PlasmidLM-kmer6 (65K steps, eval loss 0.129)

### GRPO Post-Training

- Algorithm: GRPO (Group Relative Policy Optimization)
- Reward: pLannotate biological annotation — generated sequences are annotated with pLannotate, and the reward reflects how many requested functional elements are found with >= 95% sequence identity
- Steps: 800
- Infrastructure: Anyscale (Ray-based distributed training)
- W&B Run: mcclain/PlasmidLLM/runs/sil7t16f
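The reward described above can be sketched as the fraction of requested elements recovered at the identity threshold. The hit-record shape used here (a `name` plus a percent-identity `pident`) is an assumption for illustration; pLannotate's actual output is a richer annotation table.

```python
def annotation_reward(requested: list[str], hits: list[dict], min_identity: float = 95.0) -> float:
    """Fraction of requested elements found among annotation hits at >= min_identity."""
    found = {h["name"] for h in hits if h["pident"] >= min_identity}
    matched = sum(1 for r in requested if r in found)
    return matched / len(requested) if requested else 0.0

hits = [
    {"name": "KanR", "pident": 99.2},
    {"name": "ColE1 ori", "pident": 97.5},
    {"name": "T7 promoter", "pident": 88.0},  # below threshold, not counted
]
reward = annotation_reward(["KanR", "ColE1 ori", "T7 promoter"], hits)
print(reward)  # 2 of 3 requested elements found
```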

## Temperature Sensitivity

| Temperature | Hit Rate |
|---|---|
| 0.1 | 66.8% |
| 0.3 | 68.0% |
| 0.5 | 68.8% |
| 0.7 | 50.7%* |
| 1.0 | 30.1%* |

Evaluated on the step_100 checkpoint; the step_800 checkpoint performs better at all temperatures. Recommended range: 0.3–0.5.

## Usage

```python
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "McClain/PlasmidLM-kmer6-GRPO-plannotate"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Generate a plasmid with kanamycin resistance, ColE1 origin, and T7 promoter
prompt = "<BOS> <AMR_KANAMYCIN> <ORI_COLE1> <PROM_T7> <SEP>"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=3000,
        temperature=0.3,
        do_sample=True,
        top_k=50,
    )

sequence = tokenizer.decode(outputs[0].tolist())
print(sequence)

# Extract just the DNA sequence: strip special tokens, then any non-ATGCN characters
dna = re.sub(r"<[^>]+>", "", sequence.upper())
dna = re.sub(r"[^ATGCN]", "", dna)
print(f"Generated {len(dna)} bp plasmid sequence")
```

## Input Format

```
<BOS> <TOKEN1> <TOKEN2> ... <SEP>
```

The model generates k-mer-encoded DNA after `<SEP>` until it emits `<EOS>` or reaches the maximum length. Spaces between tokens are optional but recommended.

### Available Component Tokens

| Category | Tokens |
|---|---|
| Antibiotic Resistance (AMR) | AMPICILLIN, KANAMYCIN, CHLORAMPHENICOL, SPECTINOMYCIN, GENTAMICIN, PUROMYCIN, HYGROMYCIN, BLASTICIDIN, NEOMYCIN, ZEOCIN, TETRACYCLINE |
| Origin of Replication (ORI) | COLE1, F1, P15A, PSC101, SV40, 2MU, RSF |
| Promoter (PROM) | CMV, T7, U6, EF1A, CAG, LAC, SV40, AMPR, RSV, SP6, T3 |
| Reporter | EGFP, GFP, MCHERRY, YFP, NANOLUC, LUCIFERASE |
| Tags | HIS, FLAG, MYC, HA, GST, NLS |
| Elements (ELEM) | WPRE, POLYA_BGH, POLYA_SV40, CMV_ENHANCER, MCS, LTR_5, LTR_3, PSI, CPPT, AAV_ITR, GRNA_SCAFFOLD |

Format: `<CATEGORY_NAME>`, e.g. `<AMR_KANAMYCIN>`, `<ORI_COLE1>`, `<PROM_T7>`
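A small helper can assemble prompts following the `<CATEGORY_NAME>` convention above. `build_prompt` is a hypothetical convenience function, not part of the model's tokenizer; it simply concatenates the special tokens in the documented order.

```python
def build_prompt(components: dict[str, list[str]]) -> str:
    """Build a conditioning prompt from {category: [names]} using <CATEGORY_NAME> tokens."""
    tokens = [f"<{cat}_{name}>" for cat, names in components.items() for name in names]
    return "<BOS> " + " ".join(tokens) + " <SEP>"

prompt = build_prompt({"AMR": ["KANAMYCIN"], "ORI": ["COLE1"], "PROM": ["T7"]})
print(prompt)  # <BOS> <AMR_KANAMYCIN> <ORI_COLE1> <PROM_T7> <SEP>
```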

## Limitations

- Generated sequences are not experimentally validated. Always verify computationally (e.g., with pLannotate) and experimentally before synthesis.
- The model was trained on Addgene plasmids and is biased toward commonly deposited vectors.
- The Reporter and Tag categories have low hit rates and may need further RL training.
- Maximum context length is 16K tokens.

## Citation

```bibtex
@misc{thiel2026plasmidlm,
  title={PlasmidLM: Language Models for Conditional Plasmid DNA Generation with Reinforcement Learning},
  author={Thiel, McClain},
  year={2026},
  url={https://huggingface.co/McClain/PlasmidLM-kmer6-GRPO-plannotate}
}
```