# PlasmidLM
A 17M-parameter transformer language model for conditional generation of synthetic plasmid DNA sequences.
## Model Description
PlasmidLM generates plasmid DNA sequences conditioned on functional component specifications. Given a prompt specifying desired elements (antibiotic resistance genes, origins of replication, promoters, reporters, etc.), it autoregressively generates a complete DNA sequence containing those elements.
**Architecture:** LLaMA-style transformer decoder with RoPE, RMSNorm, and GELU activations.
| Parameter | Value |
|---|---|
| Parameters | 17M |
| Hidden size | 384 |
| Layers | 10 |
| Attention heads | 8 |
| Context length | 16,384 tokens |
| Vocabulary | 120 tokens |
The vocabulary consists of the five nucleotide symbols (A, T, C, G, and the ambiguity code N), control tokens (`<BOS>`, `<EOS>`, `<SEP>`, `<PAD>`, `<UNK>`), and ~100 categorical tokens representing functional plasmid components (e.g., `<AMR_KANAMYCIN>`, `<ORI_COLE1>`, `<PROM_T7>`).
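To make the vocabulary layout concrete, here is a minimal sketch of how a prompt might map to token IDs. The token strings follow the card, but the integer IDs, the component subset, and the `encode` helper are illustrative assumptions, not the model's actual tokenizer.

```python
# Illustrative vocabulary sketch -- token IDs and encode() are hypothetical;
# the real mapping lives in the released tokenizer.
SPECIAL = ["<PAD>", "<BOS>", "<EOS>", "<SEP>", "<UNK>"]
BASES = ["A", "T", "C", "G", "N"]
COMPONENTS = ["<AMR_KANAMYCIN>", "<ORI_COLE1>", "<PROM_T7>"]  # ~100 in the full vocab

vocab = {tok: i for i, tok in enumerate(SPECIAL + BASES + COMPONENTS)}

def encode(tokens):
    """Map a list of tokens to integer IDs, falling back to <UNK>."""
    return [vocab.get(t, vocab["<UNK>"]) for t in tokens]

ids = encode(["<BOS>", "<AMR_KANAMYCIN>", "<ORI_COLE1>", "<SEP>", "A", "T"])
```

The point is only that bases and components live in one flat vocabulary, so a prompt and its DNA continuation are a single token stream.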
## Training
The model was pretrained with a causal language modeling objective on ~108K plasmid sequences derived from the Addgene repository, annotated with functional components via pLannotate.
- Steps: 15,000
- Epochs: ~2.3
- Eval loss: 0.093
- Token accuracy: 96.1%
- Optimizer: AdamW
- Precision: bf16
## Intended Use
This is a base pretrained model. It has learned the statistical patterns of plasmid DNA sequences and their relationship to categorical component tokens. It can be used for:
- Direct generation: Prompt with component tokens to generate plasmid sequences
- Fine-tuning: Post-train with reinforcement learning (GRPO/PPO) to improve motif placement accuracy
- Embeddings: Use hidden states as learned representations of plasmid sequences
- Research: Study the learned structure of synthetic DNA
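For the embeddings use case, one common recipe is to run a forward pass with `output_hidden_states=True` and mean-pool the final layer over the sequence. The sketch below uses NumPy on a dummy array standing in for the real hidden states (batch × seq_len × hidden_size = 384); only the masked pooling step is being illustrated.

```python
import numpy as np

# Dummy stand-in for model(...).hidden_states[-1] from a forward pass with
# output_hidden_states=True: shape (batch, seq_len, hidden_size=384).
hidden = np.random.randn(1, 128, 384)
mask = np.ones((1, 128))  # attention mask: 1 for real tokens, 0 for padding

# Masked mean-pool over the sequence dimension -> one vector per plasmid.
summed = (hidden * mask[..., None]).sum(axis=1)
embedding = summed / mask.sum(axis=1, keepdims=True)
print(embedding.shape)  # (1, 384)
```

Masking matters when batching plasmids of different lengths, so padded positions do not dilute the average.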
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)

# Generate a plasmid with kanamycin resistance and a ColE1 origin
prompt = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.8,
)
sequence = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(sequence)
```
## Input Format

```
<BOS><TOKEN1><TOKEN2>...<SEP>
```
The model generates DNA bases (A/T/C/G) after the `<SEP>` token until it produces `<EOS>` or hits the maximum length.
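A small helper can recover the raw DNA from a decoded generation by splitting on the control tokens. `extract_dna` below is an illustrative post-processing sketch, not part of the released package.

```python
def extract_dna(decoded: str) -> str:
    """Return only the A/T/C/G/N payload between <SEP> and <EOS>."""
    body = decoded.split("<SEP>", 1)[-1]   # drop the component prompt
    body = body.split("<EOS>", 1)[0]       # drop anything after <EOS>
    return "".join(c for c in body if c in "ATCGN")

seq = extract_dna("<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>ATGCGT<EOS>")
print(seq)  # ATGCGT
```

Splitting on `<SEP>` first matters: component token names themselves contain letters like A, C, G, and N, so a naive character filter over the whole string would pick them up.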
## Component Categories
| Category | Examples | Count |
|---|---|---|
| Antibiotic Resistance (AMR) | Kanamycin, Ampicillin, Chloramphenicol, ... | 11 |
| Origin of Replication (ORI) | ColE1, F1, P15A, pSC101, SV40, ... | 7 |
| Promoter (PROM) | CMV, T7, U6, EF1a, CAG, ... | 11 |
| Reporter | EGFP, mCherry, YFP, NanoLuc, ... | 6 |
| Vector Type (VEC) | Lentiviral, CRISPR, Bacterial, AAV, ... | 10 |
| Other | Tags, elements, species, backbones | ~55 |
## Limitations
- This is a pretrained base model: it has learned sequence statistics but has not been optimized for motif placement accuracy. Post-training with RL (e.g., GRPO/PPO) significantly improves functional-element fidelity.
- Generated sequences are not experimentally validated. Always verify computationally (e.g., with pLannotate) and experimentally before synthesis.
- The model was trained on Addgene plasmids, which are biased toward commonly deposited vectors (mammalian expression, bacterial cloning, CRISPR).
- The maximum context is 16K tokens (~16 kbp), which covers most but not all plasmids.
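As a cheap first-pass computational check before deeper annotation, generated sequences can be screened for basic plausibility: a valid alphabet, a length within the ~16 kbp context, and non-degenerate GC content. The thresholds below are illustrative assumptions; this pre-filter does not replace annotation tools such as pLannotate or experimental validation.

```python
def basic_sanity_check(seq: str, max_len: int = 16_384) -> bool:
    """Illustrative pre-filter for generated plasmid sequences."""
    if not seq or len(seq) > max_len:
        return False  # empty, or longer than the model's context window
    if any(c not in "ATCGN" for c in seq):
        return False  # unexpected characters in the sequence
    gc = sum(c in "GC" for c in seq) / len(seq)
    return 0.2 <= gc <= 0.8  # flag extreme GC content

print(basic_sanity_check("ATGC" * 1000))  # True
```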
## Citation
```bibtex
@misc{thiel2026plasmidlm,
  title={PlasmidLM: Language Models for Conditional Plasmid DNA Generation},
  author={Thiel, McClain},
  year={2026},
  url={https://huggingface.co/McClain/PlasmidLM}
}
```
## Evaluation Results

| Metric | Value (self-reported) |
|---|---|
| Eval loss | 0.093 |
| Token accuracy | 0.961 |