
PlasmidLM

A 17M-parameter transformer language model for conditional generation of synthetic plasmid DNA sequences.

Model Description

PlasmidLM generates plasmid DNA sequences conditioned on functional component specifications. Given a prompt specifying desired elements (antibiotic resistance genes, origins of replication, promoters, reporters, etc.), it autoregressively generates a complete DNA sequence containing those elements.

Architecture: LLaMA-style transformer decoder with RoPE, RMSNorm, and GELU activations.

Parameter        Value
---------------  --------------
Parameters       17M
Hidden size      384
Layers           10
Attention heads  8
Context length   16,384 tokens
Vocabulary       120 tokens

The vocabulary consists of 5 nucleotide tokens (the four bases A, T, C, G plus the ambiguity code N), control tokens (BOS, EOS, SEP, PAD, UNK), and ~100 categorical tokens representing functional plasmid components (e.g., <AMR_KANAMYCIN>, <ORI_COLE1>, <PROM_T7>).
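Conceptually, input strings decompose into bracketed component/control tokens and single-character base tokens. A minimal pure-Python sketch of this split (not the model's actual tokenizer; token names are illustrative):

```python
import re

def split_tokens(text):
    """Split a PlasmidLM-style string into bracketed special tokens
    (e.g. <AMR_KANAMYCIN>) and single-character base tokens (A/T/C/G/N)."""
    return re.findall(r"<[^>]+>|[ATCGN]", text)

tokens = split_tokens("<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>ATGC")
# -> ['<BOS>', '<AMR_KANAMYCIN>', '<ORI_COLE1>', '<SEP>', 'A', 'T', 'G', 'C']
```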

Training

Pretrained with causal language modeling on ~108K plasmid sequences derived from the Addgene repository, annotated with functional components via pLannotate.

  • Steps: 15,000
  • Epochs: ~2.3
  • Eval loss: 0.093
  • Token accuracy: 96.1%
  • Optimizer: AdamW
  • Precision: bf16

Intended Use

This is a base pretrained model. It has learned the statistical patterns of plasmid DNA sequences and their relationship to categorical component tokens. It can be used for:

  • Direct generation: Prompt with component tokens to generate plasmid sequences
  • Fine-tuning: Post-train with reinforcement learning (GRPO/PPO) to improve motif placement accuracy
  • Embeddings: Use hidden states as learned representations of plasmid sequences
  • Research: Study the learned structure of synthetic DNA
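For the embeddings use case, a common approach is to mean-pool the final hidden states across sequence positions into one fixed-size vector. A library-agnostic sketch on plain lists (in practice you would pool the tensor returned by the model with output_hidden_states=True; the pooling choice itself is an assumption, not something specified by this model card):

```python
def mean_pool(hidden_states):
    """Average per-token hidden vectors (seq_len x hidden_size lists)
    into a single fixed-size sequence embedding."""
    seq_len = len(hidden_states)
    hidden_size = len(hidden_states[0])
    return [sum(vec[d] for vec in hidden_states) / seq_len
            for d in range(hidden_size)]

# Three token vectors of size 2 -> one 2-dim embedding
emb = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# -> [3.0, 4.0]
```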

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)

# Generate a plasmid with kanamycin resistance and ColE1 origin
prompt = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.8, do_sample=True)
sequence = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(sequence)

Input Format

<BOS><TOKEN1><TOKEN2>...<SEP>

The model generates DNA bases (A/T/C/G) after the <SEP> token until it produces <EOS> or hits the maximum length.
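Given that format, the generated DNA payload can be recovered from the decoded output by taking everything after the last <SEP> and truncating at <EOS>. A small helper sketch (the token strings are those described above):

```python
def extract_dna(decoded):
    """Pull the DNA payload out of a decoded PlasmidLM output:
    everything after the last <SEP>, truncated at <EOS> if present."""
    body = decoded.rsplit("<SEP>", 1)[-1]
    return body.split("<EOS>", 1)[0]

out = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>ATGCATGC<EOS>"
print(extract_dna(out))  # ATGCATGC
```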

Component Categories

Category                     Examples                                     Count
---------------------------  -------------------------------------------  -----
Antibiotic Resistance (AMR)  Kanamycin, Ampicillin, Chloramphenicol, ...  11
Origin of Replication (ORI)  ColE1, F1, P15A, pSC101, SV40, ...           7
Promoter (PROM)              CMV, T7, U6, EF1a, CAG, ...                  11
Reporter                     EGFP, mCherry, YFP, NanoLuc, ...             6
Vector Type (VEC)            Lentiviral, CRISPR, Bacterial, AAV, ...      10
Other                        Tags, elements, species, backbones           ~55
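Component tokens carry their category as a name prefix (e.g. <AMR_...>, <ORI_...>), so the category of a token can be recovered by string parsing. A small sketch using the prefixes from the table above (the mapping is illustrative, not exhaustive):

```python
def component_category(token):
    """Map a component token like <AMR_KANAMYCIN> to its category,
    based on the prefix naming convention shown in the table."""
    categories = {
        "AMR": "Antibiotic Resistance",
        "ORI": "Origin of Replication",
        "PROM": "Promoter",
        "VEC": "Vector Type",
    }
    prefix = token.strip("<>").split("_", 1)[0]
    return categories.get(prefix, "Other")

print(component_category("<ORI_COLE1>"))  # Origin of Replication
```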

Limitations

  • This is a pretrained base model: it has learned sequence statistics but has not been optimized for motif placement accuracy. Post-training with RL significantly improves functional element fidelity.
  • Generated sequences are not experimentally validated. Always verify computationally (e.g., with pLannotate) and experimentally before synthesis.
  • The model was trained on Addgene plasmids, which are biased toward commonly deposited vectors (mammalian expression, bacterial cloning, CRISPR).
  • Maximum context of 16K tokens (~16 kbp), which covers most but not all plasmids.
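Since each base occupies one token, a quick feasibility check is whether the prompt plus the target sequence length fits the 16,384-token window. A sketch (the token accounting of BOS + components + SEP + one token per base is an assumption based on the vocabulary described above):

```python
def fits_context(dna_len, n_components, context=16384):
    """Check whether <BOS> + component tokens + <SEP> plus a
    one-token-per-base DNA sequence fits in the context window."""
    return (2 + n_components + dna_len) <= context

print(fits_context(10_000, 4))  # True: a typical ~10 kbp plasmid fits
print(fits_context(16_383, 4))  # False: prompt overhead exceeds the window
```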

Citation

@misc{thiel2026plasmidlm,
  title={PlasmidLM: Language Models for Conditional Plasmid DNA Generation},
  author={Thiel, McClain},
  year={2026},
  url={https://huggingface.co/McClain/PlasmidLM}
}