---
library_name: transformers
license: apache-2.0
tags:
- biology
- genomics
- plasmid
- dna
- causal-lm
- synthetic-biology
language:
- en
pipeline_tag: text-generation
---

# PlasmidLM

A 17.7M-parameter autoregressive language model for **plasmid DNA sequence generation**, trained on ~108K plasmid sequences from Addgene.

## Model Details

| Property | Value |
|---|---|
| Parameters | 17.7M |
| Architecture | Transformer decoder (dense MLP), LLaMA-style |
| Hidden size | 384 |
| Layers | 10 |
| Attention heads | 8 |
| Intermediate size | 1,536 |
| Max sequence length | 16,384 tokens |
| Tokenizer | Character-level (single DNA bases) |
| Vocab size | 120 |

### Training

- **Data**: ~108K plasmid sequences from Addgene, annotated with functional components via pLannotate
- **Steps**: 15,000
- **Eval loss**: 0.093
- **Token accuracy**: 96.1%

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)

# Condition on antibiotic resistance + origin of replication
prompt = ""
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0]))
```

The model generates plasmid DNA sequences conditioned on functional annotations (antibiotic resistance markers, origins of replication, promoters, reporters, etc.) provided as special tokens in the prompt.

## Input Format

```
...
```

The model generates DNA bases (A/T/C/G) after the `` token until it produces `` or reaches the maximum length.
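Decoded generations may interleave the prompt's special tokens and whitespace with the DNA itself. As a minimal post-processing sketch (plain Python; `extract_dna` is an illustrative helper, not part of the PlasmidLM tooling, and it simply filters characters rather than parsing the annotation grammar):

```python
def extract_dna(decoded: str) -> str:
    """Keep only canonical DNA bases (A/T/C/G) from decoded model output.

    Anything else -- special tokens, whitespace, newlines -- is discarded.
    """
    return "".join(ch for ch in decoded.upper() if ch in "ATCG")

# Hypothetical decoded string with mixed case and whitespace:
print(extract_dna("ATG cga\nTT"))  # -> ATGCGATT
```

A stricter pipeline could instead split the decoded string on the separator token and validate each region, but a character filter is often enough for quick inspection.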
## Special Tokens

| Token | Purpose |
|---|---|
| `` | Beginning of sequence |
| `` | End of sequence |
| `` | Separator between prompt annotations and DNA sequence |
| `` | Padding |
| `` | Antibiotic resistance markers (e.g., ``, ``) |
| `` | Origins of replication (e.g., ``, ``) |
| `` | Promoters (e.g., ``, ``) |
| `` | Reporters (e.g., ``, ``) |

## Related Models

- [McClain/PlasmidLM-kmer6](https://huggingface.co/McClain/PlasmidLM-kmer6) — kmer6 tokenizer, 19.3M params, dense
- [McClain/PlasmidLM-kmer6-MoE](https://huggingface.co/McClain/PlasmidLM-kmer6-MoE) — kmer6 tokenizer, 78.3M total params, Mixture-of-Experts

## Limitations

- This is a **pretrained base model**: generated sequences are not optimized for functional element placement. Post-training with RL improves fidelity.
- Generated sequences are **not experimentally validated**. Always verify computationally and experimentally before synthesis.
- Trained on Addgene plasmids, which are biased toward commonly deposited vectors.
- Maximum context of 16K tokens (~16 kbp).

## Citation

```bibtex
@misc{thiel2026plasmidlm,
  title={PlasmidLM: Language Models for Plasmid DNA Generation},
  author={Thiel, McClain},
  year={2026}
}
```
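The limitations above recommend computational verification before synthesis. As a starting point, here is a minimal sanity-check sketch (plain Python; `basic_checks` and its criteria are illustrative assumptions, not part of the PlasmidLM tooling — real validation would also involve annotation tools such as pLannotate):

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def basic_checks(seq: str, max_len: int = 16_384) -> dict:
    """Cheap checks on a generated sequence before deeper in-silico analysis.

    max_len defaults to the model's 16,384-token context (~16 kbp at one
    base per token).
    """
    seq = seq.upper()
    return {
        "length_ok": 0 < len(seq) <= max_len,      # non-empty, within context
        "bases_ok": set(seq) <= set("ATCG"),       # only canonical bases
        "gc_content": round(gc_content(seq), 3),   # flag extreme GC skew
    }

print(basic_checks("ATGCATGCGG"))
# -> {'length_ok': True, 'bases_ok': True, 'gc_content': 0.6}
```

Passing these checks says nothing about biological function; experimental validation is still required.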