---
library_name: transformers
license: apache-2.0
tags:
- biology
- genomics
- plasmid
- dna
- causal-lm
- synthetic-biology
language:
- en
pipeline_tag: text-generation
---
# PlasmidLM
A 17.7M parameter autoregressive language model for **plasmid DNA sequence generation**, trained on ~108K plasmid sequences from Addgene.
## Model Details
| Property | Value |
|---|---|
| Parameters | 17.7M |
| Architecture | Transformer decoder (dense MLP), LLaMA-style |
| Hidden size | 384 |
| Layers | 10 |
| Attention heads | 8 |
| Intermediate size | 1,536 |
| Max sequence length | 16,384 tokens |
| Tokenizer | Character-level (single DNA bases) |
| Vocab size | 120 |
### Training
- **Data**: ~108K plasmid sequences from Addgene, annotated with functional components via pLannotate
- **Steps**: 15,000
- **Eval loss**: 0.093
- **Token accuracy**: 96.1%
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)
# Condition on antibiotic resistance + origin of replication
prompt = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.8, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0].tolist()))
```
The model generates plasmid DNA sequences conditioned on functional annotations (antibiotic resistance markers, origins of replication, promoters, reporters, etc.) provided as special tokens in the prompt.
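Because the decoded output retains special tokens, the generated DNA can be recovered by splitting on `<SEP>` and trimming any trailing `<EOS>`. A minimal sketch (the `extract_dna` helper is illustrative, not part of the model's API):

```python
def extract_dna(decoded: str) -> str:
    """Return the DNA bases generated after <SEP>, stripping <EOS> if present."""
    # Everything after the first <SEP> is the generated sequence.
    _, _, seq = decoded.partition("<SEP>")
    return seq.split("<EOS>")[0].strip()

example = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>ATGCGT<EOS>"
print(extract_dna(example))  # -> ATGCGT
```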
## Input Format
```
<BOS><TOKEN1><TOKEN2>...<SEP>
```
The model generates DNA bases (A/T/C/G) after the `<SEP>` token until it produces `<EOS>` or hits the maximum length.
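Before downstream use, a quick sanity check that the generated body contains only canonical bases can catch malformed outputs. A hypothetical validator, assuming the A/T/C/G alphabet described above:

```python
VALID_BASES = set("ATCG")

def is_valid_dna(seq: str) -> bool:
    """True if seq is non-empty and contains only A/T/C/G (case-insensitive)."""
    return bool(seq) and set(seq.upper()) <= VALID_BASES

print(is_valid_dna("ATGCGT"))  # True
print(is_valid_dna("ATGNNT"))  # False: 'N' is not a canonical base
```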
## Special Tokens
| Token | Purpose |
|---|---|
| `<BOS>` | Beginning of sequence |
| `<EOS>` | End of sequence |
| `<SEP>` | Separator between prompt annotations and DNA sequence |
| `<PAD>` | Padding |
| `<AMR_*>` | Antibiotic resistance markers (e.g., `<AMR_KANAMYCIN>`, `<AMR_AMPICILLIN>`) |
| `<ORI_*>` | Origins of replication (e.g., `<ORI_COLE1>`, `<ORI_P15A>`) |
| `<PROM_*>` | Promoters (e.g., `<PROM_CMV>`, `<PROM_T7>`) |
| `<REP_*>` | Reporters (e.g., `<REP_EGFP>`, `<REP_MCHERRY>`) |
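A conditioning prompt is simply a concatenation of these tokens between `<BOS>` and `<SEP>`. A small hypothetical helper (the `build_prompt` name is not part of the model's API) makes that explicit:

```python
def build_prompt(*annotations: str) -> str:
    """Assemble a conditioning prompt: <BOS>, annotation tokens, then <SEP>."""
    return "<BOS>" + "".join(annotations) + "<SEP>"

prompt = build_prompt("<AMR_AMPICILLIN>", "<PROM_T7>", "<REP_EGFP>")
print(prompt)  # <BOS><AMR_AMPICILLIN><PROM_T7><REP_EGFP><SEP>
```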
## Related Models
- [McClain/PlasmidLM-kmer6](https://huggingface.co/McClain/PlasmidLM-kmer6) — kmer6 tokenizer, 19.3M params, dense
- [McClain/PlasmidLM-kmer6-MoE](https://huggingface.co/McClain/PlasmidLM-kmer6-MoE) — kmer6 tokenizer, 78.3M total params, Mixture-of-Experts
## Limitations
- This is a **pretrained base model**: generated sequences are not optimized for functional element placement. Post-training with reinforcement learning improves placement fidelity.
- Generated sequences are **not experimentally validated**. Always verify computationally and experimentally before synthesis.
- Trained on Addgene plasmids, which are biased toward commonly deposited vectors.
- Maximum context of 16K tokens (~16 kbp).
## Citation
```bibtex
@misc{thiel2026plasmidlm,
title={PlasmidLM: Language Models for Plasmid DNA Generation},
author={Thiel, McClain},
year={2026}
}
```