---
library_name: transformers
license: apache-2.0
tags:
- biology
- genomics
- plasmid
- dna
- causal-lm
- synthetic-biology
language:
- en
pipeline_tag: text-generation
---
# PlasmidLM
A 17.7M parameter autoregressive language model for **plasmid DNA sequence generation**, trained on ~108K plasmid sequences from Addgene.
## Model Details
| Property | Value |
|---|---|
| Parameters | 17.7M |
| Architecture | Transformer decoder (dense MLP), LLaMA-style |
| Hidden size | 384 |
| Layers | 10 |
| Attention heads | 8 |
| Intermediate size | 1,536 |
| Max sequence length | 16,384 tokens |
| Tokenizer | Character-level (single DNA bases) |
| Vocab size | 120 |
### Training
- **Data**: ~108K plasmid sequences from Addgene, annotated with functional components via pLannotate
- **Steps**: 15,000
- **Eval loss**: 0.093
- **Token accuracy**: 96.1%
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and its character-level tokenizer
model = AutoModelForCausalLM.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)

# Condition on an antibiotic resistance marker and an origin of replication
prompt = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a plasmid sequence; DNA bases are generated after <SEP>
outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.8, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0].tolist()))
```
The model generates plasmid DNA sequences conditioned on functional annotations (antibiotic resistance markers, origins of replication, promoters, reporters, etc.) provided as special tokens in the prompt.
## Input Format
```
<BOS><TOKEN1><TOKEN2>...<SEP>
```
The model generates DNA bases (A/T/C/G) after the `<SEP>` token until it produces `<EOS>` or hits the maximum length.
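Because the decoded output still contains the prompt's annotation tokens, a small post-processing step can isolate just the DNA. A minimal sketch (the `extract_dna` helper below is illustrative, not part of this repository):

```python
def extract_dna(decoded: str) -> str:
    """Return only the DNA bases between <SEP> and <EOS> (illustrative helper)."""
    # Keep everything after the separator token
    seq = decoded.split("<SEP>", 1)[-1]
    # Trim at end-of-sequence, if the model emitted one
    seq = seq.split("<EOS>", 1)[0]
    # Drop any padding tokens and surrounding whitespace
    return seq.replace("<PAD>", "").strip()

example = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>ATGCGT<EOS>"
print(extract_dna(example))  # ATGCGT
```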
## Special Tokens
| Token | Purpose |
|---|---|
| `<BOS>` | Beginning of sequence |
| `<EOS>` | End of sequence |
| `<SEP>` | Separator between prompt annotations and DNA sequence |
| `<PAD>` | Padding |
| `<AMR_*>` | Antibiotic resistance markers (e.g., `<AMR_KANAMYCIN>`, `<AMR_AMPICILLIN>`) |
| `<ORI_*>` | Origins of replication (e.g., `<ORI_COLE1>`, `<ORI_P15A>`) |
| `<PROM_*>` | Promoters (e.g., `<PROM_CMV>`, `<PROM_T7>`) |
| `<REP_*>` | Reporters (e.g., `<REP_EGFP>`, `<REP_MCHERRY>`) |
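Prompts can be assembled programmatically from these annotation tokens. A minimal sketch (the `build_prompt` helper is hypothetical, shown only to make the format concrete):

```python
def build_prompt(annotations: list[str]) -> str:
    """Join annotation special tokens into the <BOS>...<SEP> prompt format."""
    return "<BOS>" + "".join(annotations) + "<SEP>"

prompt = build_prompt(["<AMR_KANAMYCIN>", "<ORI_COLE1>", "<PROM_T7>"])
print(prompt)  # <BOS><AMR_KANAMYCIN><ORI_COLE1><PROM_T7><SEP>
```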
## Related Models
- [McClain/PlasmidLM-kmer6](https://huggingface.co/McClain/PlasmidLM-kmer6) — kmer6 tokenizer, 19.3M params, dense
- [McClain/PlasmidLM-kmer6-MoE](https://huggingface.co/McClain/PlasmidLM-kmer6-MoE) — kmer6 tokenizer, 78.3M total params, Mixture-of-Experts
## Limitations
- This is a **pretrained base model**: generated sequences are not optimized for functional element placement. Post-training with RL improves fidelity.
- Generated sequences are **not experimentally validated**. Always verify computationally and experimentally before synthesis.
- Trained on Addgene plasmids, which are biased toward commonly deposited vectors.
- Maximum context of 16K tokens (~16 kbp).
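Given these caveats, simple computational sanity checks (alphabet, length, GC content) are a reasonable first filter before deeper in-silico or experimental validation. A minimal illustrative sketch; the thresholds below are arbitrary assumptions, not recommendations from the authors:

```python
def sanity_check(seq: str, min_len: int = 1000, max_len: int = 16000) -> list[str]:
    """Flag obvious problems in a generated plasmid sequence (illustrative checks only)."""
    issues = []
    # Only canonical DNA bases should appear
    if set(seq) - set("ATCG"):
        issues.append("non-ACGT characters present")
    # Typical plasmids fall in a bounded size range
    if not (min_len <= len(seq) <= max_len):
        issues.append(f"length {len(seq)} outside [{min_len}, {max_len}]")
    # Extreme GC content is often a red flag
    gc = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
    if not 0.3 <= gc <= 0.7:
        issues.append(f"unusual GC fraction {gc:.2f}")
    return issues

print(sanity_check("ATGC" * 500))  # []
print(sanity_check("AT" * 1000))   # flags GC fraction 0.00
```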
## Citation
```bibtex
@misc{thiel2026plasmidlm,
title={PlasmidLM: Language Models for Plasmid DNA Generation},
author={Thiel, McClain},
year={2026}
}
```