---
library_name: transformers
license: apache-2.0
tags:
- biology
- genomics
- plasmid
- dna
- causal-lm
- synthetic-biology
language:
- en
pipeline_tag: text-generation
---
# PlasmidLM
A 17.7M parameter autoregressive language model for **plasmid DNA sequence generation**, trained on ~108K plasmid sequences from Addgene.
## Model Details
| Property | Value |
|---|---|
| Parameters | 17.7M |
| Architecture | Transformer decoder (dense MLP), LLaMA-style |
| Hidden size | 384 |
| Layers | 10 |
| Attention heads | 8 |
| Intermediate size | 1,536 |
| Max sequence length | 16,384 tokens |
| Tokenizer | Character-level (single DNA bases) |
| Vocab size | 120 |
### Training
- **Data**: ~108K plasmid sequences from Addgene, annotated with functional components via pLannotate
- **Steps**: 15,000
- **Eval loss**: 0.093
- **Token accuracy**: 96.1%
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and its character-level tokenizer
model = AutoModelForCausalLM.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)

# Condition on an antibiotic resistance marker and an origin of replication
prompt = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a plasmid sequence; DNA bases are generated after <SEP>
outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.8, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0].tolist()))
```
The model generates plasmid DNA sequences conditioned on functional annotations (antibiotic resistance markers, origins of replication, promoters, reporters, etc.) provided as special tokens in the prompt.
## Input Format
```
<BOS><TOKEN1><TOKEN2>...<SEP>
```
The model generates DNA bases (A/T/C/G) after the `<SEP>` token until it produces `<EOS>` or hits the maximum length.
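Because the decoded output still contains the prompt's annotation tokens, a small post-processing step can isolate just the DNA. A minimal sketch (the `extract_dna` helper below is illustrative, not part of this repository):

```python
def extract_dna(decoded: str) -> str:
    """Return only the DNA bases between <SEP> and <EOS> (illustrative helper)."""
    # Keep everything after the separator token
    seq = decoded.split("<SEP>", 1)[-1]
    # Trim at end-of-sequence, if the model emitted one
    seq = seq.split("<EOS>", 1)[0]
    # Drop any padding tokens and surrounding whitespace
    return seq.replace("<PAD>", "").strip()

example = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>ATGCGT<EOS>"
print(extract_dna(example))  # ATGCGT
```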
## Special Tokens
| Token | Purpose |
|---|---|
| `<BOS>` | Beginning of sequence |
| `<EOS>` | End of sequence |
| `<SEP>` | Separator between prompt annotations and DNA sequence |
| `<PAD>` | Padding |
| `<AMR_*>` | Antibiotic resistance markers (e.g., `<AMR_KANAMYCIN>`, `<AMR_AMPICILLIN>`) |
| `<ORI_*>` | Origins of replication (e.g., `<ORI_COLE1>`, `<ORI_P15A>`) |
| `<PROM_*>` | Promoters (e.g., `<PROM_CMV>`, `<PROM_T7>`) |
| `<REP_*>` | Reporters (e.g., `<REP_EGFP>`, `<REP_MCHERRY>`) |
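Prompts can be assembled programmatically from these annotation tokens. A minimal sketch (the `build_prompt` helper is hypothetical, shown only to make the format concrete):

```python
def build_prompt(annotations: list[str]) -> str:
    """Join annotation special tokens into the <BOS>...<SEP> prompt format."""
    return "<BOS>" + "".join(annotations) + "<SEP>"

prompt = build_prompt(["<AMR_KANAMYCIN>", "<ORI_COLE1>", "<PROM_T7>"])
print(prompt)  # <BOS><AMR_KANAMYCIN><ORI_COLE1><PROM_T7><SEP>
```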
## Related Models
- [McClain/PlasmidLM-kmer6](https://huggingface.co/McClain/PlasmidLM-kmer6) — kmer6 tokenizer, 19.3M params, dense
- [McClain/PlasmidLM-kmer6-MoE](https://huggingface.co/McClain/PlasmidLM-kmer6-MoE) — kmer6 tokenizer, 78.3M total params, Mixture-of-Experts
## Limitations
- This is a **pretrained base model**: generated sequences are not optimized for functional element placement. Post-training with RL improves fidelity.
- Generated sequences are **not experimentally validated**. Always verify computationally and experimentally before synthesis.
- Trained on Addgene plasmids, which are biased toward commonly deposited vectors.
- Maximum context of 16K tokens (~16 kbp).
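Given these caveats, simple computational sanity checks (alphabet, length, GC content) are a reasonable first filter before deeper in-silico or experimental validation. A minimal illustrative sketch; the thresholds below are arbitrary assumptions, not recommendations from the authors:

```python
def sanity_check(seq: str, min_len: int = 1000, max_len: int = 16000) -> list[str]:
    """Flag obvious problems in a generated plasmid sequence (illustrative checks only)."""
    issues = []
    # Only canonical DNA bases should appear
    if set(seq) - set("ATCG"):
        issues.append("non-ACGT characters present")
    # Typical plasmids fall in a bounded size range
    if not (min_len <= len(seq) <= max_len):
        issues.append(f"length {len(seq)} outside [{min_len}, {max_len}]")
    # Extreme GC content is often a red flag
    gc = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
    if not 0.3 <= gc <= 0.7:
        issues.append(f"unusual GC fraction {gc:.2f}")
    return issues

print(sanity_check("ATGC" * 500))  # []
print(sanity_check("AT" * 1000))   # flags GC fraction 0.00
```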
## Citation
```bibtex
@misc{thiel2026plasmidlm,
title={PlasmidLM: Language Models for Plasmid DNA Generation},
author={Thiel, McClain},
year={2026}
}
```