---
library_name: transformers
license: apache-2.0
tags:
  - biology
  - genomics
  - plasmid
  - dna
  - causal-lm
  - synthetic-biology
language:
  - en
pipeline_tag: text-generation
---

# PlasmidLM

A 17.7M-parameter autoregressive language model for plasmid DNA sequence generation, trained on ~108K plasmid sequences from Addgene.

## Model Details

| Property | Value |
|---|---|
| Parameters | 17.7M |
| Architecture | Transformer decoder (dense MLP), LLaMA-style |
| Hidden size | 384 |
| Layers | 10 |
| Attention heads | 8 |
| Intermediate size | 1,536 |
| Max sequence length | 16,384 tokens |
| Tokenizer | Character-level (single DNA bases) |
| Vocab size | 120 |
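The headline parameter count can be sanity-checked from the table. A back-of-envelope sketch, assuming standard Q/K/V/O attention projections, a dense two-matrix (up/down) MLP, and a single embedding table, while ignoring norm and bias parameters:

```python
# Back-of-envelope parameter count from the table above.
# Assumptions (not stated in the card): standard Q/K/V/O attention
# projections, a dense 2-matrix MLP, norms and biases ignored.
hidden, layers, inter, vocab = 384, 10, 1536, 120

attn_per_layer = 4 * hidden * hidden   # Q, K, V, O projections
mlp_per_layer = 2 * hidden * inter     # up- and down-projection
block_params = layers * (attn_per_layer + mlp_per_layer)
embed_params = vocab * hidden          # token embedding table

total = block_params + embed_params
print(f"{total / 1e6:.2f}M")  # ≈ 17.74M, matching the reported 17.7M
```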

## Training

- Data: ~108K plasmid sequences from Addgene, annotated with functional components via pLannotate
- Steps: 15,000
- Eval loss: 0.093
- Token accuracy: 96.1%

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)

# Condition on antibiotic resistance + origin of replication
prompt = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.8, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0].tolist()))
```

The model generates plasmid DNA sequences conditioned on functional annotations (antibiotic resistance markers, origins of replication, promoters, reporters, etc.) provided as special tokens in the prompt.
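The decoded output still contains the prompt's special tokens. A small post-processing sketch (the helper name is illustrative, not part of the model's API) can pull out just the generated bases:

```python
import re

def extract_dna(decoded: str) -> str:
    """Return the DNA payload between <SEP> and <EOS> (or end of string).
    Illustrative post-processing, not part of the model's API."""
    # Take everything after the first <SEP>; stop at <EOS> if present.
    after_sep = decoded.split("<SEP>", 1)[-1]
    dna = after_sep.split("<EOS>", 1)[0]
    # Drop any remaining special tokens (e.g. trailing <PAD>) and whitespace.
    dna = re.sub(r"<[^>]+>|\s", "", dna)
    assert set(dna) <= set("ATCG"), "unexpected characters in generated sequence"
    return dna

print(extract_dna("<BOS><AMR_KANAMYCIN><SEP>ATGCGT<EOS><PAD>"))  # ATGCGT
```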

## Input Format

```
<BOS><TOKEN1><TOKEN2>...<SEP>
```

The model generates DNA bases (A/T/C/G) after the `<SEP>` token until it produces `<EOS>` or hits the maximum length.
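Prompts can be assembled programmatically from annotation tokens. A minimal sketch (the helper is hypothetical; token spellings follow the examples in this card):

```python
def build_prompt(*annotations: str) -> str:
    """Wrap annotation tokens in <BOS>...<SEP> as the model expects.
    Hypothetical helper; annotations must already be special tokens."""
    for tok in annotations:
        # Each annotation must be a special token like <ORI_COLE1>.
        if not (tok.startswith("<") and tok.endswith(">")):
            raise ValueError(f"not a special token: {tok!r}")
    return "<BOS>" + "".join(annotations) + "<SEP>"

print(build_prompt("<AMR_KANAMYCIN>", "<ORI_COLE1>"))
# <BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>
```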

## Special Tokens

| Token | Purpose |
|---|---|
| `<BOS>` | Beginning of sequence |
| `<EOS>` | End of sequence |
| `<SEP>` | Separator between prompt annotations and DNA sequence |
| `<PAD>` | Padding |
| `<AMR_*>` | Antibiotic resistance markers (e.g., `<AMR_KANAMYCIN>`, `<AMR_AMPICILLIN>`) |
| `<ORI_*>` | Origins of replication (e.g., `<ORI_COLE1>`, `<ORI_P15A>`) |
| `<PROM_*>` | Promoters (e.g., `<PROM_CMV>`, `<PROM_T7>`) |
| `<REP_*>` | Reporters (e.g., `<REP_EGFP>`, `<REP_MCHERRY>`) |
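When inspecting a tokenizer's vocabulary, the wildcard families can be bucketed by prefix. A sketch (the function is hypothetical; the prefix list comes from the table above):

```python
from collections import defaultdict

# Functional-annotation prefixes from the special-tokens table.
CATEGORY_PREFIXES = ("AMR_", "ORI_", "PROM_", "REP_")

def group_annotation_tokens(vocab_tokens):
    """Bucket special tokens like <AMR_KANAMYCIN> by functional category.
    Hypothetical helper for exploring the vocabulary."""
    groups = defaultdict(list)
    for tok in vocab_tokens:
        name = tok.strip("<>")
        for prefix in CATEGORY_PREFIXES:
            if name.startswith(prefix):
                groups[prefix.rstrip("_")].append(tok)
    return dict(groups)

print(group_annotation_tokens(["<AMR_KANAMYCIN>", "<ORI_COLE1>", "<PROM_T7>", "<PAD>"]))
```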

## Limitations

- This is a pretrained base model; generated sequences are not optimized for functional element placement. Post-training with RL improves fidelity.
- Generated sequences are not experimentally validated. Always verify computationally and experimentally before synthesis.
- Trained on Addgene plasmids, which are biased toward commonly deposited vectors.
- Maximum context of 16K tokens (~16 kbp).

## Citation

```bibtex
@misc{thiel2026plasmidlm,
  title={PlasmidLM: Language Models for Plasmid DNA Generation},
  author={Thiel, McClain},
  year={2026}
}
```