---
library_name: transformers
license: apache-2.0
tags:
- biology
- genomics
- plasmid
- dna
- causal-lm
- synthetic-biology
language:
- en
pipeline_tag: text-generation
---

# PlasmidLM

A 17.7M parameter autoregressive language model for **plasmid DNA sequence generation**, trained on ~108K plasmid sequences from Addgene.

## Model Details

| Property | Value |
|---|---|
| Parameters | 17.7M |
| Architecture | Transformer decoder (dense MLP), LLaMA-style |
| Hidden size | 384 |
| Layers | 10 |
| Attention heads | 8 |
| Intermediate size | 1,536 |
| Max sequence length | 16,384 tokens |
| Tokenizer | Character-level (single DNA bases) |
| Vocab size | 120 |

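As a back-of-envelope check, the table's figures roughly reproduce the stated parameter count. This is a sketch under assumptions not stated in the card: a two-matrix MLP (up and down projections only) and no weight tying, with norm and bias parameters ignored as negligible.

```python
# Rough parameter count from the Model Details table (illustrative only).
hidden, layers, inter, vocab = 384, 10, 1536, 120

attn_per_layer = 4 * hidden * hidden   # Q, K, V, O projections
mlp_per_layer = 2 * hidden * inter     # up + down projections (assumed, not SwiGLU)
embeddings = vocab * hidden            # token embedding table

total = layers * (attn_per_layer + mlp_per_layer) + embeddings
print(f"{total / 1e6:.1f}M")  # 17.7M, matching the reported size
```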
### Training

- **Data**: ~108K plasmid sequences from Addgene, annotated with functional components via pLannotate
- **Steps**: 15,000
- **Eval loss**: 0.093
- **Token accuracy**: 96.1%

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)

# Condition on antibiotic resistance + origin of replication
prompt = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.8, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0].tolist()))
```

The model generates plasmid DNA sequences conditioned on functional annotations (antibiotic resistance markers, origins of replication, promoters, reporters, etc.) provided as special tokens in the prompt.

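After decoding, the prompt tokens are still part of the output string. A minimal post-processing sketch to recover just the DNA bases (the `decoded` string here is an illustrative example, not real model output):

```python
import re

# Example decoded output in the format described in this card.
decoded = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>ATGCATGCCGTA<EOS>"

# Keep everything after <SEP>, then strip any remaining special tokens.
dna = decoded.split("<SEP>", 1)[1]
dna = re.sub(r"<[^>]+>", "", dna)

assert set(dna) <= set("ATCG")  # character-level tokenizer emits single bases
print(dna)  # ATGCATGCCGTA
```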
## Input Format

```
<BOS><TOKEN1><TOKEN2>...<SEP>
```

The model generates DNA bases (A/T/C/G) after the `<SEP>` token until it produces `<EOS>` or hits the maximum length.

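Prompts in this format can be assembled programmatically. The `build_prompt` helper below is a hypothetical convenience function, not part of the released tokenizer:

```python
def build_prompt(annotations):
    """Join annotation tokens into the <BOS>...<SEP> prompt format."""
    return "<BOS>" + "".join(annotations) + "<SEP>"

prompt = build_prompt(["<AMR_KANAMYCIN>", "<ORI_COLE1>", "<PROM_T7>"])
print(prompt)  # <BOS><AMR_KANAMYCIN><ORI_COLE1><PROM_T7><SEP>
```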
## Special Tokens

| Token | Purpose |
|---|---|
| `<BOS>` | Beginning of sequence |
| `<EOS>` | End of sequence |
| `<SEP>` | Separator between prompt annotations and DNA sequence |
| `<PAD>` | Padding |
| `<AMR_*>` | Antibiotic resistance markers (e.g., `<AMR_KANAMYCIN>`, `<AMR_AMPICILLIN>`) |
| `<ORI_*>` | Origins of replication (e.g., `<ORI_COLE1>`, `<ORI_P15A>`) |
| `<PROM_*>` | Promoters (e.g., `<PROM_CMV>`, `<PROM_T7>`) |
| `<REP_*>` | Reporters (e.g., `<REP_EGFP>`, `<REP_MCHERRY>`) |

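Because the annotation tokens share a `<CATEGORY_NAME>` naming scheme, they can be grouped by category prefix. A small illustrative sketch (the token list here is the subset named in the table, not the full vocabulary):

```python
tokens = ["<AMR_KANAMYCIN>", "<AMR_AMPICILLIN>", "<ORI_COLE1>", "<PROM_T7>", "<REP_EGFP>"]

by_category = {}
for tok in tokens:
    prefix = tok[1:].split("_", 1)[0]   # e.g. "AMR" from "<AMR_KANAMYCIN>"
    by_category.setdefault(prefix, []).append(tok)

print(by_category["AMR"])  # ['<AMR_KANAMYCIN>', '<AMR_AMPICILLIN>']
```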
## Related Models

- [McClain/PlasmidLM-kmer6](https://huggingface.co/McClain/PlasmidLM-kmer6) — kmer6 tokenizer, 19.3M params, dense
- [McClain/PlasmidLM-kmer6-MoE](https://huggingface.co/McClain/PlasmidLM-kmer6-MoE) — kmer6 tokenizer, 78.3M total params, Mixture-of-Experts

## Limitations

- This is a **pretrained base model**: generated sequences are not optimized for functional element placement. Post-training with RL improves fidelity.
- Generated sequences are **not experimentally validated**. Always verify computationally and experimentally before synthesis.
- Trained on Addgene plasmids, which are biased toward commonly deposited vectors.
- Maximum context of 16,384 tokens (~16 kbp at one base per token).

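One cheap computational sanity check before any downstream use is GC content; bacterial cloning vectors often sit near ~50%, though GC content alone says nothing about function. A minimal sketch:

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# Illustrative toy sequence, not real model output.
print(f"{gc_content('ATGCGCGCATTA'):.2f}")  # 0.50
```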
## Citation

```bibtex
@misc{thiel2026plasmidlm,
  title={PlasmidLM: Language Models for Plasmid DNA Generation},
  author={Thiel, McClain},
  year={2026}
}
```