---
library_name: transformers
license: apache-2.0
tags:
- biology
- genomics
- plasmid
- dna
- causal-lm
- synthetic-biology
language:
- en
pipeline_tag: text-generation
---

# PlasmidLM

A 17.7M-parameter autoregressive language model for **plasmid DNA sequence generation**, trained on ~108K plasmid sequences from Addgene.

## Model Details

| Property | Value |
|---|---|
| Parameters | 17.7M |
| Architecture | Transformer decoder (dense MLP), LLaMA-style |
| Hidden size | 384 |
| Layers | 10 |
| Attention heads | 8 |
| Intermediate size | 1,536 |
| Max sequence length | 16,384 tokens |
| Tokenizer | Character-level (single DNA bases) |
| Vocab size | 120 |

### Training

- **Data**: ~108K plasmid sequences from Addgene, annotated with functional components via pLannotate
- **Steps**: 15,000
- **Eval loss**: 0.093
- **Token accuracy**: 96.1%

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)

# Condition on antibiotic resistance + origin of replication
prompt = ""
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0]))
```

The model generates plasmid DNA sequences conditioned on functional annotations (antibiotic resistance markers, origins of replication, promoters, reporters, etc.) provided as special tokens in the prompt.

## Input Format

```
...
```

The model generates DNA bases (A/T/C/G) after the `` token until it produces `` or reaches the maximum length.
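Decoded generations may interleave the prompt's special tokens and whitespace with the DNA itself. As a minimal post-processing sketch (plain Python; `extract_dna` is an illustrative helper, not part of the PlasmidLM tooling, and it simply filters characters rather than parsing the annotation grammar):

```python
def extract_dna(decoded: str) -> str:
    """Keep only canonical DNA bases (A/T/C/G) from decoded model output.

    Anything else -- special tokens, whitespace, newlines -- is discarded.
    """
    return "".join(ch for ch in decoded.upper() if ch in "ATCG")

# Hypothetical decoded string with mixed case and whitespace:
print(extract_dna("ATG cga\nTT"))  # -> ATGCGATT
```

A stricter pipeline could instead split the decoded string on the separator token and validate each region, but a character filter is often enough for quick inspection.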
## Special Tokens

| Token | Purpose |
|---|---|
| `` | Beginning of sequence |
| `` | End of sequence |
| `` | Separator between prompt annotations and DNA sequence |
| `` | Padding |
| `` | Antibiotic resistance markers (e.g., ``, ``) |
| `` | Origins of replication (e.g., ``, ``) |
| `` | Promoters (e.g., ``, ``) |
| `` | Reporters (e.g., ``, ``) |

## Related Models

- [McClain/PlasmidLM-kmer6](https://huggingface.co/McClain/PlasmidLM-kmer6) — kmer6 tokenizer, 19.3M params, dense
- [McClain/PlasmidLM-kmer6-MoE](https://huggingface.co/McClain/PlasmidLM-kmer6-MoE) — kmer6 tokenizer, 78.3M total params, Mixture-of-Experts

## Limitations

- This is a **pretrained base model**: generated sequences are not optimized for functional element placement. Post-training with RL improves fidelity.
- Generated sequences are **not experimentally validated**. Always verify computationally and experimentally before synthesis.
- Trained on Addgene plasmids, which are biased toward commonly deposited vectors.
- Maximum context of 16K tokens (~16 kbp).

## Citation

```bibtex
@misc{thiel2026plasmidlm,
  title={PlasmidLM: Language Models for Plasmid DNA Generation},
  author={Thiel, McClain},
  year={2026}
}
```
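The limitations above recommend computational verification before synthesis. As a starting point, here is a minimal sanity-check sketch (plain Python; `basic_checks` and its criteria are illustrative assumptions, not part of the PlasmidLM tooling — real validation would also involve annotation tools such as pLannotate):

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def basic_checks(seq: str, max_len: int = 16_384) -> dict:
    """Cheap checks on a generated sequence before deeper in-silico analysis.

    max_len defaults to the model's 16,384-token context (~16 kbp at one
    base per token).
    """
    seq = seq.upper()
    return {
        "length_ok": 0 < len(seq) <= max_len,      # non-empty, within context
        "bases_ok": set(seq) <= set("ATCG"),       # only canonical bases
        "gc_content": round(gc_content(seq), 3),   # flag extreme GC skew
    }

print(basic_checks("ATGCATGCGG"))
# -> {'length_ok': True, 'bases_ok': True, 'gc_content': 0.6}
```

Passing these checks says nothing about biological function; experimental validation is still required.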