# PlasmidLM-kmer6

A 19.3M-parameter autoregressive language model for plasmid DNA sequence generation, trained on ~100K plasmid sequences from Addgene.

## Model Details
| Property | Value |
|---|---|
| Parameters | 19.3M |
| Architecture | Transformer decoder (dense MLP) |
| Hidden size | 384 |
| Layers | 10 |
| Attention heads | 8 |
| Intermediate size | 1,536 |
| Max sequence length | 16,384 tokens |
| Tokenizer | k-mer (k=6, stride=3) |
| Vocab size | 4,208 |
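The vocabulary size is consistent with a 6-mer tokenizer over the DNA alphabet: a quick sanity check, assuming the remainder is made up of special and auxiliary tokens (that exact breakdown is an assumption, not stated above).

```python
# Sanity check on the vocab size for a k=6 tokenizer over {A, C, G, T}.
num_kmers = 4 ** 6              # all possible 6-mers
num_other = 4208 - num_kmers    # special/auxiliary tokens (assumed breakdown)
print(num_kmers, num_other)     # 4096 112
```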
## Training
- Data: ~100K plasmid sequences from Addgene, tokenized with k-mer (k=6, stride=3)
- Steps: 65,000
- Eval loss: 0.129
- Token accuracy: 97.4%
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("McClain/PlasmidLM-kmer6", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("McClain/PlasmidLM-kmer6", trust_remote_code=True)

# Condition on antibiotic resistance + origin of replication
prompt = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.8, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0].tolist()))
```
The model generates plasmid DNA sequences conditioned on functional annotations (antibiotic resistance markers, origins of replication) provided as special tokens in the prompt.
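With k=6 and stride=3, consecutive k-mer tokens overlap by three bases, so the generated token stream can be reassembled into a single DNA string. A minimal sketch, assuming this overlap scheme and that special tokens (anything in angle brackets) should be skipped; `kmers_to_dna` is a hypothetical helper, not part of the released tokenizer:

```python
def kmers_to_dna(tokens, stride=3):
    """Reassemble a DNA string from overlapping k-mer tokens.

    Assumes a k=6, stride=3 tokenizer, so each token after the first
    contributes its last `stride` bases; special tokens are skipped.
    """
    seq = ""
    for tok in tokens:
        if tok.startswith("<"):          # skip <BOS>, <SEP>, etc.
            continue
        seq = tok if not seq else seq + tok[-stride:]
    return seq

print(kmers_to_dna(["<SEP>", "ATGCGT", "CGTAAA", "AAATTT"]))
# ATGCGTAAATTT
```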
## Special Tokens

| Token | Purpose |
|---|---|
| `<BOS>` | Beginning of sequence |
| `<EOS>` | End of sequence |
| `<SEP>` | Separator between prompt annotations and DNA sequence |
| `<PAD>` | Padding |
| `<AMR_*>` | Antibiotic resistance markers (e.g., `<AMR_KANAMYCIN>`, `<AMR_AMPICILLIN>`) |
| `<ORI_*>` | Origins of replication (e.g., `<ORI_COLE1>`, `<ORI_P15A>`) |
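Conditioning prompts are just concatenations of these special tokens followed by `<SEP>`. A small sketch with a hypothetical `build_prompt` helper, using only marker names from the table above:

```python
# Hypothetical helper: assemble a conditioning prompt from annotations.
def build_prompt(amr, ori):
    """Return a prompt of the form <BOS><AMR_...><ORI_...><SEP>."""
    return f"<BOS><AMR_{amr}><ORI_{ori}><SEP>"

print(build_prompt("AMPICILLIN", "P15A"))
# <BOS><AMR_AMPICILLIN><ORI_P15A><SEP>
```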
## Citation

If you use this model, please cite:

```bibtex
@misc{thiel2026plasmidlm,
  title={PlasmidLM: Language Models for Plasmid DNA Generation},
  author={Thiel, McClain},
  year={2026}
}
```