PlasmidLM-kmer6-MoE
A Mixture-of-Experts autoregressive language model with 78.3M total parameters (31.1M active per token) for plasmid DNA sequence generation, trained on ~100K plasmid sequences from Addgene.
Model Details
| Property | Value |
|---|---|
| Total parameters | 78.3M |
| Active parameters | 31.1M |
| Architecture | Transformer decoder with MoE MLP |
| Hidden size | 384 |
| Layers | 10 |
| Attention heads | 8 |
| Experts | 6 (top-2 routing) |
| Expert intermediate size | 1,536 |
| Max sequence length | 16,384 tokens |
| Tokenizer | k-mer (k=6, stride=3) |
| Vocab size | 4,208 |
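The two parameter counts can be roughly reproduced from the other values in the table. The sketch below is a back-of-the-envelope check, not the model's actual accounting: it assumes bias-free projections, four hidden-by-hidden attention matrices per layer, a two-matrix expert MLP, a linear router per layer, and one embedding matrix shared with the output head. It ignores normalization parameters. None of these assumptions is confirmed by the model card.

```python
# Values taken from the Model Details table.
HIDDEN = 384
LAYERS = 10
EXPERTS = 6
TOP_K = 2
EXPERT_INTERMEDIATE = 1536
VOCAB = 4208

# Assumed layout (not confirmed by the model card): no biases, Q/K/V/output
# attention projections, up + down projection per expert, tied embeddings.
attention = 4 * HIDDEN * HIDDEN
expert = 2 * HIDDEN * EXPERT_INTERMEDIATE
router = HIDDEN * EXPERTS
embedding = VOCAB * HIDDEN

# All experts count toward the total; only TOP_K of them run per token.
total = embedding + LAYERS * (attention + router + EXPERTS * expert)
active = embedding + LAYERS * (attention + router + TOP_K * expert)

print(f"total:  {total / 1e6:.1f}M")   # total:  78.3M
print(f"active: {active / 1e6:.1f}M")  # active: 31.1M
```

Under these assumptions the expert MLPs account for roughly 70.8M of the total, which is why routing each token to only two of six experts reduces the active count so sharply.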
Training
- Data: ~100K plasmid sequences from Addgene, tokenized with k-mer (k=6, stride=3)
- Steps: 35,000
- Eval loss: 0.190
- Token accuracy: 98.4%
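To illustrate what k-mer tokenization with k=6 and stride=3 means, here is a minimal sketch. The function `kmer_tokenize` is hypothetical and is not the model's tokenizer; in particular, how the real tokenizer handles trailing bases that do not fill a complete k-mer, or non-ACGT characters, is not stated in this card.

```python
def kmer_tokenize(sequence: str, k: int = 6, stride: int = 3) -> list[str]:
    """Split a DNA string into overlapping k-mers, one every `stride` bases."""
    sequence = sequence.upper()
    # Trailing bases that do not fill a complete k-mer are dropped here;
    # this is an assumption of the sketch, not documented model behavior.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


tokens = kmer_tokenize("ATGAGCCATATTCAA")
print(tokens)  # ['ATGAGC', 'AGCCAT', 'CATATT', 'ATTCAA']
```

With stride 3 and k=6, consecutive tokens overlap by three bases. There are 4^6 = 4,096 possible 6-mers over A, C, G, and T, so if all of them are in the vocabulary, the remaining 112 of the 4,208 entries would cover special and other tokens.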
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("McClain/PlasmidLM-kmer6-MoE", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("McClain/PlasmidLM-kmer6-MoE", trust_remote_code=True)

# Condition on antibiotic resistance + origin of replication
prompt = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.8, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0].tolist()))
```
The model generates plasmid DNA sequences conditioned on functional annotations (antibiotic resistance markers, origins of replication) provided as special tokens in the prompt.
MoE Architecture
Each transformer layer replaces the standard dense MLP with a Mixture-of-Experts layer containing 6 expert MLPs. A learned router selects the top-2 experts per token, so only 31.1M of the 78.3M total parameters are active for any given token. This increases model capacity while keeping per-token inference cost close to that of a 31.1M-parameter dense model.
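The routing step can be sketched in plain Python. This is an illustration of generic top-2 routing, not the model's implementation: whether PlasmidLM applies the softmax before or after selecting experts, and whether it renormalizes the two kept weights, is not stated in this card. The names `top2_route` and `moe_forward` are hypothetical.

```python
import math


def top2_route(router_logits: list[float]) -> list[tuple[int, float]]:
    """Return (expert_index, weight) pairs for the two highest-scoring experts."""
    # Softmax over all experts, shifted by the max for numerical stability.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    denom = sum(exps)
    probs = [e / denom for e in exps]
    # Keep the two most probable experts and renormalize their weights
    # so they sum to 1 (an assumption of this sketch).
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    kept = sum(probs[i] for i in top2)
    return [(i, probs[i] / kept) for i in top2]


def moe_forward(x, experts, router_logits):
    """Weighted sum of the two selected experts' outputs; the others never run."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))


# Toy usage: six scalar functions stand in for the six expert MLPs.
experts = [lambda x, s=s: s * x for s in range(1, 7)]
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0]
routes = top2_route(logits)
print([i for i, _ in routes])  # [1, 4]
print(moe_forward(1.0, experts, logits))
```

Only the two selected experts are evaluated for a given token, which is the source of the gap between total and active parameters.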
Special Tokens
| Token | Purpose |
|---|---|
| `<BOS>` | Beginning of sequence |
| `<EOS>` | End of sequence |
| `<SEP>` | Separator between prompt annotations and DNA sequence |
| `<PAD>` | Padding |
| `<AMR_*>` | Antibiotic resistance markers (e.g., `<AMR_KANAMYCIN>`, `<AMR_AMPICILLIN>`) |
| `<ORI_*>` | Origins of replication (e.g., `<ORI_COLE1>`, `<ORI_P15A>`) |
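A conditioning prompt in the format used in the Usage section can be assembled from these tokens. The helper below is a hypothetical convenience function, not part of the model's code. It only formats strings and does not check that the resulting tokens exist in the tokenizer's vocabulary, so names beyond the examples listed in this card should be verified against the loaded tokenizer.

```python
def build_prompt(amr: str, ori: str) -> str:
    """Assemble a prompt from a resistance marker name and an origin name.

    Hypothetical helper: follows the <BOS><AMR_*><ORI_*><SEP> pattern shown
    in the Usage example and upper-cases the names to match the token style.
    """
    return f"<BOS><AMR_{amr.upper()}><ORI_{ori.upper()}><SEP>"


prompt = build_prompt("kanamycin", "colE1")
print(prompt)  # <BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>
```

The returned string can be passed to the tokenizer in place of the hard-coded prompt.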
Citation
If you use this model, please cite:
```bibtex
@misc{thiel2026plasmidlm,
  title={PlasmidLM: Language Models for Plasmid DNA Generation},
  author={Thiel, McClain},
  year={2026}
}
```