# PlasmidGPT-244M
A conditional DNA language model for generating plasmid sequences based on specified biological properties.
## Model Description
PlasmidGPT is a GPT-2-style transformer trained to generate plasmid DNA sequences conditioned on biological properties such as host organism, antibiotic resistance, and GC content.
- Parameters: 244M
- Context Length: 16,384 tokens
- Architecture: 18 layers, 16 heads, 1024 embedding dim
- Vocabulary: 72 tokens (special tokens + condition tokens + ACGT nucleotides)
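For reference, these specifications map onto a standard Hugging Face `GPT2Config` roughly as follows (a sketch; fields not listed on this card, such as dropout and activation, are left at library defaults and may differ from the actual training config):

```python
from transformers import GPT2Config

# Approximate config implied by the specs above; unlisted fields
# (dropout, activation, etc.) are assumed defaults.
config = GPT2Config(
    vocab_size=72,      # special tokens + condition tokens + A/C/G/T
    n_positions=16384,  # context length
    n_embd=1024,        # embedding dimension
    n_layer=18,         # transformer layers
    n_head=16,          # attention heads
)
```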
## Training
- Dataset: Deduplicated Addgene plasmid sequences (13,260 unique prompt-sequence pairs)
- Training Steps: 10,000
- Best Validation Loss: 0.721
- Best Validation Perplexity: 2.06
- WandB Run: xotxj25a
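The reported perplexity is simply the exponential of the per-token cross-entropy loss, so the two validation numbers above are consistent:

```python
import math

# Perplexity = exp(cross-entropy loss per token)
print(math.exp(0.721))  # ≈ 2.06, matching the reported validation perplexity
```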
## Conditioning Tokens
The model accepts the following condition tokens:
| Category | Tokens |
|---|---|
| Host | `<HOST:ECOLI>`, `<HOST:HUMAN>`, `<HOST:MAMMALIAN>`, `<HOST:MOUSE>`, `<HOST:PLANT>`, `<HOST:RAT>`, `<HOST:SYNTHETIC>`, `<HOST:WORM>`, `<HOST:YEAST>` |
| Resistance | `<RESISTANCE:AMP>`, `<RESISTANCE:KAN>`, `<RESISTANCE:SPEC>`, `<RESISTANCE:CHLOR>`, `<RESISTANCE:GENT>`, `<RESISTANCE:STREP>`, `<RESISTANCE:TET>` |
| Length | `<LENGTH:SHORT>`, `<LENGTH:MEDIUM>`, `<LENGTH:LONG>` |
| GC Content | `<GC:LOW>`, `<GC:MEDIUM>`, `<GC:HIGH>` |
| Application | `<APPLICATION:CLONING>`, `<APPLICATION:CRISPR>`, `<APPLICATION:EDITING>`, `<APPLICATION:EXPRESSION>`, `<APPLICATION:RECOMBINATION>`, `<APPLICATION:REPORTER>`, `<APPLICATION:RNAI>` |
| Copy Number | `<COPY:HIGH>`, `<COPY:LOW>` |
| Promoter | `<PROMOTER:BAD>`, `<PROMOTER:CAG>`, `<PROMOTER:CBH>`, `<PROMOTER:CMV>`, `<PROMOTER:HSYN>`, `<PROMOTER:LAC>`, `<PROMOTER:PGK>`, `<PROMOTER:POLYH>`, `<PROMOTER:SFFV>`, `<PROMOTER:TAC>`, `<PROMOTER:TRE>` |
| Vector Type | `<VECTOR:AAV>`, `<VECTOR:LENTIVIRAL>`, `<VECTOR:RETROVIRAL>`, `<VECTOR:TRANSPOSON>` |
| Tags | `<TAG:FLAG>`, `<TAG:GFP>`, `<TAG:GST>`, `<TAG:HA>`, `<TAG:HIS>`, `<TAG:MBP>`, `<TAG:MCHERRY>`, `<TAG:MYC>`, `<TAG:NLS>`, `<TAG:SNAP>` |
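Condition tokens can be combined freely in a single prompt; anything not in the vocabulary falls back to `<UNK>` during encoding. A quick sanity check against the loaded `vocab` (see Usage below) catches typos before generation; `check_prompt` is an illustrative helper, not part of the model's API:

```python
import re

def check_prompt(prompt, vocab):
    # Illustrative helper (not part of the model's API): verify that
    # every <CATEGORY:VALUE> token in the prompt exists in the vocab.
    unknown = [tok for tok in re.findall(r"<[A-Z_]+:[A-Z_]+>", prompt)
               if tok not in vocab]
    if unknown:
        raise ValueError(f"Unknown condition tokens: {unknown}")

check_prompt("<HOST:YEAST><RESISTANCE:KAN><APPLICATION:EXPRESSION>", vocab)
```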
## Usage
```python
import json
import re

import torch
from transformers import GPT2LMHeadModel

# Load the model
model = GPT2LMHeadModel.from_pretrained("mcclain/plasmid-gpt-244m")
model.eval()

# Load the vocabulary for encoding/decoding (vocab.json from the model repo)
with open("vocab.json") as f:
    vocab = json.load(f)
id_to_token = {v: k for k, v in vocab.items()}

def encode(text):
    # Condition tokens (e.g. <HOST:ECOLI>) map to single IDs;
    # everything else is encoded character by character.
    tokens = [vocab["<BOS>"]]
    pos = 0
    for match in re.finditer(r"<[A-Z_]+:[A-Z_]+>|<[A-Z]+>", text):
        for char in text[pos:match.start()]:
            tokens.append(vocab.get(char, vocab["<UNK>"]))
        tokens.append(vocab.get(match.group(), vocab["<UNK>"]))
        pos = match.end()
    for char in text[pos:]:
        tokens.append(vocab.get(char, vocab["<UNK>"]))
    return tokens

def decode(token_ids):
    return "".join(id_to_token.get(t, "<UNK>") for t in token_ids)

# Generate a sequence conditioned on host, resistance, length, and GC content
prompt = "<HOST:ECOLI><RESISTANCE:AMP><LENGTH:MEDIUM><GC:MEDIUM><SEQ>"
input_ids = torch.tensor([encode(prompt)])
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=500,
        do_sample=True,
        temperature=0.85,
        top_k=50,
        repetition_penalty=1.15,
        pad_token_id=0,
        eos_token_id=2,
    )

# Keep only the newly generated tokens (drop the prompt)
generated = decode(output[0, input_ids.shape[1]:].tolist())
print(generated.replace("<EOS>", ""))
```
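Because generation is sampled, conditioning is a statistical tendency rather than a guarantee, so it can be worth verifying requested properties on the output. The snippet below checks GC content; the 40-60% band used for `<GC:MEDIUM>` is an assumed range, not a documented threshold:

```python
def gc_content(seq):
    # Fraction of G/C among A/C/G/T characters in the generated sequence
    bases = [c for c in seq if c in "ACGT"]
    return (bases.count("G") + bases.count("C")) / max(len(bases), 1)

gc = gc_content(generated)
# NOTE: the 40-60% "medium" band below is an assumption, not a
# documented property of the <GC:MEDIUM> token.
if 0.40 <= gc <= 0.60:
    print(f"GC content {gc:.1%}: within assumed MEDIUM band")
else:
    print(f"GC content {gc:.1%}: outside assumed MEDIUM band")
```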
## Related Models
- plasmid-gpt-319m: a larger variant with lower validation perplexity (1.88)
## Citation
If you use this model, please cite:
```bibtex
@misc{plasmidgpt2024,
  title={PlasmidGPT: Conditional DNA Language Model for Plasmid Sequence Generation},
  author={McClain},
  year={2024},
  publisher={HuggingFace}
}
```
## License
MIT License