PlasmidGPT-244M

A conditional DNA language model for generating plasmid sequences based on specified biological properties.

Model Description

PlasmidGPT is a GPT-2 style transformer trained to generate plasmid DNA sequences conditioned on biological properties like host organism, antibiotic resistance, GC content, and more.

  • Parameters: 244M
  • Context Length: 16,384 tokens
  • Architecture: 18 layers, 16 heads, 1024 embedding dim
  • Vocabulary: 72 tokens (special tokens + condition tokens + ACGT nucleotides)
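The stated configuration roughly accounts for the 244M figure. A back-of-the-envelope estimate using the standard GPT-2 parameter formula (assuming learned positional embeddings and a tied token embedding / LM head; biases and layer norms are ignored, so this slightly undercounts):

```python
# Approximate GPT-2 parameter count from the card's config.
# 12 * n_layer * d^2 covers the attention + feed-forward weights per block.
n_layer, d = 18, 1024
n_ctx, vocab_size = 16_384, 72

block_params = 12 * n_layer * d * d   # transformer blocks
pos_emb = n_ctx * d                   # learned positional embeddings
tok_emb = vocab_size * d              # token embeddings (tied with LM head)
total = block_params + pos_emb + tok_emb
print(f"{total / 1e6:.1f}M")          # ~243M, consistent with the stated 244M
```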

Training

  • Dataset: Deduplicated Addgene plasmid sequences (13,260 unique prompt-sequence pairs)
  • Training Steps: 10,000
  • Best Validation Loss: 0.721
  • Best Validation Perplexity: 2.06
  • WandB Run: xotxj25a
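The reported perplexity follows directly from the loss, since perplexity is the exponential of the per-token cross-entropy:

```python
import math

# Perplexity = exp(cross-entropy loss per token)
val_loss = 0.721
ppl = math.exp(val_loss)
print(round(ppl, 2))  # 2.06, matching the reported validation perplexity
```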

Conditioning Tokens

The model accepts the following condition tokens:

  • Host: <HOST:ECOLI>, <HOST:HUMAN>, <HOST:MAMMALIAN>, <HOST:MOUSE>, <HOST:PLANT>, <HOST:RAT>, <HOST:SYNTHETIC>, <HOST:WORM>, <HOST:YEAST>
  • Resistance: <RESISTANCE:AMP>, <RESISTANCE:KAN>, <RESISTANCE:SPEC>, <RESISTANCE:CHLOR>, <RESISTANCE:GENT>, <RESISTANCE:STREP>, <RESISTANCE:TET>
  • Length: <LENGTH:SHORT>, <LENGTH:MEDIUM>, <LENGTH:LONG>
  • GC Content: <GC:LOW>, <GC:MEDIUM>, <GC:HIGH>
  • Application: <APPLICATION:CLONING>, <APPLICATION:CRISPR>, <APPLICATION:EDITING>, <APPLICATION:EXPRESSION>, <APPLICATION:RECOMBINATION>, <APPLICATION:REPORTER>, <APPLICATION:RNAI>
  • Copy Number: <COPY:HIGH>, <COPY:LOW>
  • Promoter: <PROMOTER:BAD>, <PROMOTER:CAG>, <PROMOTER:CBH>, <PROMOTER:CMV>, <PROMOTER:HSYN>, <PROMOTER:LAC>, <PROMOTER:PGK>, <PROMOTER:POLYH>, <PROMOTER:SFFV>, <PROMOTER:TAC>, <PROMOTER:TRE>
  • Vector Type: <VECTOR:AAV>, <VECTOR:LENTIVIRAL>, <VECTOR:RETROVIRAL>, <VECTOR:TRANSPOSON>
  • Tags: <TAG:FLAG>, <TAG:GFP>, <TAG:GST>, <TAG:HA>, <TAG:HIS>, <TAG:MBP>, <TAG:MCHERRY>, <TAG:MYC>, <TAG:NLS>, <TAG:SNAP>
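Condition tokens are concatenated in front of the <SEQ> token to form a generation prompt, as in the Usage section below. A small illustrative helper for composing prompts (the helper itself is not part of the model's API; the <SEQ> terminator convention is taken from the Usage example):

```python
# Illustrative helper: join chosen condition tokens into a generation prompt.
# Token spellings must match the vocabulary exactly, e.g. <HOST:ECOLI>.
def build_prompt(**conditions):
    order = ["HOST", "RESISTANCE", "LENGTH", "GC", "APPLICATION",
             "COPY", "PROMOTER", "VECTOR", "TAG"]
    parts = [f"<{cat}:{conditions[cat.lower()].upper()}>"
             for cat in order if cat.lower() in conditions]
    return "".join(parts) + "<SEQ>"

prompt = build_prompt(host="ecoli", resistance="amp", length="medium", gc="medium")
print(prompt)  # <HOST:ECOLI><RESISTANCE:AMP><LENGTH:MEDIUM><GC:MEDIUM><SEQ>
```

Any subset of categories may be supplied; omitted conditions are left unconstrained.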

Usage

from transformers import GPT2LMHeadModel
import json
import re
import torch

# Load model
model = GPT2LMHeadModel.from_pretrained("mcclain/plasmid-gpt-244m")
model.eval()

# Load vocab for encoding
with open("vocab.json") as f:
    vocab = json.load(f)
id_to_token = {v: k for k, v in vocab.items()}

# Encode a prompt: condition/special tokens (e.g. <HOST:ECOLI>) map to single
# vocabulary entries; all remaining text is encoded character by character
def encode(text):
    tokens = [vocab["<BOS>"]]
    pos = 0
    for match in re.finditer(r'<[A-Z_]+:[A-Z_]+>|<[A-Z]+>', text):
        for char in text[pos:match.start()]:
            tokens.append(vocab.get(char, vocab["<UNK>"]))
        tokens.append(vocab.get(match.group(), vocab["<UNK>"]))
        pos = match.end()
    for char in text[pos:]:
        tokens.append(vocab.get(char, vocab["<UNK>"]))
    return tokens

def decode(token_ids):
    return "".join(id_to_token.get(t, "<UNK>") for t in token_ids)

# Generate
prompt = "<HOST:ECOLI><RESISTANCE:AMP><LENGTH:MEDIUM><GC:MEDIUM><SEQ>"
input_ids = torch.tensor([encode(prompt)])

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=500,
        do_sample=True,
        temperature=0.85,
        top_k=50,
        repetition_penalty=1.15,
        pad_token_id=0,
        eos_token_id=2,
    )

generated = decode(output[0, input_ids.shape[1]:].tolist())
print(generated.replace("<EOS>", ""))
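A quick sanity check on conditioning is to compute the GC fraction of a generated sequence and compare it against the requested GC band. Note the actual bin boundaries used to assign <GC:LOW>/<GC:MEDIUM>/<GC:HIGH> during training are not documented on this card, so any threshold you check against is an assumption:

```python
# Fraction of G/C nucleotides in a sequence (example string is hypothetical)
def gc_fraction(seq):
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

print(f"{gc_fraction('ATGCGCGATTACAGGC'):.2f}")  # 0.56
```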

Related Models

Citation

If you use this model, please cite:

@misc{plasmidgpt2024,
  title={PlasmidGPT: Conditional DNA Language Model for Plasmid Sequence Generation},
  author={McClain},
  year={2024},
  publisher={HuggingFace}
}

License

MIT License
