
PlasmidGPT Model

This is a GPT-2 based model for engineered plasmid sequence generation, converted from the PyTorch .pt format to the Hugging Face transformers format.

It is a supervised fine-tuned (SFT) version of PlasmidGPT for engineered plasmids. The fine-tuning was done by Angus Cunningham while in Prof. Chris Barnes' lab at UCL.

Model Details

  • Architecture: GPT-2
  • Vocab Size: 30,002
  • Hidden Size: 768
  • Number of Layers: 12
  • Number of Heads: 12
  • Max Position Embeddings: 2048
  • Parameters: ~124M
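
The configuration above can be checked programmatically. A minimal sketch, assuming the local model path used in the Usage section below:

from transformers import AutoConfig

# Load the model configuration (path is an assumption; adjust to your checkout)
config = AutoConfig.from_pretrained("./plasmidgpt-model")

print(config.vocab_size)   # 30002
print(config.n_embd)       # 768
print(config.n_layer)      # 12
print(config.n_head)       # 12
print(config.n_positions)  # 2048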

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from a local directory
# (replace the path with the Hub repo id if loading remotely)
model = AutoModelForCausalLM.from_pretrained("./plasmidgpt-model")
tokenizer = AutoTokenizer.from_pretrained("./plasmidgpt-model")

# Basic (greedy) generation from a DNA prompt
inputs = tokenizer("ATGC", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_sequence)

# With sampling (for more diverse outputs)
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.8, top_p=0.9)
generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_sequence)
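
For reproducible sampling or several candidates per prompt, the standard transformers generation arguments apply. A minimal sketch, reusing the model, tokenizer, and inputs from above:

import torch

torch.manual_seed(42)  # fix the seed so sampled outputs are reproducible

# Draw several candidate sequences from the same prompt
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    num_return_sequences=5,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))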

Example Outputs

Input: ATGCGATCG
Generated: ATGCGATCGGTGGTAGGCACTGGATGATGGCCCTGCAGTGTAGCCGTAGTTATGAGCCTCGGGATTCTTTGATGATTCAGCCACCCTCATCATCCTCCTCCTCC...

Input: ATGGCC
Generated: ATGGCCTACATACCTTCAATTACCGAAACAAGGTGGTTCATCTCTAACGCTGTCCATAAAACCGCCCAGTCTAGCTATCGCCATTTGCGCATCTAACGTGGTAGGCACTCCGGGTCCGCGCC...
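
Decoded outputs should be plain nucleotide strings, so a simple downstream check can catch anything else. A minimal sketch (the is_dna helper is illustrative, not part of the model):

import re

def is_dna(seq: str) -> bool:
    # Accept only the four canonical nucleotides A, C, G, T
    return bool(re.fullmatch(r"[ACGT]+", seq))

print(is_dna("ATGCGATCG"))  # True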

Compatible With

This model is architecture-compatible with McClain/plasmidgpt-addgene-gpt2, but its weights differ from that pretrained model.
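
One way to verify the architectural match is to compare the two configurations directly. A sketch, assuming both checkpoints are reachable (locally or on the Hub):

from transformers import AutoConfig

cfg_sft = AutoConfig.from_pretrained("./plasmidgpt-model")
cfg_base = AutoConfig.from_pretrained("McClain/plasmidgpt-addgene-gpt2")

# The architectures match if the structural hyperparameters agree
keys = ["vocab_size", "n_embd", "n_layer", "n_head", "n_positions"]
print(all(getattr(cfg_sft, k) == getattr(cfg_base, k) for k in keys))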

Files

  • config.json: Model configuration
  • generation_config.json: Generation parameters
  • model.safetensors: Model weights in SafeTensors format
  • tokenizer.json: Fast tokenizer data
  • tokenizer_config.json: Tokenizer configuration
  • special_tokens_map.json: Special token mappings
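
The weights can also be inspected without instantiating the model. A minimal sketch using the safetensors library, assuming the same local path as above:

from safetensors import safe_open

# Walk the tensor names and shapes stored in the weights file
with safe_open("./plasmidgpt-model/model.safetensors", framework="pt") as f:
    for name in f.keys():
        print(name, f.get_tensor(name).shape)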