# PlasmidGPT Model

This is a GPT-2-based model for engineered plasmid sequence generation, converted from PyTorch `.pt` format to the HuggingFace Transformers format. It is a supervised fine-tuned (SFT) version of [PlasmidGPT](https://github.com/lingxusb/PlasmidGPT) for engineered plasmids. This work was done by **Angus Cunningham** while at **Prof. Chris Barnes' lab at UCL**.
## Model Details

- **Architecture**: GPT-2
- **Vocab Size**: 30,002
- **Hidden Size**: 768
- **Number of Layers**: 12
- **Number of Heads**: 12
- **Max Position Embeddings**: 2048
- **Parameters**: ~124M
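The details above map onto a GPT-2 `config.json` along these lines. This is a sketch for orientation only: the field names follow the HuggingFace GPT-2 convention, and the `config.json` shipped with the model is authoritative.

```json
{
  "model_type": "gpt2",
  "vocab_size": 30002,
  "n_embd": 768,
  "n_layer": 12,
  "n_head": 12,
  "n_positions": 2048
}
```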
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./plasmidgpt-model")
tokenizer = AutoTokenizer.from_pretrained("./plasmidgpt-model")

# Basic (greedy) generation
inputs = tokenizer("ATGC", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_sequence)

# With sampling (for more diverse outputs)
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.8, top_p=0.9)
generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_sequence)
```
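Decoded outputs should be plain nucleotide strings. A minimal plain-Python sanity check can catch unexpected characters; the `is_valid_dna` helper here is illustrative and not part of the model's API:

```python
# Check that a decoded sequence contains only the four DNA bases.
# `is_valid_dna` is a hypothetical helper, shown for illustration.
VALID_BASES = set("ATGC")

def is_valid_dna(seq: str) -> bool:
    """Return True if seq is non-empty and contains only A, T, G, or C."""
    return bool(seq) and set(seq) <= VALID_BASES

print(is_valid_dna("ATGCGATCG"))  # True: only A/T/G/C
print(is_valid_dna("ATGN"))       # False: ambiguity codes like N fail the strict check
```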
### Example Outputs

**Input:** `ATGCGATCG`

**Generated:** `ATGCGATCGGTGGTAGGCACTGGATGATGGCCCTGCAGTGTAGCCGTAGTTATGAGCCTCGGGATTCTTTGATGATTCAGCCACCCTCATCATCCTCCTCCTCC...`

**Input:** `ATGGCC`

**Generated:** `ATGGCCTACATACCTTCAATTACCGAAACAAGGTGGTTCATCTCTAACGCTGTCCATAAAACCGCCCAGTCTAGCTATCGCCATTTGCGCATCTAACGTGGTAGGCACTCCGGGTCCGCGCC...`
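When working with generated plasmid sequences downstream, the reverse complement of the opposite strand is often needed. This is a standard plain-Python utility, not part of this model or repository; it is included only as a convenience sketch:

```python
# Reverse-complement a DNA sequence (standard utility, not model-specific).
_COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/T/G/C sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

print(reverse_complement("ATGGCC"))  # GGCCAT
```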
## Compatible With

This model is architecturally compatible with [McClain/plasmidgpt-addgene-gpt2](https://huggingface.co/McClain/plasmidgpt-addgene-gpt2), but its weights differ: this checkpoint is the supervised fine-tuned model, not the pretrained one.
## Files

- `config.json`: Model configuration
- `generation_config.json`: Generation parameters
- `model.safetensors`: Model weights in SafeTensors format
- `tokenizer.json`: Fast tokenizer data
- `tokenizer_config.json`: Tokenizer configuration
- `special_tokens_map.json`: Special token mappings