# PlasmidGPT Model
This is a GPT-2-based model for engineered plasmid sequence generation, converted from PyTorch `.pt` format to the HuggingFace Transformers format. It is a supervised fine-tuned (SFT) version of [PlasmidGPT](https://github.com/lingxusb/PlasmidGPT) for engineered plasmids. This work was done by **Angus Cunningham** while at **Prof. Chris Barnes' lab at UCL**.
## Model Details
- **Architecture**: GPT-2
- **Vocab Size**: 30,002
- **Hidden Size**: 768
- **Number of Layers**: 12
- **Number of Heads**: 12
- **Max Position Embeddings**: 2048
- **Parameters**: ~124M
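
For reference, the sketch below reconstructs this configuration with `transformers.GPT2Config`; the field values are taken from the list above, and instantiating an untrained model from them gives a quick parameter-count sanity check (the shipped `config.json` remains the authoritative source).

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Configuration reconstructed from the values listed above
# (assumed; check the shipped config.json for the exact fields).
config = GPT2Config(
    vocab_size=30002,
    n_positions=2048,
    n_embd=768,
    n_layer=12,
    n_head=12,
)

# Instantiate an untrained model from this config and count its
# parameters as a sanity check against the figure quoted above.
model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```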
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the converted checkpoint (here from a local directory)
model = AutoModelForCausalLM.from_pretrained("./plasmidgpt-model")
tokenizer = AutoTokenizer.from_pretrained("./plasmidgpt-model")

# Basic (greedy) generation from a short DNA prompt
inputs = tokenizer("ATGC", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_sequence)

# With sampling (for more diverse outputs)
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.8, top_p=0.9)
generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_sequence)
```
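
Continuing from the snippet above, `generate` can also return several candidates per prompt via the standard `num_return_sequences` argument; the sampling parameters here are illustrative, not tuned recommendations.

```python
# Draw several candidate sequences from the same prompt in one call
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    num_return_sequences=5,
)
for i, seq in enumerate(outputs):
    print(f"Candidate {i}: {tokenizer.decode(seq, skip_special_tokens=True)}")
```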
### Example Outputs
**Input:** `ATGCGATCG`
**Generated:** `ATGCGATCGGTGGTAGGCACTGGATGATGGCCCTGCAGTGTAGCCGTAGTTATGAGCCTCGGGATTCTTTGATGATTCAGCCACCCTCATCATCCTCCTCCTCC...`
**Input:** `ATGGCC`
**Generated:** `ATGGCCTACATACCTTCAATTACCGAAACAAGGTGGTTCATCTCTAACGCTGTCCATAAAACCGCCCAGTCTAGCTATCGCCATTTGCGCATCTAACGTGGTAGGCACTCCGGGTCCGCGCC...`
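
Generated text is not guaranteed to be a clean nucleotide string, so a light post-processing pass can help. The helper below is a generic sketch (not part of the PlasmidGPT tooling) that validates the alphabet and reports the length and GC content of `generated_sequence` from the usage snippet.

```python
def check_sequence(seq: str) -> dict:
    """Validate a generated sequence and report basic stats (illustrative helper)."""
    seq = seq.strip().upper()
    valid = set(seq) <= set("ACGT")
    gc = (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
    return {"length": len(seq), "valid_dna": valid, "gc_content": round(gc, 3)}

print(check_sequence(generated_sequence))
```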
## Compatible With
This model shares its architecture with [McClain/plasmidgpt-addgene-gpt2](https://huggingface.co/McClain/plasmidgpt-addgene-gpt2); the weights differ, since this checkpoint is the supervised fine-tuned version rather than the pretrained model.
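
One way to confirm the architectural match is to compare the core config fields of the two checkpoints; a minimal sketch, assuming the local path used above and network access to the other repo:

```python
from transformers import AutoConfig

# Load both configurations (local path is an assumption; adjust as needed)
cfg_sft = AutoConfig.from_pretrained("./plasmidgpt-model")
cfg_base = AutoConfig.from_pretrained("McClain/plasmidgpt-addgene-gpt2")

# The architecture fields should agree even though the weights differ
for field in ("vocab_size", "n_positions", "n_embd", "n_layer", "n_head"):
    print(field, getattr(cfg_sft, field), getattr(cfg_base, field))
```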
## Files
- `config.json`: Model configuration
- `generation_config.json`: Generation parameters
- `model.safetensors`: Model weights in SafeTensors format
- `tokenizer.json`: Fast tokenizer data
- `tokenizer_config.json`: Tokenizer configuration
- `special_tokens_map.json`: Special token mappings
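
As a quick sanity check, these files can be inspected directly; a minimal sketch, assuming the repo has been cloned and the files sit in the current directory:

```python
import json

from transformers import AutoTokenizer

# Read the model configuration straight from config.json
# (path assumed: current directory after cloning this repo)
with open("config.json") as f:
    config = json.load(f)
print(config["vocab_size"], config["n_positions"], config["n_layer"])

# Rebuild the fast tokenizer from tokenizer.json and its config files
tokenizer = AutoTokenizer.from_pretrained(".")
print(len(tokenizer))  # vocabulary size, including any added special tokens
```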