---
library_name: transformers
tags: []
---
### Model Description
This repository ships the CodonGPT model checkpoint together with its custom codon-level tokenizer and SynonymousLogitProcessor, so you can reproduce the constrained-generation workflow straight from the model card. The model was pretrained on Ensembl CDS sequences with a GPT-2-style decoder, learns synonymous structure and CAI/GC biases, and is optimized for codon-aware sequence design. After pulling the snapshot, load the tokenizer and processor from the repo files to enable synonym-aware decoding that encourages biologically equivalent alternatives while preserving sequence-level realism.
- **Developed by:** Nanil Therapeutics Inc.
- **Model type:** Transformer-based generative language model for protein-coding DNA/mRNA sequences
- **License:** Free for research use
# CodonGPT Quickstart Guide
## Overview
CodonGPT is a transformer-based generative language model specifically designed for protein-coding DNA/mRNA sequences. Developed by Nanil Therapeutics Inc., it generates codon-level sequences with biological awareness and synonymous structure understanding.
## Key Features
- **Codon-aware sequence design**: Trained on Ensembl CDS sequences with GPT-2 architecture
- **Synonymous structure learning**: Understands CAI/GC biases and genetic patterns
- **Custom tokenizer**: Processes sequences at the codon level (3-nucleotide chunks)
- **SynonymousLogitProcessor**: Enables biologically equivalent alternative generation
- **Research license**: Free for research use
## Installation
```bash
# Install dependencies - Note: torch 2.6+ required for security reasons
pip install torch==2.6.0 transformers biopython huggingface_hub
```
**Download custom components**: Since CodonGPT uses a custom tokenizer and logits processor, you need to download these files:
```python
from huggingface_hub import hf_hub_download
# Download custom tokenizer and processor
hf_hub_download(repo_id="naniltx/codonGPT", filename="tokenizer.py", local_dir="./")
hf_hub_download(repo_id="naniltx/codonGPT", filename="synonymous_logit_processor.py", local_dir="./")
```
**Alternative**: Download manually from https://huggingface.co/naniltx/codonGPT
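If you prefer a single call, `snapshot_download` from `huggingface_hub` can pull the whole repository at once; this is a minimal sketch using the same repo ID as above:
```python
from huggingface_hub import snapshot_download

# Fetch the full repository (weights, tokenizer.py, synonymous_logit_processor.py)
# into the current directory.
snapshot_download(repo_id="naniltx/codonGPT", local_dir="./")
```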
## Quick Start
### 1. Load the Model and Components
```python
import torch
from transformers import GPT2LMHeadModel
# Import custom components (downloaded above)
from tokenizer import CodonTokenizer
from synonymous_logit_processor import SynonymMaskingLogitsProcessor
# Load model directly from Hugging Face
model = GPT2LMHeadModel.from_pretrained("naniltx/codonGPT")
model.eval()
# Load custom tokenizer
tokenizer = CodonTokenizer()
```
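Optionally, if a GPU is available, you can move the model to it. This is standard PyTorch usage rather than anything CodonGPT-specific; the input tensors in the examples below would then need to be moved to the same `device`:
```python
# Optional: standard PyTorch device placement (assumes a CUDA GPU may be present)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# e.g. later: input_tensor = input_tensor.to(device)
```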
### 2. Basic Sequence Generation
```python
# Example: Generate codon sequence
input_sequence = "ATGAAACCC"  # Sample DNA sequence (must be multiple of 3)

# Tokenize input (codon-level tokenization)
input_codons = [input_sequence[i:i+3] for i in range(0, len(input_sequence), 3)]
input_tokens = [tokenizer.bos_token_id] + tokenizer.convert_tokens_to_ids(input_codons)
input_tensor = torch.tensor([input_tokens])

# Generate with the model
with torch.no_grad():
    outputs = model.generate(
        input_tensor,
        max_length=input_tensor.size(1) + 10,  # Generate 10 more codons
        temperature=1.0,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

# Decode results
generated_tokens = outputs[0][input_tensor.size(1):].tolist()  # Remove input part
generated_codons = [tokenizer.decode([token_id]) for token_id in generated_tokens
                    if token_id not in [tokenizer.pad_token_id, tokenizer.eos_token_id]]
generated_sequence = ''.join(generated_codons)

print(f"Input sequence: {input_sequence}")
print(f"Generated sequence: {generated_sequence}")
```
### 3. Synonym-Aware Generation
```python
from synonymous_logit_processor import generate_candidate_codons_with_generate
from Bio.Seq import Seq
# Generate synonymous alternatives for a sequence
# The function includes the human genetic code by default
initial_codons = ["ATG", "AAA", "CCC"] # Example codons
# Generate optimized codons with synonym-aware decoding
optimized_codons = generate_candidate_codons_with_generate(
    initial_codons,
    model=model,
    tokenizer=tokenizer,
    temperature=1.0,
    top_k=50,
    top_p=0.9
)
print(f"Original: {initial_codons}")
print(f"Optimized: {optimized_codons}")
# Verify amino acid sequences are preserved
original_aa = ''.join([str(Seq(codon).translate()) for codon in initial_codons])
optimized_aa = ''.join([str(Seq(codon).translate()) for codon in optimized_codons])
print(f"Original AA: {original_aa}")
print(f"Optimized AA: {optimized_aa}")
print(f"AA preserved: {original_aa == optimized_aa}")
```
#### Using Custom Genetic Code
```python
# If you need a custom genetic code mapping
custom_aa_to_codon = {
    'M': ['ATG'], 'K': ['AAA'], 'P': ['CCC']  # Simplified example
    # ... add your custom mappings
}

optimized_codons_custom = generate_candidate_codons_with_generate(
    initial_codons,
    model=model,
    tokenizer=tokenizer,
    aa_to_codon=custom_aa_to_codon,
    temperature=1.0
)
```
### 4. Advanced Usage with Custom Constraints
```python
# Custom generation with specific amino acid constraints
def generate_with_aa_constraint(target_aa_sequence, model, tokenizer, aa_to_codon=None):
    """Generate codon sequence for a specific amino acid sequence"""
    from synonymous_logit_processor import SynonymMaskingLogitsProcessor, aa_to_codon_human

    if aa_to_codon is None:
        aa_to_codon = aa_to_codon_human

    generated_codons = []
    current_tokens = [tokenizer.bos_token_id]

    for aa in target_aa_sequence:
        # Create processor for current amino acid
        processor = SynonymMaskingLogitsProcessor(aa, tokenizer, aa_to_codon)

        # Generate next codon
        input_ids = torch.tensor([current_tokens])
        output = model.generate(
            input_ids,
            max_length=len(current_tokens) + 1,
            logits_processor=[processor],
            do_sample=True,
            temperature=1.0,
            pad_token_id=tokenizer.pad_token_id
        )

        # Extract and store codon
        next_token = output[0][-1].item()
        codon = tokenizer.decode([next_token])
        generated_codons.append(codon)
        current_tokens.append(next_token)

    return generated_codons

# Example usage
aa_sequence = "MKP"  # Methionine-Lysine-Proline
codons = generate_with_aa_constraint(aa_sequence, model, tokenizer)
print(f"AA sequence: {aa_sequence}")
print(f"Generated codons: {codons}")
print(f"DNA sequence: {''.join(codons)}")

# Verify the translation
from Bio.Seq import Seq
generated_dna = ''.join(codons)
translated_aa = str(Seq(generated_dna).translate())
print(f"Verification - translated AA: {translated_aa}")
print(f"Match: {aa_sequence == translated_aa}")
```
## Model Architecture
- **Base**: GPT-2 decoder architecture
- **Vocabulary**: 67 tokens (64 codons + 3 special tokens: [PAD], [BOS], [EOS]); a quick sanity check is sketched after this list
- **Tokenization**: Codon-level (3 nucleotides per token)
- **Training**: Pretrained on Ensembl CDS sequences
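As a quick sanity check of the vocabulary size (a standalone sketch that does not depend on the custom tokenizer):
```python
from itertools import product

# 4 nucleotides taken as triplets -> 4**3 = 64 codons
codons = [''.join(p) for p in product("ACGT", repeat=3)]
special_tokens = ["[PAD]", "[BOS]", "[EOS]"]
print(len(codons))                        # 64
print(len(codons) + len(special_tokens))  # 67 tokens in total
```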
## Use Cases
1. **Codon optimization**: Generate alternative codon sequences with preserved amino acid sequence
2. **Sequence design**: Create biologically realistic DNA/mRNA sequences
3. **Synthetic biology**: Design sequences with specific CAI/GC content properties (a GC-content check is sketched after this list)
4. **Research**: Study codon usage patterns and genetic biases
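For the GC-content side of use case 3, a plain-Python check like the following sketch can be run on generated sequences; the example sequence is an arbitrary illustration:
```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G/C nucleotides in a DNA sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

generated = "ATGAAACCCGGG"  # e.g. a sequence produced by CodonGPT
print(f"GC content: {gc_content(generated):.2%}")
```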
## Important Notes
- Input sequences must be multiples of 3 nucleotides (complete codons); a validation sketch follows this list
- Model generates at codon-level granularity
- Custom tokenizer and processor are essential for proper functionality
- Model is optimized for research use cases
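A minimal validation sketch for the complete-codon requirement (the helper name and error messages are illustrative, not part of the repo):
```python
def validate_cds(sequence: str) -> list[str]:
    """Check that a sequence is complete A/C/G/T codons and return the codon list."""
    sequence = sequence.upper()
    if len(sequence) % 3 != 0:
        raise ValueError(f"Length {len(sequence)} is not a multiple of 3 (incomplete codons)")
    if set(sequence) - set("ACGT"):
        raise ValueError("Sequence contains non-ACGT characters")
    return [sequence[i:i+3] for i in range(0, len(sequence), 3)]

print(validate_cds("ATGAAACCC"))  # ['ATG', 'AAA', 'CCC']
```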
## Files Structure
```
codonGPT/
├── config.json                     # Model configuration
├── generation_config.json          # Generation parameters
├── pytorch_model.bin               # Model weights
├── tokenizer.py                    # Custom codon tokenizer
└── synonymous_logit_processor.py   # Synonym-aware processor
```
## Citation
If you use CodonGPT in your research, please cite:
```bibtex
@article{rajbanshi2025codongpt,
title={codonGPT: Reinforcement learning on a generative language model optimizes RNA sequences under biological constraints},
author={Rajbanshi, Binita and Guruacharya, Anuj},
journal={bioRxiv},
year={2025},
doi={10.1101/2025.06.25.661500},
url={https://doi.org/10.1101/2025.06.25.661500}
}
```
## License
Free for research use. For commercial applications, please contact Nanil Therapeutics Inc.
## Support
For questions and issues, please refer to the Hugging Face model page or contact the developers. |