---
library_name: transformers
tags: []
---

### Model Description

This repository ships the CodonGPT model checkpoint together with its custom codon-level tokenizer and custom synonymous logits processor, so you can reproduce the constrained-generation workflow straight from the model card. The model was pretrained on Ensembl CDS sequences with a GPT-2-style decoder, learns synonymous structure and CAI/GC biases, and is optimized for codon-aware sequence design. After pulling the snapshot, load the tokenizer and processor from the repo files to enable synonym-aware decoding that encourages biologically equivalent alternatives while preserving sequence-level realism.

- **Developed by:** Nanil Therapeutics Inc.
- **Model type:** Transformer-based generative language model for protein-coding DNA/mRNA sequences
- **License:** Free for research use


# CodonGPT Quickstart Guide

## Overview

CodonGPT is a transformer-based generative language model specifically designed for protein-coding DNA/mRNA sequences. Developed by Nanil Therapeutics Inc., it generates codon-level sequences with biological awareness and synonymous structure understanding.

## Key Features

- **Codon-aware sequence design**: Trained on Ensembl CDS sequences with GPT-2 architecture
- **Synonymous structure learning**: Understands CAI/GC biases and genetic patterns
- **Custom tokenizer**: Processes sequences at the codon level (3-nucleotide chunks)
- **SynonymousLogitProcessor**: Enables biologically equivalent alternative generation
- **Research license**: Free for research use

## Installation

```bash
# Install dependencies - Note: torch 2.6+ required for security reasons
pip install torch==2.6.0 transformers biopython huggingface_hub
```

**Download custom components**: Since CodonGPT uses a custom tokenizer and logits processor, you need to download these files from the repo:

```python
from huggingface_hub import hf_hub_download

# Download custom tokenizer and processor
hf_hub_download(repo_id="naniltx/codonGPT", filename="tokenizer.py", local_dir="./")
hf_hub_download(repo_id="naniltx/codonGPT", filename="synonymous_logit_processor.py", local_dir="./")
```

**Alternative**: Download manually from https://huggingface.co/naniltx/codonGPT

## Quick Start

### 1. Load the Model and Components

```python
import torch
from transformers import GPT2LMHeadModel

# Import custom components (downloaded above)
from tokenizer import CodonTokenizer
from synonymous_logit_processor import SynonymMaskingLogitsProcessor

# Load model directly from Hugging Face
model = GPT2LMHeadModel.from_pretrained("naniltx/codonGPT")
model.eval()

# Load custom tokenizer
tokenizer = CodonTokenizer()
```

### 2. Basic Sequence Generation

```python
# Example: Generate codon sequence
input_sequence = "ATGAAACCC"  # Sample DNA sequence (must be multiple of 3)

# Tokenize input (codon-level tokenization)
input_codons = [input_sequence[i:i+3] for i in range(0, len(input_sequence), 3)]
input_tokens = [tokenizer.bos_token_id] + tokenizer.convert_tokens_to_ids(input_codons)
input_tensor = torch.tensor([input_tokens])

# Generate with the model
with torch.no_grad():
    outputs = model.generate(
        input_tensor,
        max_length=input_tensor.size(1) + 10,  # Generate 10 more codons
        temperature=1.0,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

# Decode results
generated_tokens = outputs[0][input_tensor.size(1):].tolist()  # Remove input part
generated_codons = [tokenizer.decode([token_id]) for token_id in generated_tokens 
                   if token_id not in [tokenizer.pad_token_id, tokenizer.eos_token_id]]
generated_sequence = ''.join(generated_codons)
print(f"Input sequence: {input_sequence}")
print(f"Generated sequence: {generated_sequence}")
```

### 3. Synonym-Aware Generation

```python
from synonymous_logit_processor import generate_candidate_codons_with_generate
from Bio.Seq import Seq

# Generate synonymous alternatives for a sequence
# The function includes the human genetic code by default
initial_codons = ["ATG", "AAA", "CCC"]  # Example codons

# Generate optimized codons with synonym-aware decoding
optimized_codons = generate_candidate_codons_with_generate(
    initial_codons,
    model=model,
    tokenizer=tokenizer,
    temperature=1.0,
    top_k=50,
    top_p=0.9
)

print(f"Original: {initial_codons}")
print(f"Optimized: {optimized_codons}")

# Verify amino acid sequences are preserved
original_aa = ''.join([str(Seq(codon).translate()) for codon in initial_codons])
optimized_aa = ''.join([str(Seq(codon).translate()) for codon in optimized_codons])
print(f"Original AA: {original_aa}")
print(f"Optimized AA: {optimized_aa}")
print(f"AA preserved: {original_aa == optimized_aa}")
```

#### Using Custom Genetic Code

```python
# If you need a custom genetic code mapping
custom_aa_to_codon = {
    'M': ['ATG'], 'K': ['AAA'], 'P': ['CCC']  # Simplified example
    # ... add your custom mappings
}

optimized_codons_custom = generate_candidate_codons_with_generate(
    initial_codons,
    model=model,
    tokenizer=tokenizer,
    aa_to_codon=custom_aa_to_codon,
    temperature=1.0
)
```

### 4. Advanced Usage with Custom Constraints

```python
# Custom generation with specific amino acid constraints
def generate_with_aa_constraint(target_aa_sequence, model, tokenizer, aa_to_codon=None):
    """Generate codon sequence for specific amino acid sequence"""
    from synonymous_logit_processor import SynonymMaskingLogitsProcessor, aa_to_codon_human
    
    if aa_to_codon is None:
        aa_to_codon = aa_to_codon_human
    
    generated_codons = []
    current_tokens = [tokenizer.bos_token_id]
    
    for aa in target_aa_sequence:
        # Create processor for current amino acid
        processor = SynonymMaskingLogitsProcessor(aa, tokenizer, aa_to_codon)
        
        # Generate next codon
        input_ids = torch.tensor([current_tokens])
        output = model.generate(
            input_ids,
            max_length=len(current_tokens) + 1,
            logits_processor=[processor],
            do_sample=True,
            temperature=1.0,
            pad_token_id=tokenizer.pad_token_id
        )
        
        # Extract and store codon
        next_token = output[0][-1].item()
        codon = tokenizer.decode([next_token])
        generated_codons.append(codon)
        current_tokens.append(next_token)
    
    return generated_codons

# Example usage
aa_sequence = "MKP"  # Methionine-Lysine-Proline
codons = generate_with_aa_constraint(aa_sequence, model, tokenizer)
print(f"AA sequence: {aa_sequence}")
print(f"Generated codons: {codons}")
print(f"DNA sequence: {''.join(codons)}")

# Verify the translation
from Bio.Seq import Seq
generated_dna = ''.join(codons)
translated_aa = str(Seq(generated_dna).translate())
print(f"Verification - translated AA: {translated_aa}")
print(f"Match: {aa_sequence == translated_aa}")
```

## Model Architecture

- **Base**: GPT-2 decoder architecture
- **Vocabulary**: 67 tokens (64 codons + 3 special tokens: [PAD], [BOS], [EOS])
- **Tokenization**: Codon-level (3 nucleotides per token)
- **Training**: Pretrained on Ensembl CDS sequences
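
The 67-token vocabulary described above can be sketched directly: the 64 codons are simply the ordered 3-letter combinations of the four nucleotides, plus the three special tokens. This is an illustrative reconstruction, not the repo's actual tokenizer code; the token ordering and index assignments here are assumptions.

```python
from itertools import product

# Enumerate all 64 codons: ordered 3-letter combinations of A, C, G, T
codons = ["".join(c) for c in product("ACGT", repeat=3)]

# Prepend the three special tokens to form the full 67-token vocabulary
special_tokens = ["[PAD]", "[BOS]", "[EOS]"]
vocab = {tok: i for i, tok in enumerate(special_tokens + codons)}

assert len(vocab) == 67
print(vocab["AAA"], vocab["TTT"])  # 3 66
```

The actual `CodonTokenizer` shipped with the model may order tokens differently; always load it from the repo files rather than rebuilding the vocabulary yourself.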

## Use Cases

1. **Codon optimization**: Generate alternative codon sequences with preserved amino acid sequence
2. **Sequence design**: Create biologically realistic DNA/mRNA sequences
3. **Synthetic biology**: Design sequences with specific CAI/GC content properties
4. **Research**: Study codon usage patterns and genetic biases
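
When evaluating designed sequences against GC-content targets (use case 3), the GC fraction is straightforward to compute. The helper below is a minimal illustration, not a function provided by the repo:

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C nucleotides in a DNA sequence."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# GC fraction of the quickstart example sequence
print(f"{gc_content('ATGAAACCC'):.2f}")  # 0.44
```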

## Important Notes

- Input sequences must be multiples of 3 nucleotides (complete codons)
- Model generates at codon-level granularity
- Custom tokenizer and processor are essential for proper functionality
- Model is optimized for research use cases
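
Since inputs must be complete codons over the DNA alphabet, a small validation helper can catch malformed sequences before tokenization. This is a hypothetical convenience function, not part of the repo:

```python
def validate_cds(sequence: str) -> list[str]:
    """Check that a DNA string consists of complete codons and split it into them."""
    seq = sequence.upper()
    if len(seq) % 3 != 0:
        raise ValueError(f"Length {len(seq)} is not a multiple of 3")
    bad = set(seq) - set("ACGT")
    if bad:
        raise ValueError(f"Invalid nucleotides: {sorted(bad)}")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

print(validate_cds("ATGAAACCC"))  # ['ATG', 'AAA', 'CCC']
```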

## Files Structure

```
codonGPT/
├── config.json              # Model configuration
├── generation_config.json   # Generation parameters
├── pytorch_model.bin        # Model weights
├── tokenizer.py             # Custom codon tokenizer
└── synonymous_logit_processor.py  # Synonym-aware processor
```

## Citation

If you use CodonGPT in your research, please cite:

```bibtex
@article{rajbanshi2025codongpt,
  title={codonGPT: Reinforcement learning on a generative language model optimizes RNA sequences under biological constraints},
  author={Rajbanshi, Binita and Guruacharya, Anuj},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.06.25.661500},
  url={https://doi.org/10.1101/2025.06.25.661500}
}
```

## License

Free for research use. For commercial applications, please contact Nanil Therapeutics Inc.

## Support

For questions and issues, please refer to the Hugging Face model page or contact the developers.