|
|
--- |
|
|
library_name: transformers |
|
|
tags: [] |
|
|
--- |
|
|
|
|
|
### Model Description |
|
|
|
|
|
This repository ships the CodonGPT model checkpoint together with its custom codon-level tokenizer and custom SynonymousLogitProcessor, so you can reproduce the constrained generation workflow straight from the model card. The model was pretrained on Ensembl CDS sequences with a GPT-2-style decoder, learns synonymous structure and CAI/GC biases, and is optimized for codon-aware sequence design. After pulling the snapshot, load the tokenizer and processor from the repo files to enable synonym-aware decoding that encourages biologically equivalent alternatives while preserving sequence-level realism.
|
|
|
|
|
- **Developed by:** Nanil Therapeutics Inc. |
|
|
- **Model type:** Transformer-based generative language model for protein-coding DNA/mRNA sequences |
|
|
- **License:** Free for research use |
|
|
|
|
|
|
|
|
# CodonGPT Quickstart Guide |
|
|
|
|
|
## Overview |
|
|
|
|
|
CodonGPT is a transformer-based generative language model designed specifically for protein-coding DNA/mRNA sequences. Developed by Nanil Therapeutics Inc., it generates sequences at codon-level granularity and captures synonymous codon structure together with CAI/GC usage biases.
|
|
|
|
|
## Key Features |
|
|
|
|
|
- **Codon-aware sequence design**: Trained on Ensembl CDS sequences with GPT-2 architecture |
|
|
- **Synonymous structure learning**: Understands CAI/GC biases and genetic patterns |
|
|
- **Custom tokenizer**: Processes sequences at the codon level (3-nucleotide chunks) |
|
|
- **SynonymousLogitProcessor**: Enables biologically equivalent alternative generation |
|
|
- **Research license**: Free for research use |
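Under the hood, synonym-aware decoding works by masking the logits of every codon that would change the encoded amino acid. The repo ships its own `SynonymMaskingLogitsProcessor`; the toy processor below is only a conceptual sketch of that masking idea, not the repository's implementation.

```python
import torch
from transformers import LogitsProcessor

class ToySynonymMask(LogitsProcessor):
    """Conceptual sketch only: keep the scores of a whitelist of codon token ids
    (e.g. the synonymous codons of the current amino acid) and block everything else.
    This is NOT the repo's SynonymMaskingLogitsProcessor."""

    def __init__(self, allowed_token_ids):
        self.allowed = torch.tensor(sorted(allowed_token_ids))

    def __call__(self, input_ids, scores):
        # Set every logit to -inf, then restore the allowed codon columns
        masked = torch.full_like(scores, float("-inf"))
        masked[:, self.allowed] = scores[:, self.allowed]
        return masked
```

The repo's processor builds the allowed set per amino acid from a genetic-code table (the human code by default), which is what keeps generated alternatives biologically equivalent.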
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
# Install dependencies - Note: torch 2.6+ required for security reasons |
|
|
pip install torch==2.6.0 transformers biopython huggingface_hub |
|
|
``` |
|
|
|
|
|
**Download custom components**: Since CodonGPT uses a custom tokenizer and logits processor, you need to download these files:
|
|
|
|
|
```python |
|
|
from huggingface_hub import hf_hub_download |
|
|
|
|
|
# Download custom tokenizer and processor |
|
|
hf_hub_download(repo_id="naniltx/codonGPT", filename="tokenizer.py", local_dir="./") |
|
|
hf_hub_download(repo_id="naniltx/codonGPT", filename="synonymous_logit_processor.py", local_dir="./") |
|
|
``` |
|
|
|
|
|
**Alternative**: Download manually from https://huggingface.co/naniltx/codonGPT |
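Alternatively, you can mirror the whole repository (weights, configs, and the custom `.py` modules) in one call with `huggingface_hub`; the local directory name below is just an example.

```python
from huggingface_hub import snapshot_download

# Download every file in the repo (weights, configs, custom modules) into ./codonGPT
local_path = snapshot_download(repo_id="naniltx/codonGPT", local_dir="./codonGPT")
print(f"Repository files are in: {local_path}")
```

However you download them, `tokenizer.py` and `synonymous_logit_processor.py` must be importable (next to your script or on `sys.path`) for the Quick Start below.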
|
|
|
|
|
## Quick Start |
|
|
|
|
|
### 1. Load the Model and Components |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import GPT2LMHeadModel |
|
|
|
|
|
# Import custom components (downloaded above) |
|
|
from tokenizer import CodonTokenizer |
|
|
from synonymous_logit_processor import SynonymMaskingLogitsProcessor |
|
|
|
|
|
# Load model directly from Hugging Face |
|
|
model = GPT2LMHeadModel.from_pretrained("naniltx/codonGPT") |
|
|
model.eval() |
|
|
|
|
|
# Load custom tokenizer |
|
|
tokenizer = CodonTokenizer() |
|
|
``` |
|
|
|
|
|
### 2. Basic Sequence Generation |
|
|
|
|
|
```python |
|
|
# Example: Generate codon sequence |
|
|
input_sequence = "ATGAAACCC" # Sample DNA sequence (must be multiple of 3) |
|
|
|
|
|
# Tokenize input (codon-level tokenization) |
|
|
input_codons = [input_sequence[i:i+3] for i in range(0, len(input_sequence), 3)] |
|
|
input_tokens = [tokenizer.bos_token_id] + tokenizer.convert_tokens_to_ids(input_codons) |
|
|
input_tensor = torch.tensor([input_tokens]) |
|
|
|
|
|
# Generate with the model |
|
|
with torch.no_grad(): |
|
|
outputs = model.generate( |
|
|
input_tensor, |
|
|
max_length=input_tensor.size(1) + 10, # Generate 10 more codons |
|
|
temperature=1.0, |
|
|
do_sample=True, |
|
|
pad_token_id=tokenizer.pad_token_id, |
|
|
eos_token_id=tokenizer.eos_token_id |
|
|
) |
|
|
|
|
|
# Decode results |
|
|
generated_tokens = outputs[0][input_tensor.size(1):].tolist() # Remove input part |
|
|
generated_codons = [tokenizer.decode([token_id]) for token_id in generated_tokens |
|
|
if token_id not in [tokenizer.pad_token_id, tokenizer.eos_token_id]] |
|
|
generated_sequence = ''.join(generated_codons) |
|
|
print(f"Input sequence: {input_sequence}") |
|
|
print(f"Generated sequence: {generated_sequence}") |
|
|
``` |
|
|
|
|
|
### 3. Synonym-Aware Generation |
|
|
|
|
|
```python |
|
|
from synonymous_logit_processor import generate_candidate_codons_with_generate |
|
|
from Bio.Seq import Seq |
|
|
|
|
|
# Generate synonymous alternatives for a sequence |
|
|
# The function includes the human genetic code by default |
|
|
initial_codons = ["ATG", "AAA", "CCC"] # Example codons |
|
|
|
|
|
# Generate optimized codons with synonym-aware decoding |
|
|
optimized_codons = generate_candidate_codons_with_generate( |
|
|
initial_codons, |
|
|
model=model, |
|
|
tokenizer=tokenizer, |
|
|
temperature=1.0, |
|
|
top_k=50, |
|
|
top_p=0.9 |
|
|
) |
|
|
|
|
|
print(f"Original: {initial_codons}") |
|
|
print(f"Optimized: {optimized_codons}") |
|
|
|
|
|
# Verify amino acid sequences are preserved |
|
|
original_aa = ''.join([str(Seq(codon).translate()) for codon in initial_codons]) |
|
|
optimized_aa = ''.join([str(Seq(codon).translate()) for codon in optimized_codons]) |
|
|
print(f"Original AA: {original_aa}") |
|
|
print(f"Optimized AA: {optimized_aa}") |
|
|
print(f"AA preserved: {original_aa == optimized_aa}") |
|
|
``` |
|
|
|
|
|
#### Using Custom Genetic Code |
|
|
|
|
|
```python |
|
|
# If you need a custom genetic code mapping |
|
|
custom_aa_to_codon = { |
|
|
'M': ['ATG'], 'K': ['AAA'], 'P': ['CCC'] # Simplified example |
|
|
# ... add your custom mappings |
|
|
} |
|
|
|
|
|
optimized_codons_custom = generate_candidate_codons_with_generate( |
|
|
initial_codons, |
|
|
model=model, |
|
|
tokenizer=tokenizer, |
|
|
aa_to_codon=custom_aa_to_codon, |
|
|
temperature=1.0 |
|
|
) |
|
|
``` |
|
|
|
|
|
### 4. Advanced Usage with Custom Constraints |
|
|
|
|
|
```python |
|
|
# Custom generation with specific amino acid constraints |
|
|
def generate_with_aa_constraint(target_aa_sequence, model, tokenizer, aa_to_codon=None): |
|
|
"""Generate codon sequence for specific amino acid sequence""" |
|
|
from synonymous_logit_processor import SynonymMaskingLogitsProcessor, aa_to_codon_human |
|
|
|
|
|
if aa_to_codon is None: |
|
|
aa_to_codon = aa_to_codon_human |
|
|
|
|
|
generated_codons = [] |
|
|
current_tokens = [tokenizer.bos_token_id] |
|
|
|
|
|
for aa in target_aa_sequence: |
|
|
# Create processor for current amino acid |
|
|
processor = SynonymMaskingLogitsProcessor(aa, tokenizer, aa_to_codon) |
|
|
|
|
|
# Generate next codon |
|
|
input_ids = torch.tensor([current_tokens]) |
|
|
output = model.generate( |
|
|
input_ids, |
|
|
max_length=len(current_tokens) + 1, |
|
|
logits_processor=[processor], |
|
|
do_sample=True, |
|
|
temperature=1.0, |
|
|
pad_token_id=tokenizer.pad_token_id |
|
|
) |
|
|
|
|
|
# Extract and store codon |
|
|
next_token = output[0][-1].item() |
|
|
codon = tokenizer.decode([next_token]) |
|
|
generated_codons.append(codon) |
|
|
current_tokens.append(next_token) |
|
|
|
|
|
return generated_codons |
|
|
|
|
|
# Example usage |
|
|
aa_sequence = "MKP" # Methionine-Lysine-Proline |
|
|
codons = generate_with_aa_constraint(aa_sequence, model, tokenizer) |
|
|
print(f"AA sequence: {aa_sequence}") |
|
|
print(f"Generated codons: {codons}") |
|
|
print(f"DNA sequence: {''.join(codons)}") |
|
|
|
|
|
# Verify the translation |
|
|
from Bio.Seq import Seq |
|
|
generated_dna = ''.join(codons) |
|
|
translated_aa = str(Seq(generated_dna).translate()) |
|
|
print(f"Verification - translated AA: {translated_aa}") |
|
|
print(f"Match: {aa_sequence == translated_aa}") |
|
|
``` |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
- **Base**: GPT-2 decoder architecture |
|
|
- **Vocabulary**: 67 tokens (64 codons + 3 special tokens: [PAD], [BOS], [EOS]) |
|
|
- **Tokenization**: Codon-level (3 nucleotides per token) |
|
|
- **Training**: Pretrained on Ensembl CDS sequences |
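The vocabulary size follows directly from the codon alphabet (4^3 = 64 codons plus the three special tokens); a quick sanity check in plain Python:

```python
from itertools import product

# All length-3 combinations of the four bases -> 64 codons
codons = ["".join(bases) for bases in product("ACGT", repeat=3)]
special_tokens = ["[PAD]", "[BOS]", "[EOS]"]

print(len(codons))                        # 64 codons
print(len(codons) + len(special_tokens))  # 67 tokens total
```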
|
|
|
|
|
## Use Cases |
|
|
|
|
|
1. **Codon optimization**: Generate alternative codon sequences while preserving the encoded amino acid sequence
|
|
2. **Sequence design**: Create biologically realistic DNA/mRNA sequences |
|
|
3. **Synthetic biology**: Design sequences with specific CAI/GC content properties |
|
|
4. **Research**: Study codon usage patterns and genetic biases |
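For codon optimization and sequence design it is often useful to score candidates after generation. The snippet below is a minimal GC-content check (not part of the repo); CAI scoring would additionally require a codon-usage reference table.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G/C nucleotides in a DNA sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

# Compare a synonymous candidate (same amino acids M-K-P) against the original
print(f"Original  GC: {gc_content('ATGAAACCC'):.2f}")
print(f"Candidate GC: {gc_content('ATGAAGCCG'):.2f}")
```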
|
|
|
|
|
## Important Notes |
|
|
|
|
|
- Input sequence length must be a multiple of 3 nucleotides (complete codons only)
|
|
- Model generates at codon-level granularity |
|
|
- The custom tokenizer and logits processor are required for correct tokenization and synonym-aware decoding
|
|
- Model is optimized for research use cases |
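A small helper (hypothetical, not shipped with the repo) that enforces the complete-codon requirement above before tokenization:

```python
def to_codons(sequence: str) -> list[str]:
    """Validate a coding sequence and split it into codon tokens (hypothetical helper)."""
    sequence = sequence.upper()
    if len(sequence) % 3 != 0:
        raise ValueError(f"Sequence length {len(sequence)} is not a multiple of 3")
    if set(sequence) - set("ACGT"):
        raise ValueError("Sequence contains characters other than A, C, G, T")
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

print(to_codons("ATGAAACCC"))  # ['ATG', 'AAA', 'CCC']
```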
|
|
|
|
|
## Files Structure |
|
|
|
|
|
``` |
|
|
codonGPT/ |
|
|
├── config.json                    # Model configuration
├── generation_config.json        # Generation parameters
├── pytorch_model.bin             # Model weights
├── tokenizer.py                  # Custom codon tokenizer
└── synonymous_logit_processor.py # Synonym-aware processor
|
|
``` |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use CodonGPT in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{rajbanshi2025codongpt, |
|
|
title={codonGPT: Reinforcement learning on a generative language model optimizes RNA sequences under biological constraints}, |
|
|
author={Rajbanshi, Binita and Guruacharya, Anuj}, |
|
|
journal={bioRxiv}, |
|
|
year={2025}, |
|
|
doi={10.1101/2025.06.25.661500}, |
|
|
url={https://doi.org/10.1101/2025.06.25.661500} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
Free for research use. For commercial applications, please contact Nanil Therapeutics Inc. |
|
|
|
|
|
## Support |
|
|
|
|
|
For questions and issues, please refer to the Hugging Face model page or contact the developers. |