---
library_name: transformers
tags: []
---
### Model Description
This repository ships the CodonGPT model checkpoint together with its custom codon-level tokenizer and SynonymousLogitProcessor, so you can reproduce the constrained-generation workflow straight from the model card. The model was pretrained on Ensembl CDS sequences with a GPT-2-style decoder, learns synonymous structure and CAI/GC biases, and is optimized for codon-aware sequence design. After pulling the snapshot, load the tokenizer and processor from the repo files to enable synonym-aware decoding that encourages biologically equivalent alternatives while preserving sequence-level realism.
- **Developed by:** Nanil Therapeutics Inc.
- **Model type:** Transformer-based generative language model for protein-coding DNA/mRNA sequences
- **License:** Free for research use
# CodonGPT Quickstart Guide
## Overview
CodonGPT is a transformer-based generative language model specifically designed for protein-coding DNA/mRNA sequences. Developed by Nanil Therapeutics Inc., it generates codon-level sequences with biological awareness and synonymous structure understanding.
## Key Features
- **Codon-aware sequence design**: Trained on Ensembl CDS sequences with GPT-2 architecture
- **Synonymous structure learning**: Understands CAI/GC biases and genetic patterns
- **Custom tokenizer**: Processes sequences at the codon level (3-nucleotide chunks)
- **SynonymousLogitProcessor**: Enables biologically equivalent alternative generation
- **Research license**: Free for research use
## Installation
```bash
# Install dependencies - Note: torch 2.6+ required for security reasons
pip install torch==2.6.0 transformers biopython huggingface_hub
```
**Download custom components**: Since CodonGPT uses a custom tokenizer and logits processor, you need to download these files:
```python
from huggingface_hub import hf_hub_download
# Download custom tokenizer and processor
hf_hub_download(repo_id="naniltx/codonGPT", filename="tokenizer.py", local_dir="./")
hf_hub_download(repo_id="naniltx/codonGPT", filename="synonymous_logit_processor.py", local_dir="./")
```
**Alternative**: Download manually from https://huggingface.co/naniltx/codonGPT
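If you prefer a single call, `snapshot_download` from `huggingface_hub` can pull the whole repository at once; this is a minimal sketch using the same repo ID as above:
```python
from huggingface_hub import snapshot_download

# Fetch the full repository (weights, tokenizer.py, synonymous_logit_processor.py)
# into the current directory.
snapshot_download(repo_id="naniltx/codonGPT", local_dir="./")
```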
## Quick Start
### 1. Load the Model and Components
```python
import torch
from transformers import GPT2LMHeadModel
# Import custom components (downloaded above)
from tokenizer import CodonTokenizer
from synonymous_logit_processor import SynonymMaskingLogitsProcessor
# Load model directly from Hugging Face
model = GPT2LMHeadModel.from_pretrained("naniltx/codonGPT")
model.eval()
# Load custom tokenizer
tokenizer = CodonTokenizer()
```
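Optionally, if a GPU is available, you can move the model to it. This is standard PyTorch usage rather than anything CodonGPT-specific; the input tensors in the examples below would then need to be moved to the same `device`:
```python
# Optional: standard PyTorch device placement (assumes a CUDA GPU may be present)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# e.g. later: input_tensor = input_tensor.to(device)
```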
### 2. Basic Sequence Generation
```python
# Example: Generate codon sequence
input_sequence = "ATGAAACCC"  # Sample DNA sequence (must be multiple of 3)

# Tokenize input (codon-level tokenization)
input_codons = [input_sequence[i:i+3] for i in range(0, len(input_sequence), 3)]
input_tokens = [tokenizer.bos_token_id] + tokenizer.convert_tokens_to_ids(input_codons)
input_tensor = torch.tensor([input_tokens])

# Generate with the model
with torch.no_grad():
    outputs = model.generate(
        input_tensor,
        max_length=input_tensor.size(1) + 10,  # Generate 10 more codons
        temperature=1.0,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

# Decode results
generated_tokens = outputs[0][input_tensor.size(1):].tolist()  # Remove input part
generated_codons = [tokenizer.decode([token_id]) for token_id in generated_tokens
                    if token_id not in [tokenizer.pad_token_id, tokenizer.eos_token_id]]
generated_sequence = ''.join(generated_codons)

print(f"Input sequence: {input_sequence}")
print(f"Generated sequence: {generated_sequence}")
```
### 3. Synonym-Aware Generation
```python
from synonymous_logit_processor import generate_candidate_codons_with_generate
from Bio.Seq import Seq
# Generate synonymous alternatives for a sequence
# The function includes the human genetic code by default
initial_codons = ["ATG", "AAA", "CCC"] # Example codons
# Generate optimized codons with synonym-aware decoding
optimized_codons = generate_candidate_codons_with_generate(
    initial_codons,
    model=model,
    tokenizer=tokenizer,
    temperature=1.0,
    top_k=50,
    top_p=0.9
)
print(f"Original: {initial_codons}")
print(f"Optimized: {optimized_codons}")
# Verify amino acid sequences are preserved
original_aa = ''.join([str(Seq(codon).translate()) for codon in initial_codons])
optimized_aa = ''.join([str(Seq(codon).translate()) for codon in optimized_codons])
print(f"Original AA: {original_aa}")
print(f"Optimized AA: {optimized_aa}")
print(f"AA preserved: {original_aa == optimized_aa}")
```
#### Using Custom Genetic Code
```python
# If you need a custom genetic code mapping
custom_aa_to_codon = {
    'M': ['ATG'], 'K': ['AAA'], 'P': ['CCC']  # Simplified example
    # ... add your custom mappings
}

optimized_codons_custom = generate_candidate_codons_with_generate(
    initial_codons,
    model=model,
    tokenizer=tokenizer,
    aa_to_codon=custom_aa_to_codon,
    temperature=1.0
)
```
### 4. Advanced Usage with Custom Constraints
```python
# Custom generation with specific amino acid constraints
def generate_with_aa_constraint(target_aa_sequence, model, tokenizer, aa_to_codon=None):
    """Generate codon sequence for a specific amino acid sequence"""
    from synonymous_logit_processor import SynonymMaskingLogitsProcessor, aa_to_codon_human

    if aa_to_codon is None:
        aa_to_codon = aa_to_codon_human

    generated_codons = []
    current_tokens = [tokenizer.bos_token_id]

    for aa in target_aa_sequence:
        # Create processor for current amino acid
        processor = SynonymMaskingLogitsProcessor(aa, tokenizer, aa_to_codon)

        # Generate next codon
        input_ids = torch.tensor([current_tokens])
        output = model.generate(
            input_ids,
            max_length=len(current_tokens) + 1,
            logits_processor=[processor],
            do_sample=True,
            temperature=1.0,
            pad_token_id=tokenizer.pad_token_id
        )

        # Extract and store codon
        next_token = output[0][-1].item()
        codon = tokenizer.decode([next_token])
        generated_codons.append(codon)
        current_tokens.append(next_token)

    return generated_codons

# Example usage
aa_sequence = "MKP"  # Methionine-Lysine-Proline
codons = generate_with_aa_constraint(aa_sequence, model, tokenizer)
print(f"AA sequence: {aa_sequence}")
print(f"Generated codons: {codons}")
print(f"DNA sequence: {''.join(codons)}")

# Verify the translation
from Bio.Seq import Seq
generated_dna = ''.join(codons)
translated_aa = str(Seq(generated_dna).translate())
print(f"Verification - translated AA: {translated_aa}")
print(f"Match: {aa_sequence == translated_aa}")
```
## Model Architecture
- **Base**: GPT-2 decoder architecture
- **Vocabulary**: 67 tokens (64 codons + 3 special tokens: [PAD], [BOS], [EOS]); a quick sanity check is sketched after this list
- **Tokenization**: Codon-level (3 nucleotides per token)
- **Training**: Pretrained on Ensembl CDS sequences
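As a quick sanity check of the vocabulary size (a standalone sketch that does not depend on the custom tokenizer):
```python
from itertools import product

# 4 nucleotides taken as triplets -> 4**3 = 64 codons
codons = [''.join(p) for p in product("ACGT", repeat=3)]
special_tokens = ["[PAD]", "[BOS]", "[EOS]"]
print(len(codons))                        # 64
print(len(codons) + len(special_tokens))  # 67 tokens in total
```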
## Use Cases
1. **Codon optimization**: Generate alternative codon sequences with preserved amino acid sequence
2. **Sequence design**: Create biologically realistic DNA/mRNA sequences
3. **Synthetic biology**: Design sequences with specific CAI/GC content properties (a GC-content check is sketched after this list)
4. **Research**: Study codon usage patterns and genetic biases
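For the GC-content side of use case 3, a plain-Python check like the following sketch can be run on generated sequences; the example sequence is an arbitrary illustration:
```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G/C nucleotides in a DNA sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

generated = "ATGAAACCCGGG"  # e.g. a sequence produced by CodonGPT
print(f"GC content: {gc_content(generated):.2%}")
```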
## Important Notes
- Input sequences must be multiples of 3 nucleotides (complete codons); a validation sketch follows this list
- Model generates at codon-level granularity
- Custom tokenizer and processor are essential for proper functionality
- Model is optimized for research use cases
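A minimal validation sketch for the complete-codon requirement (the helper name and error messages are illustrative, not part of the repo):
```python
def validate_cds(sequence: str) -> list[str]:
    """Check that a sequence is complete A/C/G/T codons and return the codon list."""
    sequence = sequence.upper()
    if len(sequence) % 3 != 0:
        raise ValueError(f"Length {len(sequence)} is not a multiple of 3 (incomplete codons)")
    if set(sequence) - set("ACGT"):
        raise ValueError("Sequence contains non-ACGT characters")
    return [sequence[i:i+3] for i in range(0, len(sequence), 3)]

print(validate_cds("ATGAAACCC"))  # ['ATG', 'AAA', 'CCC']
```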
## Files Structure
```
codonGPT/
├── config.json                     # Model configuration
├── generation_config.json          # Generation parameters
├── pytorch_model.bin               # Model weights
├── tokenizer.py                    # Custom codon tokenizer
└── synonymous_logit_processor.py   # Synonym-aware processor
```
## Citation
If you use CodonGPT in your research, please cite:
```bibtex
@article{rajbanshi2025codongpt,
title={codonGPT: Reinforcement learning on a generative language model optimizes RNA sequences under biological constraints},
author={Rajbanshi, Binita and Guruacharya, Anuj},
journal={bioRxiv},
year={2025},
doi={10.1101/2025.06.25.661500},
url={https://doi.org/10.1101/2025.06.25.661500}
}
```
## License
Free for research use. For commercial applications, please contact Nanil Therapeutics Inc.
## Support
For questions and issues, please refer to the Hugging Face model page or contact the developers. |