Model Description

This repository ships the CodonGPT model checkpoint together with its custom codon-level tokenizer and custom SynonymousLogitProcessor, so you can reproduce the constrained-generation workflow straight from the model card. The model is a GPT-2–style decoder pretrained on Ensembl CDS sequences; it learns synonymous structure and CAI/GC biases and is optimized for codon-aware sequence design. After pulling the snapshot, load the tokenizer and processor from the repo files to enable synonym-aware decoding that encourages biologically equivalent alternatives while preserving sequence-level realism.

  • Developed by: Nanil Therapeutics Inc.
  • Model type: Transformer-based generative language model for protein-coding DNA/mRNA sequences
  • License: Free for research use

CodonGPT Quickstart Guide

Overview

CodonGPT is a transformer-based generative language model designed specifically for protein-coding DNA/mRNA sequences. Developed by Nanil Therapeutics Inc., it generates codon-level sequences with awareness of biological constraints and synonymous codon structure.

Key Features

  • Codon-aware sequence design: Trained on Ensembl CDS sequences with GPT-2 architecture
  • Synonymous structure learning: Understands CAI/GC biases and genetic patterns
  • Custom tokenizer: Processes sequences at the codon level (3-nucleotide chunks)
  • SynonymousLogitProcessor: Enables biologically equivalent alternative generation (a conceptual sketch follows this list)
  • Research license: Free for research use
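
The repository's SynonymousLogitProcessor is loaded and used in sections 3 and 4 below. For intuition only, here is a minimal conceptual sketch of synonym masking; it is not the shipped implementation, the class name ToySynonymMask is made up for illustration, and it assumes the tokenizer maps codon strings to ids via convert_tokens_to_ids (as in the quickstart):

import torch
from transformers import LogitsProcessor

class ToySynonymMask(LogitsProcessor):
    """Illustrative only: keep logits of codons synonymous with one amino acid."""
    def __init__(self, target_aa, tokenizer, aa_to_codon):
        # ids of every codon that encodes the target amino acid
        self.allowed_ids = tokenizer.convert_tokens_to_ids(aa_to_codon[target_aa])

    def __call__(self, input_ids, scores):
        # suppress all non-synonymous tokens so sampling can only pick a codon
        # that preserves the target amino acid
        masked = torch.full_like(scores, float("-inf"))
        masked[:, self.allowed_ids] = scores[:, self.allowed_ids]
        return masked

In practice, use the repository's SynonymMaskingLogitsProcessor rather than this sketch.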

Installation

# Install dependencies - Note: torch 2.6+ required for security reasons
pip install "torch>=2.6" transformers biopython huggingface_hub

Download the custom components: CodonGPT uses a custom tokenizer and logits processor, so you need to download these files first:

from huggingface_hub import hf_hub_download

# Download custom tokenizer and processor
hf_hub_download(repo_id="naniltx/codonGPT", filename="tokenizer.py", local_dir="./")
hf_hub_download(repo_id="naniltx/codonGPT", filename="synonymous_logit_processor.py", local_dir="./")

Alternative: Download manually from https://huggingface.co/naniltx/codonGPT
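
If you prefer to pull the whole snapshot (weights, configs, and both Python files) in one call, huggingface_hub's snapshot_download works as well; keep the files in your working directory so the imports in the next section resolve:

from huggingface_hub import snapshot_download

# Download the full repository snapshot (model weights, config, tokenizer.py,
# synonymous_logit_processor.py) into the current directory
snapshot_download(repo_id="naniltx/codonGPT", local_dir="./")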

Quick Start

1. Load the Model and Components

import torch
from transformers import GPT2LMHeadModel

# Import custom components (downloaded above)
from tokenizer import CodonTokenizer
from synonymous_logit_processor import SynonymMaskingLogitsProcessor

# Load model directly from Hugging Face
model = GPT2LMHeadModel.from_pretrained("naniltx/codonGPT")
model.eval()

# Load custom tokenizer
tokenizer = CodonTokenizer()

2. Basic Sequence Generation

# Example: Generate codon sequence
input_sequence = "ATGAAACCC"  # Sample DNA sequence (must be multiple of 3)

# Tokenize input (codon-level tokenization)
input_codons = [input_sequence[i:i+3] for i in range(0, len(input_sequence), 3)]
input_tokens = [tokenizer.bos_token_id] + tokenizer.convert_tokens_to_ids(input_codons)
input_tensor = torch.tensor([input_tokens])

# Generate with the model
with torch.no_grad():
    outputs = model.generate(
        input_tensor,
        max_length=input_tensor.size(1) + 10,  # Generate 10 more codons
        temperature=1.0,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

# Decode results
generated_tokens = outputs[0][input_tensor.size(1):].tolist()  # Remove input part
generated_codons = [tokenizer.decode([token_id]) for token_id in generated_tokens 
                   if token_id not in [tokenizer.pad_token_id, tokenizer.eos_token_id]]
generated_sequence = ''.join(generated_codons)
print(f"Input sequence: {input_sequence}")
print(f"Generated sequence: {generated_sequence}")

3. Synonym-Aware Generation

from synonymous_logit_processor import generate_candidate_codons_with_generate
from Bio.Seq import Seq

# Generate synonymous alternatives for a sequence
# The function includes the human genetic code by default
initial_codons = ["ATG", "AAA", "CCC"]  # Example codons

# Generate optimized codons with synonym-aware decoding
optimized_codons = generate_candidate_codons_with_generate(
    initial_codons,
    model=model,
    tokenizer=tokenizer,
    temperature=1.0,
    top_k=50,
    top_p=0.9
)

print(f"Original: {initial_codons}")
print(f"Optimized: {optimized_codons}")

# Verify amino acid sequences are preserved
original_aa = ''.join([str(Seq(codon).translate()) for codon in initial_codons])
optimized_aa = ''.join([str(Seq(codon).translate()) for codon in optimized_codons])
print(f"Original AA: {original_aa}")
print(f"Optimized AA: {optimized_aa}")
print(f"AA preserved: {original_aa == optimized_aa}")

Using Custom Genetic Code

# If you need a custom genetic code mapping
custom_aa_to_codon = {
    'M': ['ATG'], 'K': ['AAA'], 'P': ['CCC']  # Simplified example
    # ... add your custom mappings
}

optimized_codons_custom = generate_candidate_codons_with_generate(
    initial_codons,
    model=model,
    tokenizer=tokenizer,
    aa_to_codon=custom_aa_to_codon,
    temperature=1.0
)

4. Advanced Usage with Custom Constraints

# Custom generation with specific amino acid constraints
def generate_with_aa_constraint(target_aa_sequence, model, tokenizer, aa_to_codon=None):
    """Generate codon sequence for specific amino acid sequence"""
    from synonymous_logit_processor import SynonymMaskingLogitsProcessor, aa_to_codon_human
    
    if aa_to_codon is None:
        aa_to_codon = aa_to_codon_human
    
    generated_codons = []
    current_tokens = [tokenizer.bos_token_id]
    
    for aa in target_aa_sequence:
        # Create processor for current amino acid
        processor = SynonymMaskingLogitsProcessor(aa, tokenizer, aa_to_codon)
        
        # Generate next codon
        input_ids = torch.tensor([current_tokens])
        output = model.generate(
            input_ids,
            max_length=len(current_tokens) + 1,
            logits_processor=[processor],
            do_sample=True,
            temperature=1.0,
            pad_token_id=tokenizer.pad_token_id
        )
        
        # Extract and store codon
        next_token = output[0][-1].item()
        codon = tokenizer.decode([next_token])
        generated_codons.append(codon)
        current_tokens.append(next_token)
    
    return generated_codons

# Example usage
aa_sequence = "MKP"  # Methionine-Lysine-Proline
codons = generate_with_aa_constraint(aa_sequence, model, tokenizer)
print(f"AA sequence: {aa_sequence}")
print(f"Generated codons: {codons}")
print(f"DNA sequence: {''.join(codons)}")

# Verify the translation
from Bio.Seq import Seq
generated_dna = ''.join(codons)
translated_aa = str(Seq(generated_dna).translate())
print(f"Verification - translated AA: {translated_aa}")
print(f"Match: {aa_sequence == translated_aa}")

Model Architecture

  • Base: GPT-2 decoder architecture
  • Vocabulary: 67 tokens (64 codons + 3 special tokens: [PAD], [BOS], [EOS]); see the quick check below
  • Tokenization: Codon-level (3 nucleotides per token)
  • Training: Pretrained on Ensembl CDS sequences
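
As a quick sanity check of the vocabulary size quoted above, the 64 codons are simply every 3-letter combination over the DNA alphabet (4^3); this snippet is illustrative and independent of the shipped tokenizer:

from itertools import product

# Enumerate all 64 codons and add the 3 special tokens
codons = ["".join(c) for c in product("ACGT", repeat=3)]
special_tokens = ["[PAD]", "[BOS]", "[EOS]"]
print(len(codons) + len(special_tokens))  # 67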

Use Cases

  1. Codon optimization: Generate alternative codon sequences with preserved amino acid sequence
  2. Sequence design: Create biologically realistic DNA/mRNA sequences
  3. Synthetic biology: Design sequences with specific CAI/GC content properties (see the GC helper below)
  4. Research: Study codon usage patterns and genetic biases
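
For the synthetic-biology use case, a small helper like the one below (illustrative, not part of the repository) can report the GC content of a designed sequence; computing CAI additionally requires a reference codon-usage table and is not shown here:

def gc_content(sequence: str) -> float:
    """Fraction of G/C nucleotides in a DNA sequence."""
    s = sequence.upper()
    return (s.count("G") + s.count("C")) / len(s)

# Example: check GC content of a designed sequence
print(f"GC content: {gc_content('ATGAAACCC'):.2%}")  # 44.44%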

Important Notes

  • Input sequences must be multiples of 3 nucleotides (complete codons); a quick check is sketched below
  • Model generates at codon-level granularity
  • Custom tokenizer and processor are essential for proper functionality
  • Model is optimized for research use cases
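
A small input check (illustrative, not part of the repository) makes the complete-codon requirement explicit before tokenization:

def split_into_codons(sequence: str) -> list:
    """Split a DNA sequence into codons, rejecting incomplete ones."""
    if len(sequence) % 3 != 0:
        raise ValueError(f"Sequence length {len(sequence)} is not a multiple of 3")
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

print(split_into_codons("ATGAAACCC"))  # ['ATG', 'AAA', 'CCC']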

Files Structure

codonGPT/
├── config.json                    # Model configuration
├── generation_config.json         # Generation parameters
├── pytorch_model.bin              # Model weights
├── tokenizer.py                   # Custom codon tokenizer
└── synonymous_logit_processor.py  # Synonym-aware processor

Citation

If you use CodonGPT in your research, please cite:

@article{rajbanshi2025codongpt,
  title={codonGPT: Reinforcement learning on a generative language model optimizes RNA sequences under biological constraints},
  author={Rajbanshi, Binita and Guruacharya, Anuj},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.06.25.661500},
  url={https://doi.org/10.1101/2025.06.25.661500}
}

License

Free for research use. For commercial applications, please contact Nanil Therapeutics Inc.

Support

For questions and issues, please refer to the Hugging Face model page or contact the developers.
