---
library_name: transformers
tags: []
---

### Model Description

This repository ships the CodonGPT model checkpoint together with its custom codon-level tokenizer and custom SynonymousLogitProcessor, so you can reproduce the constrained generation workflow straight from the model card. The model was pretrained on Ensembl CDS sequences with a GPT-2-style decoder, learns synonymous structure and CAI/GC biases, and is optimized for codon-aware sequence design. After pulling the snapshot, load the tokenizer and processor from the repo files to enable synonym-aware decoding that encourages biologically equivalent alternatives while preserving sequence-level realism.

- **Developed by:** Nanil Therapeutics Inc.
- **Model type:** Transformer-based generative language model for protein-coding DNA/mRNA sequences
- **License:** Free for research use

# CodonGPT Quickstart Guide

## Overview

CodonGPT is a transformer-based generative language model designed specifically for protein-coding DNA/mRNA sequences. Developed by Nanil Therapeutics Inc., it generates codon-level sequences with biological awareness and an understanding of synonymous structure.
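Codon-level modeling means the unit of generation is a 3-nucleotide chunk rather than a single base. As a quick illustration of the chunking the custom tokenizer performs, here is a minimal stdlib sketch (the helper name is illustrative and not part of the repo):

```python
def split_into_codons(sequence: str) -> list[str]:
    """Split a coding sequence into 3-nucleotide codons.

    Illustrative helper, not part of the CodonGPT repo.
    """
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

print(split_into_codons("ATGAAACCC"))  # → ['ATG', 'AAA', 'CCC']
```

Each codon then maps to one token ID in the model's vocabulary.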
## Key Features

- **Codon-aware sequence design**: Trained on Ensembl CDS sequences with a GPT-2 architecture
- **Synonymous structure learning**: Understands CAI/GC biases and genetic patterns
- **Custom tokenizer**: Processes sequences at the codon level (3-nucleotide chunks)
- **SynonymousLogitProcessor**: Enables generation of biologically equivalent alternatives
- **Research license**: Free for research use

## Installation

```bash
# Install dependencies (torch 2.6+ required for security reasons)
pip install torch==2.6.0 transformers biopython huggingface_hub
```

**Download custom components**: Because CodonGPT uses a custom tokenizer and logits processor, you need to download these files:

```python
from huggingface_hub import hf_hub_download

# Download the custom tokenizer and logits processor
hf_hub_download(repo_id="naniltx/codonGPT", filename="tokenizer.py", local_dir="./")
hf_hub_download(repo_id="naniltx/codonGPT", filename="synonymous_logit_processor.py", local_dir="./")
```

**Alternative**: Download the files manually from https://huggingface.co/naniltx/codonGPT

## Quick Start

### 1. Load the Model and Components

```python
import torch
from transformers import GPT2LMHeadModel

# Import custom components (downloaded above)
from tokenizer import CodonTokenizer
from synonymous_logit_processor import SynonymMaskingLogitsProcessor

# Load the model directly from Hugging Face
model = GPT2LMHeadModel.from_pretrained("naniltx/codonGPT")
model.eval()

# Load the custom tokenizer
tokenizer = CodonTokenizer()
```
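After loading, a quick sanity check is to reconstruct the expected vocabulary size from first principles: 4 bases over 3 codon positions give 4³ = 64 codons, plus the three special tokens. This sketch is stdlib-only:

```python
from itertools import product

# Enumerate all 64 codons over the DNA alphabet
codons = ["".join(p) for p in product("ACGT", repeat=3)]

# 64 codons + 3 special tokens ([PAD], [BOS], [EOS]) = 67-token vocabulary
expected_vocab_size = len(codons) + 3
print(expected_vocab_size)  # → 67
```

In practice you can compare this number against `model.config.vocab_size` after loading the checkpoint.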
### 2. Basic Sequence Generation

```python
# Example: generate a codon sequence
input_sequence = "ATGAAACCC"  # Sample DNA sequence (length must be a multiple of 3)

# Tokenize the input at the codon level
input_codons = [input_sequence[i:i+3] for i in range(0, len(input_sequence), 3)]
input_tokens = [tokenizer.bos_token_id] + tokenizer.convert_tokens_to_ids(input_codons)
input_tensor = torch.tensor([input_tokens])

# Generate with the model
with torch.no_grad():
    outputs = model.generate(
        input_tensor,
        max_length=input_tensor.size(1) + 10,  # Generate 10 more codons
        temperature=1.0,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode the results, dropping the input prefix and special tokens
generated_tokens = outputs[0][input_tensor.size(1):].tolist()
generated_codons = [
    tokenizer.decode([token_id])
    for token_id in generated_tokens
    if token_id not in (tokenizer.pad_token_id, tokenizer.eos_token_id)
]
generated_sequence = ''.join(generated_codons)

print(f"Input sequence: {input_sequence}")
print(f"Generated sequence: {generated_sequence}")
```
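Because the model is trained with CAI/GC biases in mind, it is often useful to inspect the GC content of generated sequences. A stdlib sketch (the function name is illustrative, not part of the repo):

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C nucleotides in a DNA sequence."""
    if not sequence:
        raise ValueError("empty sequence")
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

print(f"{gc_content('ATGAAACCC'):.2f}")  # → 0.44
```

Running this on both the input and the generated continuation gives a quick read on whether sampling is drifting away from the GC range you care about.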
### 3. Synonym-Aware Generation

```python
from synonymous_logit_processor import generate_candidate_codons_with_generate
from Bio.Seq import Seq

# Generate synonymous alternatives for a sequence.
# The function uses the human genetic code by default.
initial_codons = ["ATG", "AAA", "CCC"]  # Example codons

# Generate optimized codons with synonym-aware decoding
optimized_codons = generate_candidate_codons_with_generate(
    initial_codons,
    model=model,
    tokenizer=tokenizer,
    temperature=1.0,
    top_k=50,
    top_p=0.9,
)

print(f"Original: {initial_codons}")
print(f"Optimized: {optimized_codons}")

# Verify that the amino acid sequence is preserved
original_aa = ''.join(str(Seq(codon).translate()) for codon in initial_codons)
optimized_aa = ''.join(str(Seq(codon).translate()) for codon in optimized_codons)
print(f"Original AA: {original_aa}")
print(f"Optimized AA: {optimized_aa}")
print(f"AA preserved: {original_aa == optimized_aa}")
```

#### Using a Custom Genetic Code

```python
# If you need a custom genetic code mapping
custom_aa_to_codon = {
    'M': ['ATG'],
    'K': ['AAA'],
    'P': ['CCC'],  # Simplified example
    # ... add your custom mappings
}

optimized_codons_custom = generate_candidate_codons_with_generate(
    initial_codons,
    model=model,
    tokenizer=tokenizer,
    aa_to_codon=custom_aa_to_codon,
    temperature=1.0,
)
```
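If you would rather derive a complete standard genetic code mapping than hand-write one, the table can be built from the classic 64-character amino acid string in TCAG order. This stdlib sketch produces a dict in the same `{amino_acid: [codons]}` shape the `aa_to_codon` argument expects; whether it matches the repo's `aa_to_codon_human` exactly (for example, in how stop codons are handled) is an assumption you should verify against the downloaded file:

```python
from itertools import product
from collections import defaultdict

# Standard genetic code: amino acids for all 64 codons, with the first,
# second, and third codon positions each cycling through T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons = ["".join(p) for p in product(BASES, repeat=3)]
aa_to_codon = defaultdict(list)
for codon, aa in zip(codons, AMINO_ACIDS):
    if aa != "*":  # skip stop codons
        aa_to_codon[aa].append(codon)

print(aa_to_codon["M"])       # → ['ATG']
print(len(aa_to_codon["L"]))  # → 6
```

The result can be passed as `aa_to_codon=dict(aa_to_codon)` in place of the simplified `custom_aa_to_codon` example above.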
### 4. Advanced Usage with Custom Constraints

```python
# Custom generation with specific amino acid constraints
def generate_with_aa_constraint(target_aa_sequence, model, tokenizer, aa_to_codon=None):
    """Generate a codon sequence encoding a specific amino acid sequence."""
    from synonymous_logit_processor import SynonymMaskingLogitsProcessor, aa_to_codon_human

    if aa_to_codon is None:
        aa_to_codon = aa_to_codon_human

    generated_codons = []
    current_tokens = [tokenizer.bos_token_id]

    for aa in target_aa_sequence:
        # Create a processor that masks all codons not encoding this amino acid
        processor = SynonymMaskingLogitsProcessor(aa, tokenizer, aa_to_codon)

        # Generate the next codon
        input_ids = torch.tensor([current_tokens])
        output = model.generate(
            input_ids,
            max_length=len(current_tokens) + 1,
            logits_processor=[processor],
            do_sample=True,
            temperature=1.0,
            pad_token_id=tokenizer.pad_token_id,
        )

        # Extract and store the codon
        next_token = output[0][-1].item()
        codon = tokenizer.decode([next_token])
        generated_codons.append(codon)
        current_tokens.append(next_token)

    return generated_codons

# Example usage
aa_sequence = "MKP"  # Methionine-Lysine-Proline
codons = generate_with_aa_constraint(aa_sequence, model, tokenizer)
print(f"AA sequence: {aa_sequence}")
print(f"Generated codons: {codons}")
print(f"DNA sequence: {''.join(codons)}")

# Verify the translation
from Bio.Seq import Seq
generated_dna = ''.join(codons)
translated_aa = str(Seq(generated_dna).translate())
print(f"Verification - translated AA: {translated_aa}")
print(f"Match: {aa_sequence == translated_aa}")
```

## Model Architecture

- **Base**: GPT-2 decoder architecture
- **Vocabulary**: 67 tokens (64 codons + 3 special tokens: [PAD], [BOS], [EOS])
- **Tokenization**: Codon-level (3 nucleotides per token)
- **Training**: Pretrained on Ensembl CDS sequences

## Use Cases

1. **Codon optimization**: Generate alternative codon sequences while preserving the amino acid sequence
2. **Sequence design**: Create biologically realistic DNA/mRNA sequences
3. **Synthetic biology**: Design sequences with specific CAI/GC content properties
4. **Research**: Study codon usage patterns and genetic biases

## Important Notes

- Input sequences must be a multiple of 3 nucleotides long (complete codons)
- The model generates at codon-level granularity
- The custom tokenizer and logits processor are essential for proper functionality
- The model is optimized for research use cases

## Files Structure

```
codonGPT/
├── config.json                    # Model configuration
├── generation_config.json         # Generation parameters
├── pytorch_model.bin              # Model weights
├── tokenizer.py                   # Custom codon tokenizer
└── synonymous_logit_processor.py  # Synonym-aware logits processor
```

## Citation

If you use CodonGPT in your research, please cite:

```bibtex
@article{rajbanshi2025codongpt,
  title={codonGPT: Reinforcement learning on a generative language model optimizes RNA sequences under biological constraints},
  author={Rajbanshi, Binita and Guruacharya, Anuj},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.06.25.661500},
  url={https://doi.org/10.1101/2025.06.25.661500}
}
```

## License

Free for research use. For commercial applications, please contact Nanil Therapeutics Inc.

## Support

For questions and issues, please refer to the Hugging Face model page or contact the developers.