|
|
--- |
|
|
library_name: transformers |
|
|
tags: [] |
|
|
--- |
|
|
|
|
|
### Model Description |
|
|
|
|
|
This repository ships the CodonGPT model checkpoint together with its custom codon-level tokenizer and custom SynonymousLogitProcessor, so you can reproduce the constrained generation workflow straight from the model card. The model was pretrained on Ensembl CDS sequences with a GPT-2-style decoder, learns synonymous structure and CAI/GC biases, and is optimized for codon-aware sequence design. After pulling the snapshot, load the tokenizer and processor from the repo files to enable synonym-aware decoding that encourages biologically equivalent alternatives while preserving sequence-level realism.
|
|
|
|
|
- **Developed by:** Nanil Therapeutics Inc. |
|
|
- **Model type:** Transformer-based generative language model for protein-coding DNA/mRNA sequences |
|
|
- **License:** Free for research use |
|
|
|
|
|
|
|
|
# CodonGPT Quickstart Guide |
|
|
|
|
|
## Overview |
|
|
|
|
|
CodonGPT is a transformer-based generative language model designed specifically for protein-coding DNA/mRNA sequences. Developed by Nanil Therapeutics Inc., it generates sequences at codon-level granularity and captures synonymous codon structure together with CAI/GC usage biases.
|
|
|
|
|
## Key Features |
|
|
|
|
|
- **Codon-aware sequence design**: Trained on Ensembl CDS sequences with GPT-2 architecture |
|
|
- **Synonymous structure learning**: Understands CAI/GC biases and genetic patterns |
|
|
- **Custom tokenizer**: Processes sequences at the codon level (3-nucleotide chunks) |
|
|
- **SynonymousLogitProcessor**: Enables biologically equivalent alternative generation |
|
|
- **Research license**: Free for research use |
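Under the hood, synonym-aware decoding works by masking the logits of every codon that would change the encoded amino acid. The repo ships its own `SynonymMaskingLogitsProcessor`; the toy processor below is only a conceptual sketch of that masking idea, not the repository's implementation.

```python
import torch
from transformers import LogitsProcessor

class ToySynonymMask(LogitsProcessor):
    """Conceptual sketch only: keep the scores of a whitelist of codon token ids
    (e.g. the synonymous codons of the current amino acid) and block everything else.
    This is NOT the repo's SynonymMaskingLogitsProcessor."""

    def __init__(self, allowed_token_ids):
        self.allowed = torch.tensor(sorted(allowed_token_ids))

    def __call__(self, input_ids, scores):
        # Set every logit to -inf, then restore the allowed codon columns
        masked = torch.full_like(scores, float("-inf"))
        masked[:, self.allowed] = scores[:, self.allowed]
        return masked
```

The repo's processor builds the allowed set per amino acid from a genetic-code table (the human code by default), which is what keeps generated alternatives biologically equivalent.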
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
# Install dependencies - Note: torch 2.6+ required for security reasons |
|
|
pip install torch==2.6.0 transformers biopython huggingface_hub |
|
|
``` |
|
|
|
|
|
**Download custom components**: Since CodonGPT uses a custom tokenizer and logits processor, you need to download these files:
|
|
|
|
|
```python |
|
|
from huggingface_hub import hf_hub_download |
|
|
|
|
|
# Download custom tokenizer and processor |
|
|
hf_hub_download(repo_id="naniltx/codonGPT", filename="tokenizer.py", local_dir="./") |
|
|
hf_hub_download(repo_id="naniltx/codonGPT", filename="synonymous_logit_processor.py", local_dir="./") |
|
|
``` |
|
|
|
|
|
**Alternative**: Download manually from https://huggingface.co/naniltx/codonGPT |
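Alternatively, you can mirror the whole repository (weights, configs, and the custom `.py` modules) in one call with `huggingface_hub`; the local directory name below is just an example.

```python
from huggingface_hub import snapshot_download

# Download every file in the repo (weights, configs, custom modules) into ./codonGPT
local_path = snapshot_download(repo_id="naniltx/codonGPT", local_dir="./codonGPT")
print(f"Repository files are in: {local_path}")
```

However you download them, `tokenizer.py` and `synonymous_logit_processor.py` must be importable (next to your script or on `sys.path`) for the Quick Start below.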
|
|
|
|
|
## Quick Start |
|
|
|
|
|
### 1. Load the Model and Components |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import GPT2LMHeadModel |
|
|
|
|
|
# Import custom components (downloaded above) |
|
|
from tokenizer import CodonTokenizer |
|
|
from synonymous_logit_processor import SynonymMaskingLogitsProcessor |
|
|
|
|
|
# Load model directly from Hugging Face |
|
|
model = GPT2LMHeadModel.from_pretrained("naniltx/codonGPT") |
|
|
model.eval() |
|
|
|
|
|
# Load custom tokenizer |
|
|
tokenizer = CodonTokenizer() |
|
|
``` |
|
|
|
|
|
### 2. Basic Sequence Generation |
|
|
|
|
|
```python |
|
|
# Example: Generate codon sequence |
|
|
input_sequence = "ATGAAACCC" # Sample DNA sequence (must be multiple of 3) |
|
|
|
|
|
# Tokenize input (codon-level tokenization) |
|
|
input_codons = [input_sequence[i:i+3] for i in range(0, len(input_sequence), 3)] |
|
|
input_tokens = [tokenizer.bos_token_id] + tokenizer.convert_tokens_to_ids(input_codons) |
|
|
input_tensor = torch.tensor([input_tokens]) |
|
|
|
|
|
# Generate with the model |
|
|
with torch.no_grad(): |
|
|
outputs = model.generate( |
|
|
input_tensor, |
|
|
max_length=input_tensor.size(1) + 10, # Generate 10 more codons |
|
|
temperature=1.0, |
|
|
do_sample=True, |
|
|
pad_token_id=tokenizer.pad_token_id, |
|
|
eos_token_id=tokenizer.eos_token_id |
|
|
) |
|
|
|
|
|
# Decode results |
|
|
generated_tokens = outputs[0][input_tensor.size(1):].tolist() # Remove input part |
|
|
generated_codons = [tokenizer.decode([token_id]) for token_id in generated_tokens |
|
|
if token_id not in [tokenizer.pad_token_id, tokenizer.eos_token_id]] |
|
|
generated_sequence = ''.join(generated_codons) |
|
|
print(f"Input sequence: {input_sequence}") |
|
|
print(f"Generated sequence: {generated_sequence}") |
|
|
``` |
|
|
|
|
|
### 3. Synonym-Aware Generation |
|
|
|
|
|
```python |
|
|
from synonymous_logit_processor import generate_candidate_codons_with_generate |
|
|
from Bio.Seq import Seq |
|
|
|
|
|
# Generate synonymous alternatives for a sequence |
|
|
# The function includes the human genetic code by default |
|
|
initial_codons = ["ATG", "AAA", "CCC"] # Example codons |
|
|
|
|
|
# Generate optimized codons with synonym-aware decoding |
|
|
optimized_codons = generate_candidate_codons_with_generate( |
|
|
initial_codons, |
|
|
model=model, |
|
|
tokenizer=tokenizer, |
|
|
temperature=1.0, |
|
|
top_k=50, |
|
|
top_p=0.9 |
|
|
) |
|
|
|
|
|
print(f"Original: {initial_codons}") |
|
|
print(f"Optimized: {optimized_codons}") |
|
|
|
|
|
# Verify amino acid sequences are preserved |
|
|
original_aa = ''.join([str(Seq(codon).translate()) for codon in initial_codons]) |
|
|
optimized_aa = ''.join([str(Seq(codon).translate()) for codon in optimized_codons]) |
|
|
print(f"Original AA: {original_aa}") |
|
|
print(f"Optimized AA: {optimized_aa}") |
|
|
print(f"AA preserved: {original_aa == optimized_aa}") |
|
|
``` |
|
|
|
|
|
#### Using Custom Genetic Code |
|
|
|
|
|
```python |
|
|
# If you need a custom genetic code mapping |
|
|
custom_aa_to_codon = { |
|
|
'M': ['ATG'], 'K': ['AAA'], 'P': ['CCC'] # Simplified example |
|
|
# ... add your custom mappings |
|
|
} |
|
|
|
|
|
optimized_codons_custom = generate_candidate_codons_with_generate( |
|
|
initial_codons, |
|
|
model=model, |
|
|
tokenizer=tokenizer, |
|
|
aa_to_codon=custom_aa_to_codon, |
|
|
temperature=1.0 |
|
|
) |
|
|
``` |
|
|
|
|
|
### 4. Advanced Usage with Custom Constraints |
|
|
|
|
|
```python |
|
|
# Custom generation with specific amino acid constraints |
|
|
def generate_with_aa_constraint(target_aa_sequence, model, tokenizer, aa_to_codon=None): |
|
|
"""Generate codon sequence for specific amino acid sequence""" |
|
|
from synonymous_logit_processor import SynonymMaskingLogitsProcessor, aa_to_codon_human |
|
|
|
|
|
if aa_to_codon is None: |
|
|
aa_to_codon = aa_to_codon_human |
|
|
|
|
|
generated_codons = [] |
|
|
current_tokens = [tokenizer.bos_token_id] |
|
|
|
|
|
for aa in target_aa_sequence: |
|
|
# Create processor for current amino acid |
|
|
processor = SynonymMaskingLogitsProcessor(aa, tokenizer, aa_to_codon) |
|
|
|
|
|
# Generate next codon |
|
|
input_ids = torch.tensor([current_tokens]) |
|
|
output = model.generate( |
|
|
input_ids, |
|
|
max_length=len(current_tokens) + 1, |
|
|
logits_processor=[processor], |
|
|
do_sample=True, |
|
|
temperature=1.0, |
|
|
pad_token_id=tokenizer.pad_token_id |
|
|
) |
|
|
|
|
|
# Extract and store codon |
|
|
next_token = output[0][-1].item() |
|
|
codon = tokenizer.decode([next_token]) |
|
|
generated_codons.append(codon) |
|
|
current_tokens.append(next_token) |
|
|
|
|
|
return generated_codons |
|
|
|
|
|
# Example usage |
|
|
aa_sequence = "MKP" # Methionine-Lysine-Proline |
|
|
codons = generate_with_aa_constraint(aa_sequence, model, tokenizer) |
|
|
print(f"AA sequence: {aa_sequence}") |
|
|
print(f"Generated codons: {codons}") |
|
|
print(f"DNA sequence: {''.join(codons)}") |
|
|
|
|
|
# Verify the translation |
|
|
from Bio.Seq import Seq |
|
|
generated_dna = ''.join(codons) |
|
|
translated_aa = str(Seq(generated_dna).translate()) |
|
|
print(f"Verification - translated AA: {translated_aa}") |
|
|
print(f"Match: {aa_sequence == translated_aa}") |
|
|
``` |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
- **Base**: GPT-2 decoder architecture |
|
|
- **Vocabulary**: 67 tokens (64 codons + 3 special tokens: [PAD], [BOS], [EOS]) |
|
|
- **Tokenization**: Codon-level (3 nucleotides per token) |
|
|
- **Training**: Pretrained on Ensembl CDS sequences |
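The vocabulary size follows directly from the codon alphabet (4^3 = 64 codons plus the three special tokens); a quick sanity check in plain Python:

```python
from itertools import product

# All length-3 combinations of the four bases -> 64 codons
codons = ["".join(bases) for bases in product("ACGT", repeat=3)]
special_tokens = ["[PAD]", "[BOS]", "[EOS]"]

print(len(codons))                        # 64 codons
print(len(codons) + len(special_tokens))  # 67 tokens total
```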
|
|
|
|
|
## Use Cases |
|
|
|
|
|
1. **Codon optimization**: Generate alternative codon sequences while preserving the encoded amino acid sequence
|
|
2. **Sequence design**: Create biologically realistic DNA/mRNA sequences |
|
|
3. **Synthetic biology**: Design sequences with specific CAI/GC content properties |
|
|
4. **Research**: Study codon usage patterns and genetic biases |
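For codon optimization and sequence design it is often useful to score candidates after generation. The snippet below is a minimal GC-content check (not part of the repo); CAI scoring would additionally require a codon-usage reference table.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G/C nucleotides in a DNA sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

# Compare a synonymous candidate (same amino acids M-K-P) against the original
print(f"Original  GC: {gc_content('ATGAAACCC'):.2f}")
print(f"Candidate GC: {gc_content('ATGAAGCCG'):.2f}")
```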
|
|
|
|
|
## Important Notes |
|
|
|
|
|
- Input sequence length must be a multiple of 3 nucleotides (complete codons only)
|
|
- Model generates at codon-level granularity |
|
|
- The custom tokenizer and logits processor are required for correct tokenization and synonym-aware decoding
|
|
- Model is optimized for research use cases |
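A small helper (hypothetical, not shipped with the repo) that enforces the complete-codon requirement above before tokenization:

```python
def to_codons(sequence: str) -> list[str]:
    """Validate a coding sequence and split it into codon tokens (hypothetical helper)."""
    sequence = sequence.upper()
    if len(sequence) % 3 != 0:
        raise ValueError(f"Sequence length {len(sequence)} is not a multiple of 3")
    if set(sequence) - set("ACGT"):
        raise ValueError("Sequence contains characters other than A, C, G, T")
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

print(to_codons("ATGAAACCC"))  # ['ATG', 'AAA', 'CCC']
```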
|
|
|
|
|
## Files Structure |
|
|
|
|
|
``` |
|
|
codonGPT/ |
|
|
├── config.json                    # Model configuration
├── generation_config.json        # Generation parameters
├── pytorch_model.bin             # Model weights
├── tokenizer.py                  # Custom codon tokenizer
└── synonymous_logit_processor.py # Synonym-aware processor
|
|
``` |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use CodonGPT in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{rajbanshi2025codongpt, |
|
|
title={codonGPT: Reinforcement learning on a generative language model optimizes RNA sequences under biological constraints}, |
|
|
author={Rajbanshi, Binita and Guruacharya, Anuj}, |
|
|
journal={bioRxiv}, |
|
|
year={2025}, |
|
|
doi={10.1101/2025.06.25.661500}, |
|
|
url={https://doi.org/10.1101/2025.06.25.661500} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
Free for research use. For commercial applications, please contact Nanil Therapeutics Inc. |
|
|
|
|
|
## Support |
|
|
|
|
|
For questions and issues, please refer to the Hugging Face model page or contact the developers. |