anuj2054 committed on
Commit 09fad71 · verified · 1 Parent(s): 00c75b5

Update README.md

Files changed (1)
  1. README.md +240 -1
README.md CHANGED
@@ -12,4 +12,243 @@ sequence-level realism.
 
  - **Developed by:** Nanil Therapeutics Inc.
  - **Model type:** Transformer-based generative language model for protein-coding DNA/mRNA sequences
- - **License:** Free for research use
+ - **License:** Free for research use
+
+
+ # CodonGPT Quickstart Guide
+
+ ## Overview
+
+ CodonGPT is a transformer-based generative language model for protein-coding DNA/mRNA sequences. Developed by Nanil Therapeutics Inc., it generates sequences one codon at a time and is designed to capture synonymous codon structure.
+
+ ## Key Features
+
+ - **Codon-aware sequence design**: GPT-2 architecture trained on Ensembl CDS sequences
+ - **Synonymous structure learning**: Learns CAI (Codon Adaptation Index) and GC-content biases and other codon usage patterns
+ - **Custom tokenizer**: Processes sequences at the codon level (3-nucleotide chunks)
+ - **`SynonymMaskingLogitsProcessor`**: Restricts generation to synonymous codons, so alternatives encode the same amino acid sequence
+ - **Research license**: Free for research use
+
+ ## Installation
+
+ ```bash
+ # Install dependencies. Note: torch 2.6+ is required for security reasons
+ pip install torch==2.6.0 transformers biopython huggingface_hub
+ ```
+
+ **Download custom components**: CodonGPT uses a custom tokenizer and a custom logits processor, so you need to download these two files:
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # Download the custom tokenizer and logits processor into the working directory
+ hf_hub_download(repo_id="naniltx/codonGPT", filename="tokenizer.py", local_dir="./")
+ hf_hub_download(repo_id="naniltx/codonGPT", filename="synonymous_logit_processor.py", local_dir="./")
+ ```
+
+ **Alternative**: Download the files manually from https://huggingface.co/naniltx/codonGPT
+
+ ## Quick Start
+
+ ### 1. Load the Model and Components
+
+ ```python
+ import torch
+ from transformers import GPT2LMHeadModel
+
+ # Import custom components (downloaded above)
+ from tokenizer import CodonTokenizer
+ from synonymous_logit_processor import SynonymMaskingLogitsProcessor
+
+ # Load model directly from Hugging Face
+ model = GPT2LMHeadModel.from_pretrained("naniltx/codonGPT")
+ model.eval()
+
+ # Load custom tokenizer
+ tokenizer = CodonTokenizer()
+ ```
+
+ ### 2. Basic Sequence Generation
+
+ ```python
+ # Example: generate a codon sequence
+ input_sequence = "ATGAAACCC"  # Sample DNA sequence (length must be a multiple of 3)
+
+ # Tokenize input (codon-level tokenization)
+ input_codons = [input_sequence[i:i+3] for i in range(0, len(input_sequence), 3)]
+ input_tokens = [tokenizer.bos_token_id] + tokenizer.convert_tokens_to_ids(input_codons)
+ input_tensor = torch.tensor([input_tokens])
+
+ # Generate with the model
+ with torch.no_grad():
+     outputs = model.generate(
+         input_tensor,
+         max_length=input_tensor.size(1) + 10,  # Generate up to 10 more codons
+         temperature=1.0,
+         do_sample=True,
+         pad_token_id=tokenizer.pad_token_id,
+         eos_token_id=tokenizer.eos_token_id,
+     )
+
+ # Decode results
+ generated_tokens = outputs[0][input_tensor.size(1):].tolist()  # Drop the prompt tokens
+ generated_codons = [tokenizer.decode([token_id]) for token_id in generated_tokens
+                     if token_id not in [tokenizer.pad_token_id, tokenizer.eos_token_id]]
+ generated_sequence = ''.join(generated_codons)
+ print(f"Input sequence: {input_sequence}")
+ print(f"Generated sequence: {generated_sequence}")
+ ```
+
+ ### 3. Synonym-Aware Generation
+
+ ```python
+ from synonymous_logit_processor import generate_candidate_codons_with_generate
+ from Bio.Seq import Seq
+
+ # Generate synonymous alternatives for a sequence
+ # The function uses the human genetic code by default
+ initial_codons = ["ATG", "AAA", "CCC"]  # Example codons
+
+ # Generate optimized codons with synonym-aware decoding
+ optimized_codons = generate_candidate_codons_with_generate(
+     initial_codons,
+     model=model,
+     tokenizer=tokenizer,
+     temperature=1.0,
+     top_k=50,
+     top_p=0.9,
+ )
+
+ print(f"Original: {initial_codons}")
+ print(f"Optimized: {optimized_codons}")
+
+ # Verify that the amino acid sequence is preserved
+ original_aa = ''.join(str(Seq(codon).translate()) for codon in initial_codons)
+ optimized_aa = ''.join(str(Seq(codon).translate()) for codon in optimized_codons)
+ print(f"Original AA: {original_aa}")
+ print(f"Optimized AA: {optimized_aa}")
+ print(f"AA preserved: {original_aa == optimized_aa}")
+ ```
+
+ #### Using a Custom Genetic Code
+
+ ```python
+ # If you need a custom genetic code mapping (amino acid -> list of codons)
+ custom_aa_to_codon = {
+     'M': ['ATG'], 'K': ['AAA'], 'P': ['CCC'],  # Simplified example
+     # ... add your custom mappings
+ }
+
+ optimized_codons_custom = generate_candidate_codons_with_generate(
+     initial_codons,
+     model=model,
+     tokenizer=tokenizer,
+     aa_to_codon=custom_aa_to_codon,
+     temperature=1.0,
+ )
+ ```
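A custom mapping is a plain dictionary from one-letter amino acid codes to lists of codons. As a starting point for editing, the sketch below builds the full standard genetic code (NCBI translation table 1) in pure Python. The names `build_aa_to_codon` and `aa_to_codon_standard` are illustrative and not part of the CodonGPT repository, and whether the processor expects a `'*'` entry for stop codons is an assumption to check against `synonymous_logit_processor.py`.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def build_aa_to_codon():
    """Map each amino acid letter (and '*') to its list of synonymous codons."""
    table = defaultdict(list)
    for codon_bases, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS):
        table[amino_acid].append("".join(codon_bases))
    return dict(table)


# Hypothetical name; edit the resulting dict to restrict or change codon choices.
aa_to_codon_standard = build_aa_to_codon()

print(aa_to_codon_standard["M"])       # methionine has a single codon
print(aa_to_codon_standard["K"])       # lysine has two synonymous codons
print(len(aa_to_codon_standard["L"]))  # leucine has six
```

Removing codons from an entry (for example, dropping rare codons) narrows the choices available for that amino acid, assuming the processor only considers codons listed in the mapping.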
+
+ ### 4. Advanced Usage with Custom Constraints
+
+ ```python
+ # Custom generation constrained to a specific amino acid sequence
+ def generate_with_aa_constraint(target_aa_sequence, model, tokenizer, aa_to_codon=None):
+     """Generate a codon sequence that encodes the given amino acid sequence."""
+     from synonymous_logit_processor import SynonymMaskingLogitsProcessor, aa_to_codon_human
+
+     if aa_to_codon is None:
+         aa_to_codon = aa_to_codon_human
+
+     generated_codons = []
+     current_tokens = [tokenizer.bos_token_id]
+
+     for aa in target_aa_sequence:
+         # Create a processor that only allows codons for the current amino acid
+         processor = SynonymMaskingLogitsProcessor(aa, tokenizer, aa_to_codon)
+
+         # Generate exactly one more token (the next codon)
+         input_ids = torch.tensor([current_tokens])
+         output = model.generate(
+             input_ids,
+             max_length=len(current_tokens) + 1,
+             logits_processor=[processor],
+             do_sample=True,
+             temperature=1.0,
+             pad_token_id=tokenizer.pad_token_id,
+         )
+
+         # Extract and store the codon, then extend the context
+         next_token = output[0][-1].item()
+         codon = tokenizer.decode([next_token])
+         generated_codons.append(codon)
+         current_tokens.append(next_token)
+
+     return generated_codons
+
+ # Example usage
+ aa_sequence = "MKP"  # Methionine-Lysine-Proline
+ codons = generate_with_aa_constraint(aa_sequence, model, tokenizer)
+ print(f"AA sequence: {aa_sequence}")
+ print(f"Generated codons: {codons}")
+ print(f"DNA sequence: {''.join(codons)}")
+
+ # Verify the translation
+ from Bio.Seq import Seq
+ generated_dna = ''.join(codons)
+ translated_aa = str(Seq(generated_dna).translate())
+ print(f"Verification - translated AA: {translated_aa}")
+ print(f"Match: {aa_sequence == translated_aa}")
+ ```
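To make the masking step less opaque, here is a minimal pure-Python sketch of what synonym masking does conceptually: logits for tokens that do not encode the target amino acid are set to negative infinity, so sampling can only pick a synonymous codon. This is not the repository's `SynonymMaskingLogitsProcessor`; the toy vocabulary, token IDs, logits, and the helper names `mask_to_synonyms` and `softmax` are all made up for illustration.

```python
import math
import random

# Toy vocabulary: three special tokens followed by a few codons (hypothetical IDs).
VOCAB = ["[PAD]", "[BOS]", "[EOS]", "AAA", "AAG", "CCC", "ATG"]
SYNONYMS = {"K": ["AAA", "AAG"], "P": ["CCC"], "M": ["ATG"]}


def mask_to_synonyms(logits, amino_acid):
    """Set the logit of every token that does not encode amino_acid to -inf."""
    allowed = {VOCAB.index(codon) for codon in SYNONYMS[amino_acid]}
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def softmax(logits):
    """Convert logits to probabilities; -inf entries receive probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


raw_logits = [0.1, 0.2, 0.3, 1.5, 0.5, 2.0, 1.0]  # made-up model scores
probs = softmax(mask_to_synonyms(raw_logits, "K"))

# Only the lysine codons (AAA, AAG) keep non-zero probability.
choice = random.choices(VOCAB, weights=probs, k=1)[0]
print(choice)
```

Because the unmasked logits keep their relative values, the model's learned preference between synonymous codons still shapes the choice; the mask only removes options that would change the protein.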
+
+ ## Model Architecture
+
+ - **Base**: GPT-2 decoder architecture
+ - **Vocabulary**: 67 tokens (64 codons + 3 special tokens: [PAD], [BOS], [EOS])
+ - **Tokenization**: Codon-level (3 nucleotides per token)
+ - **Training**: Pretrained on Ensembl CDS sequences
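The vocabulary size follows directly from the alphabet: four nucleotides in three positions give 4³ = 64 codons, plus the three special tokens. The sketch below enumerates such a vocabulary; the token ordering and the `token_to_id` name are assumptions for illustration, and the real IDs are defined by `tokenizer.py`.

```python
from itertools import product

SPECIAL_TOKENS = ["[PAD]", "[BOS]", "[EOS]"]
CODONS = ["".join(bases) for bases in product("ACGT", repeat=3)]

# Assumed ordering: special tokens first, then codons alphabetically.
# The actual CodonTokenizer may assign different IDs.
token_to_id = {token: idx for idx, token in enumerate(SPECIAL_TOKENS + CODONS)}

print(len(CODONS))       # 64
print(len(token_to_id))  # 67
```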
+
+ ## Use Cases
+
+ 1. **Codon optimization**: Generate alternative codon sequences that preserve the amino acid sequence
+ 2. **Sequence design**: Create biologically realistic DNA/mRNA sequences
+ 3. **Synthetic biology**: Design sequences with specific CAI/GC content properties
+ 4. **Research**: Study codon usage patterns and genetic biases
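When screening generated candidates for GC content, a small helper is often enough. The sketch below computes overall GC content and GC3 (GC content at third codon positions); the function names are illustrative and not part of CodonGPT, and CAI is left out because it needs a reference codon usage table that this guide does not specify.

```python
def gc_content(sequence):
    """Fraction of G and C nucleotides in the whole sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc3_content(sequence):
    """Fraction of G and C at the third position of each codon."""
    seq = sequence.upper()
    if not seq or len(seq) % 3 != 0:
        raise ValueError("sequence length must be a non-zero multiple of 3")
    third_positions = seq[2::3]
    return (third_positions.count("G") + third_positions.count("C")) / len(third_positions)


example = "ATGAAACCC"
print(f"GC:  {gc_content(example):.3f}")   # 4 of 9 nucleotides are G or C
print(f"GC3: {gc3_content(example):.3f}")  # 2 of 3 third positions are G or C
```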
+
+ ## Important Notes
+
+ - Input sequence lengths must be a multiple of 3 nucleotides (complete codons)
+ - The model generates at codon-level granularity
+ - The custom tokenizer and logits processor are required for proper functionality
+ - The model is optimized for research use cases
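A short validation step before tokenization catches incomplete codons early. The following is a minimal sketch, assuming the tokenizer expects uppercase DNA letters (A, C, G, T) as in the examples; `split_into_codons` is a hypothetical helper, and converting U to T for mRNA input is an assumption rather than documented tokenizer behavior.

```python
def split_into_codons(sequence):
    """Validate a coding sequence and split it into 3-nucleotide codons."""
    # Assumption: mRNA input is mapped to the DNA alphabet (U -> T).
    seq = sequence.strip().upper().replace("U", "T")
    invalid = set(seq) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    if len(seq) % 3 != 0:
        raise ValueError(f"length {len(seq)} is not a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


print(split_into_codons("ATGAAACCC"))  # ['ATG', 'AAA', 'CCC']
```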
+
+ ## File Structure
+
+ ```
+ codonGPT/
+ ├── config.json                     # Model configuration
+ ├── generation_config.json          # Generation parameters
+ ├── pytorch_model.bin               # Model weights
+ ├── tokenizer.py                    # Custom codon tokenizer
+ └── synonymous_logit_processor.py   # Synonym-aware processor
+ ```
+
+ ## Citation
+
+ If you use CodonGPT in your research, please cite:
+
+ ```
+ @misc{codonGPT,
+   title={CodonGPT: Transformer-based Codon-aware Sequence Generation},
+   author={Nanil Therapeutics Inc.},
+   year={2024},
+   url={https://huggingface.co/naniltx/codonGPT}
+ }
+ ```
+
+ ## License
+
+ Free for research use. For commercial applications, please contact Nanil Therapeutics Inc.
+
+ ## Support
+
+ For questions and issues, please refer to the Hugging Face model page or contact the developers.