| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8"> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> | |
| <title>ENCOT: Enhanced Codon Optimization Tool - Technical Documentation</title> | |
| <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/styles/atom-one-light.min.css"> | |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/highlight.min.js"></script> | |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/languages/python.min.js"></script> | |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/languages/yaml.min.js"></script> | |
| <link href="https://fonts.googleapis.com/css2?family=Computer+Modern+Serif:wght@400;700&family=Computer+Modern+Sans:wght@400;700&family=Computer+Modern+Typewriter&display=swap" rel="stylesheet"> | |
| <style> | |
| /* LaTeX-inspired Academic Styling */ | |
| @import url('https://fonts.googleapis.com/css2?family=Crimson+Text:ital,wght@0,400;0,600;0,700;1,400&family=Source+Code+Pro:wght@400;500&display=swap'); | |
| * { | |
| margin: 0; | |
| padding: 0; | |
| box-sizing: border-box; | |
| } | |
| body { | |
| font-family: 'Crimson Text', 'Georgia', serif; | |
| line-height: 1.6; | |
| color: #2c3e50; | |
| background: #f8f9fa; | |
| padding: 40px; | |
| max-width: 900px; | |
| margin: 0 auto; | |
| font-size: 11pt; | |
| } | |
| /* Academic Paper Header */ | |
| .paper-header { | |
| text-align: center; | |
| margin-bottom: 50px; | |
| padding: 30px 0; | |
| border-bottom: 2px solid #2c3e50; | |
| } | |
| .paper-header h1 { | |
| font-size: 28pt; | |
| font-weight: 700; | |
| margin-bottom: 20px; | |
| color: #1a1a1a; | |
| letter-spacing: -0.5px; | |
| } | |
| .paper-header .subtitle { | |
| font-size: 14pt; | |
| font-style: italic; | |
| color: #555; | |
| margin-bottom: 25px; | |
| } | |
| .paper-header .authors { | |
| font-size: 11pt; | |
| color: #444; | |
| margin-bottom: 10px; | |
| } | |
| .paper-header .affiliation { | |
| font-size: 10pt; | |
| color: #666; | |
| font-style: italic; | |
| } | |
| /* Section Styling */ | |
| .section { | |
| margin: 40px 0; | |
| page-break-inside: avoid; | |
| background: white; | |
| padding: 25px; | |
| border: 1px solid #ddd; | |
| box-shadow: 0 1px 3px rgba(0,0,0,0.05); | |
| } | |
| .section-number { | |
| font-weight: 700; | |
| color: #2c3e50; | |
| font-size: 14pt; | |
| } | |
| .section-title { | |
| font-size: 16pt; | |
| font-weight: 700; | |
| color: #2c3e50; | |
| margin: 15px 0 20px 0; | |
| border-bottom: 1px solid #ccc; | |
| padding-bottom: 8px; | |
| } | |
| .abstract, .description { | |
| text-align: justify; | |
| margin: 15px 0; | |
| text-indent: 0; | |
| hyphens: auto; | |
| } | |
| .abstract { | |
| font-size: 10.5pt; | |
| padding: 15px; | |
| background: #f9f9f9; | |
| border-left: 3px solid #3498db; | |
| font-style: italic; | |
| } | |
| /* Code Blocks - LaTeX Listing Style */ | |
| .code-container { | |
| margin: 20px 0; | |
| border: 1px solid #ccc; | |
| background: #fafafa; | |
| } | |
| .code-header { | |
| background: #e8e8e8; | |
| padding: 8px 15px; | |
| border-bottom: 1px solid #ccc; | |
| font-family: 'Source Code Pro', monospace; | |
| font-size: 9pt; | |
| color: #555; | |
| } | |
| .listing-number { | |
| font-weight: 600; | |
| color: #2c3e50; | |
| } | |
| pre { | |
| margin: 0; | |
| padding: 15px; | |
| overflow-x: auto; | |
| background: white; | |
| border: none; | |
| } | |
| pre code { | |
| font-family: 'Source Code Pro', 'Courier New', monospace; | |
| font-size: 9pt; | |
| line-height: 1.4; | |
| color: #2c3e50; | |
| } | |
| /* Annotations and Highlights */ | |
| .annotation { | |
| background: #fff3cd; | |
| border-left: 4px solid #ffc107; | |
| padding: 12px 15px; | |
| margin: 15px 0; | |
| font-size: 10pt; | |
| } | |
| .annotation strong { | |
| color: #856404; | |
| } | |
| .key-concept { | |
| background: #d1ecf1; | |
| border-left: 4px solid #0c5460; | |
| padding: 12px 15px; | |
| margin: 15px 0; | |
| font-size: 10pt; | |
| } | |
| .mathematical { | |
| font-family: 'Crimson Text', serif; | |
| font-style: italic; | |
| text-align: center; | |
| padding: 15px; | |
| margin: 20px 0; | |
| background: #f9f9f9; | |
| border: 1px solid #ddd; | |
| font-size: 11pt; | |
| } | |
| /* File References */ | |
| .file-ref { | |
| font-family: 'Source Code Pro', monospace; | |
| font-size: 9pt; | |
| color: #2c3e50; | |
| background: #f4f4f4; | |
| padding: 8px 12px; | |
| border-left: 3px solid #3498db; | |
| margin: 15px 0; | |
| } | |
| .file-path { | |
| font-weight: 600; | |
| color: #2980b9; | |
| } | |
| /* Handwritten-style Notes */ | |
| .handwritten-note { | |
| border: 2px dashed #95a5a6; | |
| padding: 15px; | |
| margin: 20px 0; | |
| background: #fef9e7; | |
| font-size: 10pt; | |
| position: relative; | |
| } | |
| .handwritten-note::before { | |
| content: "Important Note:"; | |
| font-weight: 600; | |
| color: #7f8c8d; | |
| display: block; | |
| margin-bottom: 8px; | |
| } | |
| /* Algorithm/Pseudocode Box */ | |
| .algorithm-box { | |
| border: 2px solid #2c3e50; | |
| padding: 20px; | |
| margin: 20px 0; | |
| background: white; | |
| } | |
| .algorithm-title { | |
| font-weight: 700; | |
| text-align: center; | |
| margin-bottom: 15px; | |
| font-size: 11pt; | |
| text-transform: uppercase; | |
| letter-spacing: 1px; | |
| } | |
| .algorithm-content { | |
| font-family: 'Source Code Pro', monospace; | |
| font-size: 9.5pt; | |
| line-height: 1.8; | |
| } | |
| /* Equation Styling */ | |
| .equation { | |
| text-align: center; | |
| margin: 25px 0; | |
| font-size: 12pt; | |
| font-family: 'Crimson Text', serif; | |
| } | |
| .equation-label { | |
| float: right; | |
| font-size: 10pt; | |
| color: #7f8c8d; | |
| } | |
| /* Table Styling */ | |
| table { | |
| width: 100%; | |
| border-collapse: collapse; | |
| margin: 20px 0; | |
| font-size: 10pt; | |
| } | |
| th, td { | |
| border: 1px solid #bbb; | |
| padding: 8px 12px; | |
| text-align: left; | |
| } | |
| th { | |
| background: #ecf0f1; | |
| font-weight: 600; | |
| } | |
| /* Footer */ | |
| .footer { | |
| margin-top: 50px; | |
| padding-top: 20px; | |
| border-top: 1px solid #ccc; | |
| text-align: center; | |
| font-size: 9pt; | |
| color: #7f8c8d; | |
| } | |
| /* Print Styles - Optimized for minimal spacing */ | |
| @page { | |
| size: A4; | |
| margin: 1.2cm 1.5cm; | |
| } | |
| @page :first { | |
| margin-top: 1.5cm; | |
| } | |
| @media print { | |
| * { | |
| -webkit-print-color-adjust: exact ; | |
| print-color-adjust: exact ; | |
| } | |
| body { | |
| background: white; | |
| padding: 0; | |
| margin: 0; | |
| font-size: 9.5pt; | |
| line-height: 1.35; | |
| } | |
| /* Minimize margins */ | |
| .paper-header { | |
| margin-bottom: 15px; | |
| padding: 10px 0; | |
| page-break-after: avoid; | |
| } | |
| .paper-header h1 { | |
| font-size: 20pt; | |
| margin-bottom: 8px; | |
| } | |
| .paper-header .subtitle { | |
| font-size: 10pt; | |
| margin: 3px 0; | |
| } | |
| .abstract { | |
| margin: 12px 0; | |
| padding: 10px; | |
| page-break-after: avoid; | |
| page-break-inside: avoid; | |
| } | |
| /* Section optimization - ALLOW BREAKS */ | |
| .section { | |
| box-shadow: none; | |
| border: none; | |
| padding: 8px 10px; | |
| margin: 5px 0; | |
| page-break-inside: auto; /* Changed from avoid */ | |
| background: white; | |
| } | |
| .section-title { | |
| font-size: 12pt; | |
| margin-bottom: 6px; | |
| page-break-after: avoid; | |
| } | |
| .description { | |
| margin: 5px 0; | |
| font-size: 9.5pt; | |
| line-height: 1.35; | |
| } | |
| /* Code containers - allow breaks */ | |
| .code-container { | |
| page-break-inside: auto; | |
| margin: 8px 0; | |
| padding: 6px; | |
| border: 1px solid #ccc; | |
| } | |
| .code-header { | |
| padding: 4px 6px; | |
| margin-bottom: 4px; | |
| page-break-after: avoid; | |
| font-size: 9pt; | |
| } | |
| pre { | |
| margin: 0; | |
| padding: 6px; | |
| font-size: 7.5pt; | |
| line-height: 1.25; | |
| white-space: pre-wrap; | |
| word-wrap: break-word; | |
| } | |
| code { | |
| font-size: 7.5pt; | |
| line-height: 1.25; | |
| } | |
| /* File references */ | |
| .file-ref { | |
| margin: 5px 0; | |
| padding: 4px 6px; | |
| font-size: 8.5pt; | |
| page-break-inside: avoid; | |
| } | |
| .file-path { | |
| font-size: 8.5pt; | |
| } | |
| /* Mathematical content */ | |
| .mathematical { | |
| margin: 8px 0; | |
| padding: 6px; | |
| font-size: 9.5pt; | |
| page-break-inside: avoid; | |
| } | |
| .equation { | |
| margin: 8px 0; | |
| font-size: 10pt; | |
| } | |
| /* Key concepts and notes */ | |
| .key-concept { | |
| margin: 8px 0; | |
| padding: 6px; | |
| font-size: 9pt; | |
| page-break-inside: avoid; | |
| } | |
| .key-concept ul { | |
| margin: 4px 0 0 12px; | |
| } | |
| .key-concept li { | |
| margin: 2px 0; | |
| line-height: 1.25; | |
| } | |
| .handwritten-note { | |
| margin: 8px 0; | |
| padding: 6px; | |
| font-size: 8.5pt; | |
| page-break-inside: avoid; | |
| } | |
| .handwritten-note::before { | |
| margin-bottom: 4px; | |
| } | |
| /* Algorithm boxes */ | |
| .algorithm-box { | |
| margin: 8px 0; | |
| padding: 8px; | |
| page-break-inside: auto; /* Allow break for long algorithms */ | |
| } | |
| .algorithm-title { | |
| font-size: 10pt; | |
| margin-bottom: 6px; | |
| } | |
| .algorithm-content { | |
| font-size: 8pt; | |
| line-height: 1.4; | |
| } | |
| /* Tables */ | |
| table { | |
| margin: 8px 0; | |
| font-size: 8.5pt; | |
| page-break-inside: auto; | |
| } | |
| th, td { | |
| padding: 4px 6px; | |
| font-size: 8.5pt; | |
| } | |
| /* Page break control */ | |
| h1, h2, h3, .section-title { | |
| page-break-after: avoid; | |
| } | |
| .section:first-of-type { | |
| page-break-before: avoid; | |
| } | |
| /* Keep title with at least some content */ | |
| .section-title + .description, | |
| .code-header + pre { | |
| page-break-before: avoid; | |
| } | |
| /* Hide unnecessary elements */ | |
| .footer { | |
| display: none; | |
| } | |
| /* Compact spacing for lists */ | |
| ul, ol { | |
| margin: 4px 0; | |
| padding-left: 18px; | |
| } | |
| li { | |
| margin: 1px 0; | |
| line-height: 1.25; | |
| } | |
| /* Orphan and widow control */ | |
| p, .description, .key-concept, .handwritten-note { | |
| orphans: 2; | |
| widows: 2; | |
| } | |
| /* Reduce all vertical spacing */ | |
| * + * { | |
| margin-top: 0 ; | |
| } | |
| } | |
| </style> | |
| </head> | |
| <body> | |
| <!-- Academic Paper Header --> | |
| <div class="paper-header"> | |
| <h1>ENCOT: Enhanced Codon Optimization Tool</h1> | |
| <div class="subtitle"> | |
| A Transformer-Based Approach with Augmented-Lagrangian Method<br> | |
| for Multi-Objective Codon Optimization in E. coli | |
| </div> | |
| <div class="authors"> | |
| Technical Implementation Documentation | |
| </div> | |
| </div> | |
| <!-- Abstract --> | |
| <div class="abstract"> | |
| <strong>Abstract:</strong> This document presents the technical implementation of ENCOT, a codon optimization | |
| system that combines a transformer-based model with an Augmented-Lagrangian Method (ALM) to hold GC content | |
| near a target value. The system optimizes several biological objectives simultaneously, including | |
| Codon Adaptation Index (CAI), tRNA Adaptation Index (tAI), GC content balance, and minimization of negative | |
| cis-regulatory elements. The implementation builds on the CodonTransformer architecture and adds | |
| constrained-optimization training aimed at expression in E. coli. | |
| </div> | |
| <!-- Section 1: Core Algorithm - ALM Implementation --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">1.</span> Augmented-Lagrangian Method Implementation | |
| </div> | |
| <div class="description"> | |
| The core innovation of ENCOT lies in its application of the Augmented-Lagrangian Method to enforce | |
| GC content constraints during training. This approach allows the model to balance multiple optimization | |
| objectives while maintaining biologically appropriate GC content levels. | |
| </div> | |
| <div class="mathematical"> | |
| <strong>Objective Function:</strong><br><br> | |
| <i>L</i> = <i>L</i><sub>MLM</sub> + λ·(<i>GC</i> − μ) + (ρ/2)·(<i>GC</i> − μ)² | |
| <div class="equation-label">(Eq. 1)</div> | |
| </div> | |
| <div class="key-concept"> | |
| <strong>Key Components:</strong> | |
| <ul style="margin: 10px 0 0 20px;"> | |
| <li><i>L<sub>MLM</sub></i>: Masked Language Modeling loss for codon prediction</li> | |
| <li>λ: Lagrangian multiplier (adaptively updated)</li> | |
| <li>ρ: Penalty coefficient (self-tuning based on progress)</li> | |
| <li><i>GC</i>: Mean GC content of predicted sequences</li> | |
| <li>μ: Target GC content (0.52 for E. coli)</li> | |
| </ul> | |
| </div> | |
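<div class="description">
As a quick numeric check of Eq. 1, the sketch below evaluates the objective for scalar inputs. The function name <code>alm_loss</code> and its argument names are illustrative and do not come from finetune.py; the defaults mirror the values quoted in this document (target 0.52, initial multiplier 0, initial penalty 10).
</div>

```python
def alm_loss(mlm_loss: float, gc: float, gc_target: float = 0.52,
             lam: float = 0.0, rho: float = 10.0) -> float:
    """Evaluate Eq. 1: L = L_MLM + lam*(GC - mu) + (rho/2)*(GC - mu)^2.

    Illustrative helper (not part of finetune.py); all inputs are plain floats.
    """
    violation = gc - gc_target          # signed constraint violation (GC - mu)
    linear_term = lam * violation       # Lagrangian (multiplier) term
    quadratic_term = 0.5 * rho * violation ** 2  # penalty term
    return mlm_loss + linear_term + quadratic_term


# With GC ten points above target, lam = 0 and rho = 10, the penalty
# adds roughly 0.05 to an MLM loss of 1.0.
print(alm_loss(mlm_loss=1.0, gc=0.62))
```

<div class="description">
Because the linear term is signed, a positive multiplier pushes GC content down when it is above target and up when it is below, while the quadratic term penalizes deviation in either direction.
</div>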
| <div class="file-ref"> | |
| <div class="file-path">File: finetune.py</div> | |
| Lines 73-148 | Class: plTrainHarness | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 1:</span> ALM Training Harness - Initialization | |
| </div> | |
| <pre><code class="language-python">class plTrainHarness(pl.LightningModule): | |
|     """ | |
|     PyTorch Lightning training harness for ENCOT with Augmented-Lagrangian | |
|     Method (ALM) GC control. | |
|     This class implements the training loop for fine-tuning CodonTransformer | |
|     on E. coli sequences with precise GC content control using an | |
|     Augmented-Lagrangian Method. The ALM approach allows the model to learn | |
|     codon preferences while maintaining GC content within a target range. | |
|     Key features: | |
|     - Masked language modeling (MLM) loss for codon prediction | |
|     - ALM-based GC content constraint enforcement | |
|     - Curriculum learning: warm-up epochs before enforcing GC constraints | |
|     - Adaptive penalty coefficient (rho) adjustment based on constraint | |
|       violation progress | |
|     The ALM method minimizes: | |
|         L = L_MLM + λ·(GC - μ) + (ρ/2)(GC - μ)² | |
|     where λ is the Lagrangian multiplier and ρ is the penalty coefficient. | |
|     """ | |
|     def __init__(self, model, learning_rate, warmup_fraction, | |
|                  gc_penalty_weight, tokenizer, gc_target=0.52, | |
|                  use_lagrangian=False, lagrangian_rho=10.0, | |
|                  curriculum_epochs=3, alm_tolerance=1e-5, | |
|                  alm_dual_tolerance=1e-5, alm_penalty_update_factor=10.0, | |
|                  alm_initial_penalty_factor=20.0, | |
|                  alm_tolerance_update_factor=0.1, | |
|                  alm_rel_penalty_increase_threshold=0.1, | |
|                  alm_max_penalty=1e6, alm_min_penalty=1e-6): | |
|         super().__init__() | |
|         self.model = model | |
|         self.learning_rate = learning_rate | |
|         self.warmup_fraction = warmup_fraction | |
|         self.gc_penalty_weight = gc_penalty_weight | |
|         self.tokenizer = tokenizer | |
|         # Augmented-Lagrangian GC Control parameters | |
|         self.gc_target = gc_target | |
|         self.use_lagrangian = use_lagrangian | |
|         self.lagrangian_rho = lagrangian_rho | |
|         self.curriculum_epochs = curriculum_epochs | |
|         # Enhanced ALM parameters | |
|         self.alm_tolerance = alm_tolerance | |
|         self.alm_dual_tolerance = alm_dual_tolerance | |
|         self.alm_penalty_update_factor = alm_penalty_update_factor | |
|         self.alm_initial_penalty_factor = alm_initial_penalty_factor | |
|         self.alm_tolerance_update_factor = alm_tolerance_update_factor | |
|         self.alm_rel_penalty_increase_threshold = \ | |
|             alm_rel_penalty_increase_threshold | |
|         self.alm_max_penalty = alm_max_penalty | |
|         self.alm_min_penalty = alm_min_penalty | |
|         # Initialize Lagrangian multiplier as buffer | |
|         # (persists across checkpoints) | |
|         self.register_buffer("lambda_gc", torch.tensor(0.0)) | |
|         # Adaptive penalty coefficient (rho) | |
|         self.register_buffer("rho_adaptive", | |
|                              torch.tensor(self.lagrangian_rho)) | |
|         # Step counter for periodic lambda updates | |
|         self.register_buffer("step_counter", torch.tensor(0)) | |
|         # ALM convergence tracking | |
|         self.register_buffer("previous_constraint_violation", | |
|                              torch.tensor(float('inf')))</code></pre> | |
| </div> | |
| <div class="handwritten-note"> | |
| The initialization registers the Lagrangian multiplier, the penalty coefficient, the step counter, and the | |
| previous violation as persistent buffers. Buffers are stored in the module's state dict, so they are saved with | |
| each checkpoint and a resumed run continues from the same multiplier and penalty. With the default | |
| curriculum_epochs=3, the first three epochs train on the MLM loss alone, giving the model time to learn basic codon patterns before the GC constraint is enforced. | |
| </div> | |
| </div> | |
| <!-- Section 2: Training Step --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">2.</span> Training Step with ALM Loss Computation | |
| </div> | |
| <div class="description"> | |
| The training step combines standard masked language modeling with the ALM-based GC constraint. | |
| During each forward pass, we compute GC content from predicted tokens and apply the Lagrangian | |
| penalty to guide the model toward the target GC content. | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: finetune.py</div> | |
| Lines 150-230 | Method: training_step | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 2:</span> Training Step with ALM Loss | |
| </div> | |
| <pre><code class="language-python">def training_step(self, batch, batch_idx): | |
|     """ | |
|     Training step that computes MLM loss and applies ALM-based GC constraint. | |
|     The constraint is only enforced after curriculum_epochs warm-up period. | |
|     """ | |
|     outputs = self.model(**batch) | |
|     mlm_loss = outputs.loss | |
|     # Enhanced Lagrangian-based GC penalty | |
|     if self.use_lagrangian and self.current_epoch >= self.curriculum_epochs: | |
|         # Compute GC content from logits | |
|         logits = outputs.logits | |
|         predicted_tokens = torch.argmax(logits, dim=-1) | |
|         # Calculate GC content per sequence | |
|         gc_content_batch = [] | |
|         for seq_tokens in predicted_tokens: | |
|             # Filter to valid codon tokens (indices >= 26) | |
|             valid_tokens = seq_tokens[seq_tokens >= 26] | |
|             if len(valid_tokens) == 0: | |
|                 gc_content_batch.append(self.gc_target) | |
|                 continue | |
|             # Count G and C containing codons | |
|             # (G_indices and C_indices are token-id lists defined outside this listing) | |
|             gc_counts = sum(1 for token in valid_tokens | |
|                             if token.item() in G_indices + C_indices) | |
|             gc_content = gc_counts / len(valid_tokens) | |
|             gc_content_batch.append(gc_content) | |
|         # Mean GC content across batch | |
|         gc_content_mean = sum(gc_content_batch) / len(gc_content_batch) | |
|         # Compute GC constraint violation | |
|         gc_constraint = gc_content_mean - self.gc_target | |
|         # Augmented Lagrangian loss term | |
|         lagrangian_loss = ( | |
|             self.lambda_gc * gc_constraint + | |
|             (self.rho_adaptive / 2) * (gc_constraint ** 2) | |
|         ) | |
|         total_loss = mlm_loss + lagrangian_loss | |
|         # Log metrics | |
|         self.log("train/mlm_loss", mlm_loss, prog_bar=True) | |
|         self.log("train/gc_constraint", gc_constraint, prog_bar=True) | |
|         self.log("train/lagrangian_loss", lagrangian_loss, prog_bar=False) | |
|         self.log("train/lambda_gc", self.lambda_gc, prog_bar=False) | |
|         self.log("train/rho", self.rho_adaptive, prog_bar=False) | |
|         self.log("train/gc_content", gc_content_mean, prog_bar=True) | |
|         # Update Lagrangian multiplier periodically | |
|         self.step_counter += 1 | |
|         if self.step_counter % 20 == 0: | |
|             self._update_alm_parameters(gc_constraint) | |
|     else: | |
|         # During warm-up, only use MLM loss | |
|         total_loss = mlm_loss | |
|         self.log("train/mlm_loss", mlm_loss, prog_bar=True) | |
|     self.log("train/total_loss", total_loss, prog_bar=True) | |
|     return total_loss</code></pre> | |
| </div> | |
| <div class="annotation"> | |
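| <strong>Implementation Detail:</strong> The GC content is computed from the argmax of the logits rather than | |
| from the target sequences, so it measures what the model currently predicts. Note that argmax is not differentiable: as listed, the penalty term changes the reported loss and drives the multiplier and penalty updates, | |
| but it passes no gradient to the model weights. A differentiable variant would compute the expected GC content from the softmax probabilities instead. | |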
| </div> | |
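<div class="description">
The sketch below illustrates one way to obtain a GC measure that varies smoothly with the logits: take the softmax at each position and average each codon's GC fraction under those probabilities. It is a plain-Python illustration, not code from finetune.py. The three-codon vocabulary is hypothetical, and using the nucleotide-level GC fraction of each codon is an assumption (Listing 2 counts codons by membership in index lists instead). In a training loop the same arithmetic would be done on tensors so that autograd can propagate through it.
</div>

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gc_fraction(codon: str) -> float:
    """Fraction of G/C nucleotides in a codon (assumed GC measure)."""
    return sum(base in "GC" for base in codon) / len(codon)


def expected_gc(position_logits: List[List[float]], codons: List[str]) -> float:
    """Mean over positions of sum_k p_k * gc_fraction(codon_k).

    Unlike argmax, this quantity changes smoothly as the logits change.
    """
    per_position = []
    for logits in position_logits:
        probs = softmax(logits)
        per_position.append(
            sum(p * gc_fraction(c) for p, c in zip(probs, codons))
        )
    return sum(per_position) / len(per_position)


# Hypothetical three-codon vocabulary and two sequence positions.
vocab = ["GCG", "GCA", "AAA"]
logits = [[0.0, 0.0, 0.0], [2.0, 0.0, -2.0]]
print(expected_gc(logits, vocab))
```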
| </div> | |
| <!-- Section 3: Adaptive Parameter Update --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">3.</span> Adaptive ALM Parameter Updates | |
| </div> | |
| <div class="description"> | |
| The self-tuning mechanism adjusts Lagrangian multipliers and penalty coefficients based on | |
| constraint violation progress. This adaptive approach ensures convergence while maintaining | |
| numerical stability. | |
| </div> | |
| <div class="algorithm-box"> | |
| <div class="algorithm-title">Algorithm 1: Adaptive Penalty Update</div> | |
| <div class="algorithm-content"> | |
| <strong>Input:</strong> gc_constraint (current violation)<br> | |
| <strong>Output:</strong> Updated λ_gc and ρ_adaptive<br><br> | |
| 1. <strong>Compute</strong> relative_improvement ← <br> | |
| (prev_violation - current_violation) / prev_violation<br><br> | |
| 2. <strong>If</strong> |gc_constraint| ≤ tolerance <strong>then</strong><br> | |
| λ_gc ← λ_gc + ρ · gc_constraint<br> | |
| // Constraint satisfied, update multiplier only<br><br> | |
| 3. <strong>Else if</strong> relative_improvement < threshold <strong>then</strong><br> | |
| ρ ← min(ρ · update_factor, max_penalty)<br> | |
| λ_gc ← λ_gc + ρ · gc_constraint<br> | |
| // Insufficient progress, increase penalty<br><br> | |
| 4. <strong>Else</strong><br> | |
| λ_gc ← λ_gc + ρ · gc_constraint<br> | |
| // Good progress, keep penalty stable<br><br> | |
| 5. prev_violation ← |gc_constraint| | |
| </div> | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: finetune.py</div> | |
| Lines 260-320 | Method: _update_alm_parameters | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 3:</span> Adaptive Parameter Update Implementation | |
| </div> | |
| <pre><code class="language-python">def _update_alm_parameters(self, gc_constraint): | |
|     """ | |
|     Update Lagrangian multiplier and penalty coefficient according to ALM. | |
|     This implements the adaptive penalty update strategy: | |
|     - If constraint violation is decreasing sufficiently, update lambda | |
|       and keep rho | |
|     - If constraint violation is not improving, increase rho | |
|       (penalty coefficient) | |
|     """ | |
|     # float() accepts both a Python float (as produced in training_step) | |
|     # and a 0-dim tensor; calling .item() would fail on a plain float | |
|     constraint_violation = abs(float(gc_constraint)) | |
|     # Check if we're making sufficient progress | |
|     relative_improvement = ( | |
|         (self.previous_constraint_violation - constraint_violation) / | |
|         max(self.previous_constraint_violation, 1e-8) | |
|     ) | |
|     if constraint_violation <= self.alm_tolerance: | |
|         # Constraint satisfied - update lambda, optionally reduce rho | |
|         self.lambda_gc = self.lambda_gc + self.rho_adaptive * gc_constraint | |
|         # Could reduce rho here if desired, but keeping it stable | |
|         # works well in practice | |
|     elif relative_improvement < self.alm_rel_penalty_increase_threshold: | |
|         # Not making enough progress - increase penalty | |
|         self.rho_adaptive = torch.clamp( | |
|             self.rho_adaptive * self.alm_penalty_update_factor, | |
|             min=self.alm_min_penalty, | |
|             max=self.alm_max_penalty | |
|         ) | |
|         # Also update lambda | |
|         self.lambda_gc = self.lambda_gc + self.rho_adaptive * gc_constraint | |
|     else: | |
|         # Making good progress - just update lambda | |
|         self.lambda_gc = self.lambda_gc + self.rho_adaptive * gc_constraint | |
|     # Update tracking | |
|     self.previous_constraint_violation = torch.tensor(constraint_violation)</code></pre> | |
| </div> | |
| <div class="handwritten-note"> | |
| The key insight here is the relative improvement threshold. Updates run every 20 training steps; if the | |
| constraint violation has not shrunk by at least 10% (the default threshold) since the previous update, the penalty | |
| coefficient is multiplied by alm_penalty_update_factor (default 10) and capped at alm_max_penalty (default 1e6). This keeps | |
| the optimization from stalling in regions where the constraint stays violated. | |
| </div> | |
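<div class="description">
To see how the multiplier and penalty evolve, the following plain-Python sketch replays the update rule of Algorithm 1 on a short sequence of violations. The <code>AlmState</code> dataclass and <code>update_alm</code> function are illustrative stand-ins for the buffers and method in Listing 3, with defaults copied from the constructor in Listing 1. Since every branch of the algorithm updates the multiplier, the sketch folds the three branches into one conditional penalty increase followed by a single multiplier update.
</div>

```python
import math
from dataclasses import dataclass


@dataclass
class AlmState:
    """Illustrative stand-in for the lambda_gc / rho_adaptive buffers."""
    lam: float = 0.0
    rho: float = 10.0
    prev_violation: float = math.inf


def update_alm(state: AlmState, gc_constraint: float,
               tolerance: float = 1e-5, threshold: float = 0.1,
               update_factor: float = 10.0,
               min_penalty: float = 1e-6, max_penalty: float = 1e6) -> AlmState:
    """One adaptive ALM update following Algorithm 1."""
    violation = abs(gc_constraint)
    # On the first call prev_violation is inf, so this evaluates to nan;
    # "nan < threshold" is False, which leaves rho unchanged.
    relative_improvement = (
        (state.prev_violation - violation) / max(state.prev_violation, 1e-8)
    )
    # Increase rho only when the constraint is violated AND progress stalls.
    if violation > tolerance and relative_improvement < threshold:
        state.rho = min(max(state.rho * update_factor, min_penalty), max_penalty)
    # Every branch of Algorithm 1 ends with the same multiplier update.
    state.lam += state.rho * gc_constraint
    state.prev_violation = violation
    return state


state = AlmState()
for violation in (0.10, 0.099, 0.05):
    update_alm(state, violation)
    print(f"violation={violation:.3f} rho={state.rho:g} lambda={state.lam:.3f}")
```

<div class="description">
In this run the second update sees only a 1% improvement, so the penalty jumps from 10 to 100; the third update sees a large improvement and leaves the penalty alone.
</div>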
| </div> | |
| <!-- Section 4: Prediction Function --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">4.</span> DNA Sequence Prediction with Constrained Search | |
| </div> | |
| <div class="description"> | |
| The prediction function supports multiple decoding strategies including deterministic (greedy), | |
| stochastic (temperature sampling), and constrained beam search with GC bounds. This flexibility | |
| allows users to balance between optimization quality and sequence diversity. | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: CodonTransformer/CodonPrediction.py</div> | |
| Lines 38-120 | Function: predict_dna_sequence | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 4:</span> Main Prediction Function Signature | |
| </div> | |
| <pre><code class="language-python">def predict_dna_sequence( | |
|     protein: str, | |
|     organism: Union[int, str], | |
|     device: torch.device, | |
|     tokenizer: Union[str, PreTrainedTokenizerFast] = None, | |
|     model: Union[str, torch.nn.Module] = None, | |
|     attention_type: str = "original_full", | |
|     deterministic: bool = True, | |
|     temperature: float = 0.2, | |
|     top_p: float = 0.95, | |
|     num_sequences: int = 1, | |
|     match_protein: bool = False, | |
|     use_constrained_search: bool = False, | |
|     gc_bounds: Tuple[float, float] = (0.30, 0.70), | |
|     beam_size: int = 5, | |
|     length_penalty: float = 1.0, | |
|     diversity_penalty: float = 0.0, | |
| ) -> Union[DNASequencePrediction, List[DNASequencePrediction]]: | |
|     """ | |
|     Predict the DNA sequence(s) for a given protein using ENCOT model. | |
|     This function takes a protein sequence and an organism (as ID or name) | |
|     as input and returns the predicted DNA sequence(s) using the ENCOT model. | |
|     It can use either provided tokenizer and model objects or load them from | |
|     specified paths. | |
|     Args: | |
|         protein (str): The input protein sequence for which to predict | |
|             the DNA sequence. | |
|         organism (Union[int, str]): Either the ID of the organism or its | |
|             name (e.g., "Escherichia coli general"). | |
|         device (torch.device): The device (CPU or GPU) to run the model on. | |
|         deterministic (bool, optional): Whether to use deterministic decoding | |
|             (most likely tokens). If False, samples tokens according to their | |
|             probabilities adjusted by the temperature. Defaults to True. | |
|         temperature (float, optional): A value controlling the randomness of | |
|             predictions during non-deterministic decoding. Lower values | |
|             (e.g., 0.2) make the model more conservative, while higher values | |
|             (e.g., 0.8) increase randomness. Defaults to 0.2. | |
|         use_constrained_search (bool, optional): Enable constrained beam | |
|             search with GC bounds. Defaults to False. | |
|         gc_bounds (Tuple[float, float], optional): GC content bounds | |
|             (min, max) for constrained search. Defaults to (0.30, 0.70). | |
|         beam_size (int, optional): Beam size for beam search. Defaults to 5. | |
|         match_protein (bool, optional): Ensures the predicted DNA sequence | |
|             translates to the input protein sequence by sampling from only | |
|             the respective codons of each amino acid. Defaults to False. | |
|     Returns: | |
|         Union[DNASequencePrediction, List[DNASequencePrediction]]: | |
|             Predicted DNA sequence(s) with associated metrics. | |
|     """</code></pre> | |
| </div> | |
| <div class="key-concept"> | |
| <strong>Decoding Strategies:</strong> | |
| <table style="margin-top: 15px;"> | |
| <tr> | |
| <th>Strategy</th> | |
| <th>Use Case</th> | |
| <th>Parameters</th> | |
| </tr> | |
| <tr> | |
| <td><strong>Greedy (deterministic)</strong></td> | |
| <td>Production optimization</td> | |
| <td>deterministic=True</td> | |
| </tr> | |
| <tr> | |
| <td><strong>Temperature Sampling</strong></td> | |
| <td>Diversity exploration</td> | |
| <td>deterministic=False, temperature=0.2-0.8</td> | |
| </tr> | |
| <tr> | |
| <td><strong>Constrained Beam Search</strong></td> | |
| <td>GC-constrained optimization</td> | |
| <td>use_constrained_search=True, gc_bounds=(0.45,0.55)</td> | |
| </tr> | |
| </table> | |
| </div> | |
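<div class="description">
The effect of the <code>temperature</code> and <code>top_p</code> parameters can be illustrated with a small standalone sketch of temperature scaling followed by nucleus (top-p) filtering. This is a generic illustration of the two techniques, not the internals of predict_dna_sequence, which are not shown in this document. The helper names and the leucine codon logits are hypothetical.
</div>

```python
import math
import random
from typing import Dict, Optional


def temperature_softmax(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax of logits / temperature; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {k: v / temperature for k, v in logits.items()}
    shift = max(scaled.values())
    exps = {k: math.exp(v - shift) for k, v in scaled.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest high-probability set whose mass reaches top_p, renormalized."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for key, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[key] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {k: p / total for k, p in kept.items()}


def sample_codon(logits: Dict[str, float], temperature: float = 0.2,
                 top_p: float = 0.95, rng: Optional[random.Random] = None) -> str:
    """Draw one codon after temperature scaling and nucleus filtering."""
    rng = rng or random.Random()
    probs = top_p_filter(temperature_softmax(logits, temperature), top_p)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Hypothetical logits for three leucine codons.
leucine_logits = {"CTG": 2.0, "TTA": 1.0, "CTA": 0.0}
for t in (0.2, 1.0):
    print(t, top_p_filter(temperature_softmax(leucine_logits, t), top_p=0.95))
```

<div class="description">
With these example logits, a temperature of 0.2 concentrates almost all probability on the top codon, so the nucleus contains a single candidate and sampling behaves like greedy decoding. At a temperature of 1.0 all three codons remain in the nucleus.
</div>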
| </div> | |
| <!-- Section 5: Evaluation Metrics --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">5.</span> Evaluation Metrics Implementation | |
| </div> | |
| <div class="description"> | |
| ENCOT computes comprehensive metrics to evaluate the quality of optimized sequences. The primary | |
| metrics are the Codon Adaptation Index (CAI) and tRNA Adaptation Index (tAI), which quantify how | |
| well the codon usage matches highly expressed E. coli genes and available tRNA pools, respectively. | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: CodonTransformer/CodonEvaluation.py</div> | |
| Lines 23-50, 370-420 | Functions: get_CSI_value, calculate_tAI | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 5:</span> CAI and tAI Calculation | |
| </div> | |
| <pre><code class="language-python">def get_CSI_weights(sequences: List[str]) -> Dict[str, float]: | |
|     """ | |
|     Calculate the Codon Similarity Index (CSI) weights for a list of | |
|     DNA sequences. | |
|     CSI is equivalent to CAI when computed from reference sequences. | |
|     Args: | |
|         sequences (List[str]): List of DNA sequences from highly expressed | |
|             genes. | |
|     Returns: | |
|         dict: The CSI weights (relative adaptiveness values per codon). | |
|     """ | |
|     return relative_adaptiveness(sequences=sequences) | |
| def get_CSI_value(dna: str, weights: Dict[str, float]) -> float: | |
|     """ | |
|     Calculate the Codon Similarity Index (CSI) for a DNA sequence. | |
|     This is the CAI score computed using pre-calculated weights. | |
|     Args: | |
|         dna (str): The DNA sequence. | |
|         weights (dict): The CSI weights from get_CSI_weights. | |
|     Returns: | |
|         float: The CSI value (range 0-1, higher is better). | |
|     """ | |
|     return CAI(dna, weights) | |
| def get_ecoli_tai_weights(): | |
|     """ | |
|     Returns pre-calculated tAI weights for E. coli K-12 MG1655. | |
|     These weights are based on tRNA gene copy numbers and wobble base | |
|     pairing rules. Higher weights indicate more available tRNA for | |
|     that codon. | |
|     Returns: | |
|         dict: Mapping from codon to tAI weight (0-1). | |
|     """ | |
|     return { | |
|         'TTT': 0.58, 'TTC': 0.42, 'TTA': 0.13, 'TTG': 0.13, | |
|         'TCT': 0.15, 'TCC': 0.15, 'TCA': 0.12, 'TCG': 0.15, | |
|         'TAT': 0.59, 'TAC': 0.41, 'TGT': 0.46, 'TGC': 0.54, | |
|         'TGG': 1.00, 'CTT': 0.11, 'CTC': 0.10, 'CTA': 0.04, | |
|         'CTG': 0.49, 'CCT': 0.16, 'CCC': 0.12, 'CCA': 0.19, | |
|         'CCG': 0.52, 'CAT': 0.57, 'CAC': 0.43, 'CAA': 0.34, | |
|         'CAG': 0.66, 'ATT': 0.51, 'ATC': 0.42, 'ATA': 0.07, | |
|         'ATG': 1.00, 'ACT': 0.17, 'ACC': 0.44, 'ACA': 0.13, | |
|         'ACG': 0.27, 'AAT': 0.49, 'AAC': 0.51, 'AAA': 0.76, | |
|         'AAG': 0.24, 'AGT': 0.15, 'AGC': 0.28, 'AGA': 0.07, | |
|         'AGG': 0.04, 'GTT': 0.28, 'GTC': 0.20, 'GTA': 0.15, | |
|         'GTG': 0.37, 'GCT': 0.18, 'GCC': 0.27, 'GCA': 0.21, | |
|         'GCG': 0.36, 'GAT': 0.63, 'GAC': 0.37, 'GAA': 0.68, | |
|         'GAG': 0.32, 'GGT': 0.35, 'GGC': 0.40, 'GGA': 0.11, | |
|         'GGG': 0.15, | |
|     } | |
| def calculate_tAI(sequence: str, tai_weights: Dict[str, float]) -> float: | |
|     """ | |
|     Calculate the tRNA Adaptation Index (tAI) for a DNA sequence. | |
|     The tAI is the geometric mean of the tAI weights for all codons in | |
|     the sequence (excluding stop codons). | |
|     Args: | |
|         sequence (str): DNA sequence (must be divisible by 3) | |
|         tai_weights (Dict[str, float]): tAI weights for each codon | |
|     Returns: | |
|         float: Geometric mean of tAI weights (range 0-1) | |
|     """ | |
|     if len(sequence) % 3 != 0: | |
|         raise ValueError("Sequence length must be divisible by 3") | |
|     # Split into codons | |
|     codons = [sequence[i:i+3].upper() for i in range(0, len(sequence), 3)] | |
|     # Get weights for non-stop codons | |
|     weights = [tai_weights.get(codon, 0.5) for codon in codons | |
|                if codon not in ['TAA', 'TAG', 'TGA']] | |
|     if not weights: | |
|         return 0.0 | |
|     # Compute geometric mean | |
|     product = 1.0 | |
|     for w in weights: | |
|         product *= w | |
|     return product ** (1.0 / len(weights))</code></pre> | |
| </div> | |
| <div class="annotation"> | |
| <strong>Metric Interpretation:</strong> Both CAI and tAI range from 0 to 1, with higher values | |
| indicating better optimization. In practice, for E. coli: | |
| <ul style="margin: 10px 0 0 20px;"> | |
| <li>CAI > 0.8 indicates excellent codon adaptation</li> | |
| <li>tAI > 0.4 suggests adequate tRNA availability</li> | |
| <li>Native E. coli genes typically have CAI around 0.65-0.75</li> | |
| </ul> | |
| </div> | |
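Both metrics are geometric means over per-codon weights, so a single poorly adapted codon lowers the score more than it would lower an arithmetic mean. The sketch below is an illustration only: the weights are made up rather than taken from the tables above, and the helper names are hypothetical. It also shows why a log-space formulation is the safer way to evaluate the mean for long genes, where a running product of values below 1 underflows.

```python
import math

def geometric_mean(weights):
    """Geometric mean computed in log space (weights must be positive)."""
    return math.exp(sum(math.log(w) for w in weights) / len(weights))

def naive_geometric_mean(weights):
    """Same quantity via a running product; underflows for long inputs."""
    product = 1.0
    for w in weights:
        product *= w
    return product ** (1.0 / len(weights))

# Illustrative weights only (not taken from any real codon table)
short_gene = [0.9, 0.8, 0.1, 0.9]
print("geometric mean: ", round(geometric_mean(short_gene), 3))
print("arithmetic mean:", round(sum(short_gene) / len(short_gene), 3))

# 2000 codons with weight 0.3: the true geometric mean is 0.3
long_gene = [0.3] * 2000
print("log space:      ", geometric_mean(long_gene))
print("running product:", naive_geometric_mean(long_gene))  # underflows to 0.0
```

The single 0.1 weight pulls the geometric mean of the short example well below its arithmetic mean, which is the behaviour that makes CAI and tAI sensitive to rare codons.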
| </div> | |
| <!-- Section 6: Training Configuration --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">6.</span> Training Configuration | |
| </div> | |
| <div class="description"> | |
| The training configuration specifies all hyperparameters including learning rate, batch size, | |
| and ALM-specific settings. This configuration reproduces the exact setup used in our experiments. | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: configs/train_ecoli_alm.yaml</div> | |
| Complete configuration file | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 6:</span> Complete Training Configuration | |
| </div> | |
| <pre><code class="language-yaml"># ENCOT ALM Training Configuration | |
| # This configuration reproduces the main training setup from the paper | |
| # using the Augmented-Lagrangian Method (ALM) for GC content control. | |
| model: | |
|   base_model: "adibvafa/CodonTransformer-base" | |
|   tokenizer: "adibvafa/CodonTransformer" | |
| data: | |
|   dataset_dir: "data" | |
|   # Expected files: finetune_set.json (created by preprocess_data.py) | |
| training: | |
|   batch_size: 6 | |
|   max_epochs: 15 | |
|   learning_rate: 5e-5 | |
|   warmup_fraction: 0.1 | |
|   num_workers: 5 | |
|   accumulate_grad_batches: 1 | |
|   num_gpus: 4 | |
|   save_every_n_steps: 512 | |
|   seed: 123 | |
|   log_every_n_steps: 20 | |
| checkpoint: | |
|   checkpoint_dir: "models/alm-enhanced-training" | |
|   checkpoint_filename: "balanced_alm_finetune.ckpt" | |
| # Augmented-Lagrangian Method (ALM) for GC content control | |
| alm: | |
|   enabled: true | |
|   gc_target: 0.52 # Target GC content for E. coli (52%) | |
|   curriculum_epochs: 3 # Warm-up epochs before enforcing GC constraint | |
|   # ALM penalty parameters | |
|   initial_penalty_factor: 20.0 | |
|   penalty_update_factor: 10.0 | |
|   max_penalty: 1e6 | |
|   min_penalty: 1e-6 | |
|   # ALM tolerance parameters | |
|   tolerance: 1e-5 # Primal tolerance | |
|   dual_tolerance: 1e-5 # Dual tolerance for constraint violation | |
|   tolerance_update_factor: 0.1 | |
|   # Adaptive penalty adjustment | |
|   rel_penalty_increase_threshold: 0.1 | |
| # Legacy penalty method (if ALM disabled) | |
| gc_penalty: | |
|   weight: 0.0 # Only used if use_lagrangian=false</code></pre> | |
| </div> | |
| <div class="key-concept"> | |
| <strong>Hyperparameter Selection Rationale:</strong> | |
| <table style="margin-top: 15px;"> | |
| <tr> | |
| <th>Parameter</th> | |
| <th>Value</th> | |
| <th>Rationale</th> | |
| </tr> | |
| <tr> | |
| <td>gc_target</td> | |
| <td>0.52</td> | |
| <td>Close to the native E. coli genome GC content (about 51%)</td> | |
| </tr> | |
| <tr> | |
| <td>curriculum_epochs</td> | |
| <td>3</td> | |
| <td>Allow basic pattern learning before constraint</td> | |
| </tr> | |
| <tr> | |
| <td>initial_penalty_factor</td> | |
| <td>20.0</td> | |
| <td>Moderate initial constraint enforcement</td> | |
| </tr> | |
| <tr> | |
| <td>penalty_update_factor</td> | |
| <td>10.0</td> | |
| <td>Aggressive adaptation for fast convergence</td> | |
| </tr> | |
| </table> | |
| </div> | |
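How these settings interact is easiest to see in a toy update loop. The following is a minimal sketch of a textbook augmented-Lagrangian update for a single equality constraint (batch-mean GC minus gc_target). It is not ENCOT's training code: the function names, the constraint definition, and the exact rule for when the penalty grows are assumptions made for illustration, and only the numeric defaults are taken from the configuration above.

```python
def alm_penalty_term(gc_mean, gc_target, multiplier, penalty):
    """Augmented-Lagrangian term: lambda * c + (rho / 2) * c**2."""
    violation = gc_mean - gc_target
    return multiplier * violation + 0.5 * penalty * violation ** 2

def alm_update(gc_mean, gc_target, multiplier, penalty, prev_violation,
               penalty_update_factor=10.0, max_penalty=1e6,
               rel_penalty_increase_threshold=0.1):
    """One dual update; returns (multiplier, penalty, abs_violation)."""
    violation = gc_mean - gc_target
    # Dual ascent step on the Lagrange multiplier
    multiplier = multiplier + penalty * violation
    # Assumed rule: grow the penalty when the violation failed to shrink
    # by at least the relative threshold since the previous update
    if prev_violation is not None:
        required = (1.0 - rel_penalty_increase_threshold) * prev_violation
        if abs(violation) > required:
            penalty = min(penalty * penalty_update_factor, max_penalty)
    return multiplier, penalty, abs(violation)

# Toy trajectory of batch-mean GC values drifting toward the 0.52 target
multiplier, penalty, prev = 0.0, 20.0, None
for gc_mean in [0.58, 0.579, 0.55, 0.53]:
    extra_loss = alm_penalty_term(gc_mean, 0.52, multiplier, penalty)
    multiplier, penalty, prev = alm_update(gc_mean, 0.52, multiplier,
                                           penalty, prev)
    print(f"gc={gc_mean:.3f} extra_loss={extra_loss:.5f} "
          f"lambda={multiplier:.3f} rho={penalty:.0f}")
```

In this toy run the penalty is multiplied by 10 only once, after the second step, because the violation barely moved between the first two batches. A large penalty_update_factor therefore reacts quickly to stalls, at the cost of a stiffer loss surface afterwards.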
| </div> | |
| <!-- Section 7: Data Validation --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">7.</span> Sequence Validation Pipeline | |
| </div> | |
| <div class="description"> | |
| Before training, all DNA sequences undergo rigorous validation to ensure biological correctness. | |
| Invalid sequences are filtered out to maintain data quality. | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: prepare_ecoli_data.py</div> | |
| Lines 5-30 | Function: is_valid_sequence | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 7:</span> Sequence Validation Function | |
| </div> | |
| <pre><code class="language-python">def is_valid_sequence(dna_seq: str) -> bool: | |
| """ | |
| Applies a series of validation checks to a DNA sequence. | |
| Validation criteria: | |
| 1. Length must be divisible by 3 (valid codon frame) | |
| 2. Must start with a valid start codon (ATG, TTG, CTG, or GTG) | |
| 3. Must end with a valid stop codon (TAA, TAG, or TGA) | |
| 4. Must not contain internal stop codons | |
| 5. Must contain only valid nucleotides (A, T, G, C) | |
| Args: | |
| dna_seq (str): The DNA sequence to validate. | |
| Returns: | |
| bool: True if the sequence passes all checks, False otherwise. | |
| """ | |
| # Check 1: Valid codon frame | |
| if len(dna_seq) % 3 != 0: | |
| return False | |
| # Check 2: Valid start codon | |
| if not dna_seq.upper().startswith(('ATG', 'TTG', 'CTG', 'GTG')): | |
| return False | |
| # Check 3: Valid stop codon | |
| if not dna_seq.upper().endswith(('TAA', 'TAG', 'TGA')): | |
| return False | |
| # Check 4: No internal stop codons (excluding the last codon) | |
| codons = [dna_seq[i:i+3].upper() | |
| for i in range(0, len(dna_seq) - 3, 3)] | |
| if any(codon in ['TAA', 'TAG', 'TGA'] for codon in codons): | |
| return False | |
| # Check 5: Only valid nucleotides | |
| if not all(c in 'ATGC' for c in dna_seq.upper()): | |
| return False | |
| return True</code></pre> | |
| </div> | |
| <div class="handwritten-note"> | |
| The validation function is intentionally strict to ensure high-quality training data. In our | |
| preprocessing of the E. coli genome, approximately 95% of sequences passed all validation checks. | |
| The most common reason for rejection was an internal stop codon, which typically points to a | |
| sequencing error or a pseudogene. | |
| </div> | |
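To find out which check rejects the most sequences, the validator can be paired with a variant that reports the first failed check instead of a bare boolean. The sketch below is a hypothetical helper, not part of prepare_ecoli_data.py: the name rejection_reason and the reason labels are invented here, and the nucleotide check is moved to the front so that ambiguous bases are reported before frame or codon problems.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = ("ATG", "TTG", "CTG", "GTG")

def rejection_reason(dna_seq):
    """Return None for a valid CDS, else the name of the first failed check."""
    seq = dna_seq.upper()
    if set(seq) - set("ATGC"):
        return "invalid_nucleotide"
    if len(seq) == 0 or len(seq) % 3 != 0:
        return "bad_frame"
    if not seq.startswith(START_CODONS):
        return "bad_start"
    if seq[-3:] not in STOP_CODONS:
        return "bad_stop"
    # Every codon except the last one must be a sense codon
    internal = (seq[i:i + 3] for i in range(0, len(seq) - 3, 3))
    if any(codon in STOP_CODONS for codon in internal):
        return "internal_stop"
    return None

# Hypothetical toy inputs, one per outcome
examples = ["ATGGCTTAA", "ATGTAAGCTTAA", "ATGGCTTA", "ATGNNNTAA", "GCTGCTTAA"]
tally = Counter(rejection_reason(s) or "valid" for s in examples)
print(tally)
```

Running such a tally over the full preprocessing input would give the per-check rejection breakdown directly.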
| </div> | |
| <!-- Section 8: Benchmark Evaluation --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">8.</span> Benchmark Evaluation Pipeline | |
| </div> | |
| <div class="description"> | |
| The benchmark pipeline evaluates ENCOT on a test set of protein sequences, computing multiple | |
| metrics for each optimized sequence and generating comprehensive performance reports. | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: benchmark_evaluation.py</div> | |
| Lines 300-400 | Function: benchmark_sequences | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 8:</span> Benchmark Evaluation Function | |
| </div> | |
| <pre><code class="language-python">import pandas as pd | |
| from tqdm import tqdm | |
| def benchmark_sequences(sequences, model, tokenizer, device, | |
|                         cai_weights, tai_weights): | |
|     """ | |
|     Run ENCOT on protein sequences and compute metrics for optimized DNA. | |
|     Args: | |
|         sequences: List of (name, protein) tuples to optimize | |
|         model: Loaded ENCOT model | |
|         tokenizer: Tokenizer for the model | |
|         device: PyTorch device (CPU/GPU) | |
|         cai_weights: Pre-computed CAI weights from reference sequences | |
|         tai_weights: Pre-computed tAI weights for E. coli | |
|     Returns: | |
|         DataFrame with columns: name, protein, optimized_dna, length, CAI, | |
|         tAI, GC_content, negative_cis_elements, max_homopolymer_length | |
|     """ | |
|     results = [] | |
|     for name, protein in tqdm(sequences, desc="Optimizing sequences"): | |
|         # Optimize the sequence using ENCOT | |
|         output = predict_dna_sequence( | |
|             protein=protein, | |
|             organism="Escherichia coli general", | |
|             device=device, | |
|             model=model, | |
|             tokenizer=tokenizer, | |
|             deterministic=True, | |
|             use_constrained_search=True, | |
|             gc_bounds=(0.45, 0.55) # E. coli optimal range | |
|         ) | |
|         optimized_dna = output.predicted_dna | |
|         # Calculate comprehensive metrics | |
|         cai = get_CSI_value(optimized_dna, cai_weights) | |
|         tai = calculate_tAI(optimized_dna, tai_weights) | |
|         gc_content = get_GC_content(optimized_dna) | |
|         cis_elements = count_negative_cis_elements(optimized_dna) | |
|         homopolymers = calculate_homopolymer_runs(optimized_dna) | |
|         results.append({ | |
|             'name': name, | |
|             'protein': protein, | |
|             'optimized_dna': optimized_dna, | |
|             'length': len(optimized_dna), | |
|             'CAI': cai, | |
|             'tAI': tai, | |
|             'GC_content': gc_content, | |
|             'negative_cis_elements': cis_elements, | |
|             'max_homopolymer_length': homopolymers | |
|         }) | |
|     return pd.DataFrame(results)</code></pre> | |
| </div> | |
| <div class="key-concept"> | |
| <strong>Benchmark Metrics Summary:</strong> | |
| <ul style="margin: 10px 0 0 20px;"> | |
| <li><strong>CAI:</strong> Measures codon usage similarity to highly expressed genes</li> | |
| <li><strong>tAI:</strong> Quantifies tRNA availability for translation</li> | |
| <li><strong>GC Content:</strong> Should be near 52% for E. coli</li> | |
| <li><strong>Negative cis-elements:</strong> Count of problematic regulatory sequences</li> | |
| <li><strong>Homopolymers:</strong> Long runs that cause synthesis issues</li> | |
| </ul> | |
| </div> | |
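Listing 8 calls calculate_homopolymer_runs, whose body is not shown in this document. As a stand-in, and under the assumption that the metric is simply the length of the longest run of a single repeated nucleotide, a minimal sketch could look like this (the function name max_homopolymer_run is hypothetical):

```python
from itertools import groupby

def max_homopolymer_run(dna_sequence):
    """Length of the longest run of one repeated nucleotide (0 if empty)."""
    run_lengths = (sum(1 for _ in run)
                   for _, run in groupby(dna_sequence.upper()))
    return max(run_lengths, default=0)

# ATG + AAAAA + GG + CC: the longest run is the five consecutive A's
print(max_homopolymer_run("ATGAAAAAGGCC"))  # prints 5
```

groupby collapses consecutive identical characters into one group each, so the function makes a single pass over the sequence.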
| </div> | |
| <!-- Section 9: Usage Example --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">9.</span> Complete Usage Example | |
| </div> | |
| <div class="description"> | |
| This example demonstrates a complete workflow: loading the model, optimizing a sequence, and | |
| evaluating the results. This is the recommended pattern for production use. | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 9:</span> End-to-End Optimization Workflow | |
| </div> | |
| <pre><code class="language-python">#!/usr/bin/env python3 | |
| """ | |
| Complete workflow example for ENCOT codon optimization. | |
| """ | |
| import torch | |
| from transformers import AutoTokenizer | |
| from CodonTransformer.CodonPrediction import load_model, predict_dna_sequence | |
| from CodonTransformer.CodonEvaluation import ( | |
| get_GC_content, calculate_tAI, get_CSI_value, | |
| get_ecoli_tai_weights, count_negative_cis_elements | |
| ) | |
| from CAI import relative_adaptiveness | |
| from huggingface_hub import hf_hub_download | |
| # Step 1: Setup device and load model | |
| device = torch.device("cuda" if torch.cuda.is_available() else "cpu") | |
| print(f"Using device: {device}") | |
| # Download model from HuggingFace | |
| checkpoint_path = hf_hub_download( | |
| repo_id="saketh11/ColiFormer", | |
| filename="balanced_alm_finetune.ckpt", | |
| cache_dir="./hf_cache" | |
| ) | |
| model = load_model(model_path=checkpoint_path, device=device) | |
| tokenizer = AutoTokenizer.from_pretrained("adibvafa/CodonTransformer") | |
| # Step 2: Define protein to optimize | |
| protein = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG" | |
| print(f"Input protein ({len(protein)} aa): {protein}") | |
| # Step 3: Optimize the sequence | |
| print("\nOptimizing...") | |
| output = predict_dna_sequence( | |
| protein=protein, | |
| organism="Escherichia coli general", | |
| device=device, | |
| model=model, | |
| tokenizer=tokenizer, | |
| deterministic=True, | |
| match_protein=True, | |
| use_constrained_search=True, | |
| gc_bounds=(0.45, 0.55), | |
| beam_size=20 | |
| ) | |
| optimized_dna = output.predicted_dna | |
| print(f"Optimized DNA ({len(optimized_dna)} bp): {optimized_dna[:60]}...") | |
| # Step 4: Evaluate metrics | |
| print("\nComputing metrics...") | |
| # Load reference weights | |
| tai_weights = get_ecoli_tai_weights() | |
| # For CAI, we need reference sequences (use E. coli highly expressed genes) | |
| # In practice, load from your reference dataset | |
| reference_sequences = load_reference_sequences() # Your function | |
| cai_weights = relative_adaptiveness(reference_sequences) | |
| # Calculate metrics | |
| cai = get_CSI_value(optimized_dna, cai_weights) | |
| tai = calculate_tAI(optimized_dna, tai_weights) | |
| gc = get_GC_content(optimized_dna) | |
| cis = count_negative_cis_elements(optimized_dna) | |
| # Step 5: Report results | |
| print("\n" + "="*50) | |
| print("OPTIMIZATION RESULTS") | |
| print("="*50) | |
| print(f"CAI (Codon Adaptation Index): {cai:.4f}") | |
| print(f"tAI (tRNA Adaptation Index): {tai:.4f}") | |
| print(f"GC Content: {gc:.2f}%") | |
| print(f"Negative cis-regulatory elements: {cis}") | |
| print("="*50) | |
| # Step 6: Verify translation | |
| from Bio.Seq import Seq | |
| # A trailing stop codon translates to "*", so strip it before comparing | |
| translated = str(Seq(optimized_dna).translate()).rstrip("*") | |
| assert translated == protein, "Translation mismatch!" | |
| print("\n✓ Optimized DNA correctly translates to input protein")</code></pre> | |
| </div> | |
| </div> | |
| <!-- Section 11: Constrained Beam Search --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">11.</span> Constrained Beam Search Implementation | |
| </div> | |
| <div class="description"> | |
| The constrained beam search algorithm ensures that generated DNA sequences maintain GC content within specified bounds. This method prunes candidates that violate constraints during generation, improving efficiency compared to post-hoc filtering. | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: CodonTransformer/CodonPrediction.py</div> | |
| Lines 850-950 | Function: _constrained_beam_search() | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 11:</span> Constrained Beam Search Core | |
| </div> | |
| <pre><code class="language-python">def _constrained_beam_search(model, input_ids, attention_mask, | |
|                              beam_size, gc_bounds, max_len, device): | |
|     """ | |
|     Constrained beam search that enforces GC content bounds during generation. | |
|     Args: | |
|         model: CodonTransformer model | |
|         input_ids: Tokenized input [batch_size, seq_len] | |
|         attention_mask: Attention mask | |
|         beam_size: Number of candidates to maintain | |
|         gc_bounds: (min_gc, max_gc) tuple for GC content, as fractions | |
|         max_len: Maximum sequence length | |
|         device: torch device | |
|     Returns: | |
|         Best sequence satisfying GC constraints | |
|     """ | |
|     # NOTE: `tokenizer` is not a parameter; it must be available at module scope | |
|     batch_size = input_ids.size(0) | |
|     min_gc, max_gc = gc_bounds | |
|     # Initialize beams: (sequence, score, gc_count, length in nucleotides) | |
|     beams = [(input_ids[0].clone(), 0.0, 0, 0)] | |
|     for step in range(max_len): | |
|         all_candidates = [] | |
|         for seq, score, gc_count, length in beams: | |
|             # Get model predictions | |
|             with torch.no_grad(): | |
|                 outputs = model(seq.unsqueeze(0)) | |
|                 logits = outputs.logits[0, -1, :] # Last position | |
|                 probs = torch.softmax(logits, dim=-1) | |
|             # Get top-k tokens | |
|             top_probs, top_indices = torch.topk(probs, beam_size * 2) | |
|             for prob, token_id in zip(top_probs, top_indices): | |
|                 # Decode token to codon, e.g. "l_ctg" -> "CTG" | |
|                 token = tokenizer.decode([token_id.item()]) | |
|                 codon = token.split('_')[-1].upper() | |
|                 if len(codon) != 3 or set(codon) - set('ACGT'): | |
|                     continue # Skip special and other non-codon tokens | |
|                 # Update the running GC content (counted over nucleotides) | |
|                 new_gc_count = gc_count + codon.count('G') + codon.count('C') | |
|                 new_length = length + 3 | |
|                 current_gc = new_gc_count / new_length | |
|                 # Check GC constraint (with some relaxation early on) | |
|                 relaxation = max(0.1, 1.0 - step / max_len) | |
|                 if min_gc - relaxation <= current_gc <= max_gc + relaxation: | |
|                     new_seq = torch.cat([seq, token_id.unsqueeze(0)]) | |
|                     new_score = score + torch.log(prob).item() | |
|                     all_candidates.append((new_seq, new_score, | |
|                                            new_gc_count, new_length)) | |
|         # Select top beams | |
|         all_candidates.sort(key=lambda x: x[1], reverse=True) | |
|         beams = all_candidates[:beam_size] | |
|         if not beams: | |
|             raise ValueError("No valid candidates found within GC bounds") | |
|     # Return best sequence | |
|     return beams[0][0]</code></pre> | |
| </div> | |
| <div class="handwritten-note"> | |
| The relaxation factor allows more flexibility early in generation, gradually tightening constraints as the sequence grows. This prevents premature pruning of potentially good candidates. | |
| </div> | |
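The effect of the schedule `relaxation = max(0.1, 1.0 - step / max_len)` from Listing 11 can be tabulated directly. The sketch below reuses that expression unchanged; the helper name effective_gc_window and the choice of max_len=100 are illustrative assumptions.

```python
def effective_gc_window(step, max_len, gc_bounds):
    """GC interval accepted at a given step under the relaxation schedule."""
    min_gc, max_gc = gc_bounds
    relaxation = max(0.1, 1.0 - step / max_len)
    return min_gc - relaxation, max_gc + relaxation

for step in (0, 50, 90, 99):
    low, high = effective_gc_window(step, max_len=100, gc_bounds=(0.45, 0.55))
    print(f"step {step:>2}: accept GC in [{low:.2f}, {high:.2f}]")
```

Two consequences follow from the numbers. For roughly the first half of generation the accepted interval covers the whole range from 0 to 1, so nothing is pruned. And because the relaxation never drops below 0.1, the tightest interval for bounds of (0.45, 0.55) is 0.35 to 0.65, so a sequence returned by this search is not guaranteed to lie inside the requested bounds.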
| </div> | |
| <!-- Section 12: GC Content Calculation --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">12.</span> GC Content Analysis | |
| </div> | |
| <div class="description"> | |
| Precise GC content calculation is critical for both training constraints and sequence evaluation. The implementation handles edge cases and provides window-based analysis for local GC variations. | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: CodonTransformer/CodonEvaluation.py</div> | |
| Lines 245-285 | Function: get_GC_content() | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 12:</span> GC Content Calculation | |
| </div> | |
| <pre><code class="language-python">from typing import List, Optional, Union | |
| def get_GC_content(dna_sequence: str, | |
| window_size: Optional[int] = None) -> Union[float, List[float]]: | |
| """ | |
| Calculate GC content of a DNA sequence. | |
| Args: | |
| dna_sequence: DNA sequence string | |
| window_size: If provided, calculate sliding window GC content | |
| Returns: | |
| GC content as percentage (0-100) or list of windowed values | |
| """ | |
| if not dna_sequence: | |
| raise ValueError("DNA sequence cannot be empty") | |
| # Convert to uppercase and validate | |
| dna_sequence = dna_sequence.upper() | |
| valid_bases = set('ATGC') | |
| if not all(base in valid_bases for base in dna_sequence): | |
| raise ValueError("DNA sequence contains invalid characters") | |
| if window_size is None: | |
| # Global GC content | |
| gc_count = dna_sequence.count('G') + dna_sequence.count('C') | |
| total = len(dna_sequence) | |
| return (gc_count / total) * 100.0 if total > 0 else 0.0 | |
| else: | |
| # Sliding window GC content | |
| if window_size <= 0 or window_size > len(dna_sequence): | |
| raise ValueError(f"Invalid window size: {window_size}") | |
| gc_values = [] | |
| for i in range(len(dna_sequence) - window_size + 1): | |
| window = dna_sequence[i:i + window_size] | |
| gc_count = window.count('G') + window.count('C') | |
| gc_pct = (gc_count / window_size) * 100.0 | |
| gc_values.append(gc_pct) | |
| return gc_values | |
| def calculate_gc_variance(dna_sequence: str, window_size: int = 100) -> float: | |
| """Calculate variance in GC content across sequence windows""" | |
| gc_values = get_GC_content(dna_sequence, window_size) | |
| if len(gc_values) < 2: | |
| return 0.0 | |
| mean_gc = sum(gc_values) / len(gc_values) | |
| variance = sum((x - mean_gc) ** 2 for x in gc_values) / len(gc_values) | |
| return variance</code></pre> | |
| </div> | |
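One detail is easy to miss when combining these pieces: get_GC_content returns a percentage (0-100), whereas gc_bounds and gc_target elsewhere in this document are fractions (0.45, 0.55, 0.52). A minimal sketch of a bounds check that keeps the units consistent is shown below; gc_fraction and within_gc_bounds are hypothetical helpers written for this example.

```python
def gc_fraction(dna_sequence):
    """GC content as a fraction in [0, 1]."""
    seq = dna_sequence.upper()
    if not seq:
        raise ValueError("DNA sequence cannot be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)

def within_gc_bounds(dna_sequence, gc_bounds=(0.45, 0.55)):
    """True when the sequence's GC fraction lies inside gc_bounds."""
    low, high = gc_bounds
    return low <= gc_fraction(dna_sequence) <= high

seq = "ATGGCGCTGAAATAA"  # 6 of 15 bases are G or C
print(f"{gc_fraction(seq) * 100:.1f}% GC, "
      f"within bounds: {within_gc_bounds(seq)}")
```

Comparing a percentage against fractional bounds would silently reject every sequence, so converting at one well-defined point is the safer design.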
| </div> | |
| <!-- Section 13: Tokenization Pipeline --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">13.</span> Sequence Tokenization | |
| </div> | |
| <div class="description"> | |
| The tokenization pipeline converts protein and DNA sequences into codon-level tokens that the transformer can process. Each codon is represented as a single token (e.g., "l_ctg" for leucine codon CTG). | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: CodonTransformer/CodonUtils.py</div> | |
| Lines 35-130 | Constant: TOKEN2INDEX | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 13:</span> Codon Tokenization Dictionary | |
| </div> | |
| <pre><code class="language-python"># Codon-to-token mapping: amino_acid_codon format | |
| TOKEN2INDEX = { | |
| "[PAD]": 0, # Padding token | |
| "[UNK]": 1, # Unknown token | |
| "[CLS]": 2, # Classification token | |
| "[SEP]": 3, # Separator token | |
| "[MASK]": 4, # Mask token for MLM | |
| # Amino acid codons (format: amino_codon) | |
| "a_gca": 62, # Alanine - GCA | |
| "a_gcc": 63, # Alanine - GCC | |
| "a_gcg": 64, # Alanine - GCG | |
| "a_gct": 65, # Alanine - GCT | |
| "c_tgc": 83, # Cysteine - TGC | |
| "c_tgt": 85, # Cysteine - TGT | |
| "d_gac": 59, # Aspartate - GAC | |
| "d_gat": 61, # Aspartate - GAT | |
| "e_gaa": 58, # Glutamate - GAA | |
| "e_gag": 60, # Glutamate - GAG | |
| "f_ttc": 87, # Phenylalanine - TTC | |
| "f_ttt": 89, # Phenylalanine - TTT | |
| "g_gga": 66, # Glycine - GGA | |
| "g_ggc": 67, # Glycine - GGC | |
| "g_ggg": 68, # Glycine - GGG | |
| "g_ggt": 69, # Glycine - GGT | |
| # ... (61 codon tokens total for all amino acids) | |
| "__taa": 74, # Stop codon - TAA | |
| "__tag": 76, # Stop codon - TAG | |
| "__tga": 82, # Stop codon - TGA | |
| } | |
| # Organism ID mapping (164 organisms supported) | |
| ORGANISM2ID = { | |
| "Escherichia coli general": 0, | |
| "Homo sapiens": 1, | |
| "Saccharomyces cerevisiae": 2, | |
| "Bacillus subtilis": 3, | |
| # ... (160 more organisms) | |
| } | |
| def get_merged_seq(protein: str, dna: str = "", | |
| include_start_codon: bool = True) -> str: | |
| """ | |
| Merge protein and DNA into codon tokens. | |
| For training: protein + DNA codons | |
| For inference: protein + [MASK] tokens | |
| Args: | |
| protein: Amino acid sequence | |
| dna: DNA sequence (empty for inference) | |
| include_start_codon: Add ATG start codon | |
| Returns: | |
| Space-separated codon tokens | |
| """ | |
| tokens = ["[CLS]"] | |
| if include_start_codon: | |
| tokens.append("m_atg") # Start codon | |
| # Convert protein to amino acid tokens | |
| for aa in protein.lower(): | |
| if dna: | |
| # Training: use actual codons from DNA | |
| codon = dna[:3].lower() | |
| dna = dna[3:] | |
| token = f"{aa}_{codon}" | |
| else: | |
| # Inference: use [MASK] for model to predict | |
| token = "[MASK]" | |
| tokens.append(token) | |
| tokens.append("[SEP]") | |
| return " ".join(tokens)</code></pre> | |
| </div> | |
| <div class="handwritten-note"> | |
| The codon token format (amino_codon) ensures the model learns both the amino acid identity and its preferred codon, enabling organism-specific optimization. | |
| </div> | |
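The amino_codon naming scheme can be reproduced in a few lines. The sketch below is a simplified illustration, not the library's tokenizer: to_codon_tokens is a hypothetical helper, and it assumes the stop position is written as "_" in the protein string, which yields the "__taa" style of stop token shown in Listing 13.

```python
def to_codon_tokens(protein, dna):
    """Pair each amino acid with its codon, e.g. ('L', 'CTG') -> 'l_ctg'."""
    if len(dna) != 3 * len(protein):
        raise ValueError("DNA length must be exactly 3x the protein length")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return [f"{aa}_{codon}".lower() for aa, codon in zip(protein, codons)]

print(to_codon_tokens("MAL", "ATGGCTCTG"))   # ['m_atg', 'a_gct', 'l_ctg']
print(to_codon_tokens("M_", "ATGTAA"))       # ['m_atg', '__taa']
```

Because the amino acid is part of the token, two codons for different amino acids can never share a token, and the model's output vocabulary at a masked position can be restricted to the codons of the known amino acid.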
| </div> | |
| <!-- Section 14: Model Architecture Details --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">14.</span> BigBird Transformer Architecture | |
| </div> | |
| <div class="description"> | |
| ENCOT employs a BigBird transformer with block-sparse attention, allowing it to process long sequences (up to 2048 tokens) efficiently. The model has 89.6 million parameters. | |
| </div> | |
| <div class="algorithm-box"> | |
| <div class="algorithm-title">Algorithm 2: Block-Sparse Attention</div> | |
| <div class="algorithm-content"> | |
| # BigBird Attention Patterns: | |
| # 1. Global attention: All positions attend to [CLS] token | |
| # 2. Random attention: Each position attends to r random positions | |
| # 3. Local attention: Each position attends to w neighboring positions | |
| # | |
| # Parameters: | |
| # - Block size: 64 tokens | |
| # - Number of random blocks: 3 | |
| # - Window size: 3 blocks (192 tokens) | |
| # | |
| # Complexity: O(n) instead of O(n²) for full attention | |
| for each query position i: | |
| # 1. Global tokens (always included) | |
| attend_to(CLS_token) | |
| # 2. Local window (w=3 blocks) | |
| for j in range(i - window_size, i + window_size): | |
| if 0 <= j < seq_len: | |
| attend_to(position_j) | |
| # 3. Random positions (r=3 blocks) | |
| random_positions = sample_random(num_blocks=3) | |
| for j in random_positions: | |
| attend_to(position_j) | |
| # Memory: O(n * (w + r + g)) where g = global tokens | |
| </div> | |
| </div> | |
| <div class="key-concept"> | |
| <strong>Model Configuration:</strong> | |
| <ul style="margin: 10px 0 0 20px;"> | |
| <li>Hidden size: 768</li> | |
| <li>Number of layers: 12</li> | |
| <li>Attention heads: 12</li> | |
| <li>Intermediate size: 3072</li> | |
| <li>Max position embeddings: 2048</li> | |
| <li>Vocabulary size: 95 tokens (61 codons + special tokens + organism IDs)</li> | |
| <li>Total parameters: 89,584,895</li> | |
| </ul> | |
| </div> | |
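A back-of-envelope count shows what the block-sparse pattern saves at the stated settings. The sketch below is an estimate under simplifying assumptions: it treats [CLS] as the only global token, ignores any overlap between random and window blocks, and counts query-key pairs for a single attention head. The function name attention_pairs is hypothetical.

```python
def attention_pairs(seq_len, block_size=64, window_blocks=3,
                    random_blocks=3, global_tokens=1):
    """Return (full, sparse) counts of query-key pairs for one head."""
    full = seq_len * seq_len
    keys_per_query = ((window_blocks + random_blocks) * block_size
                      + global_tokens)
    # A query can never attend to more keys than the sequence contains
    sparse = seq_len * min(keys_per_query, seq_len)
    return full, sparse

for n in (512, 2048):
    full, sparse = attention_pairs(n)
    print(f"n={n}: full={full:,} sparse={sparse:,} "
          f"ratio={full / sparse:.1f}x")
```

Each query sees a fixed 385 keys regardless of sequence length, so the sparse cost grows linearly with n. The saving is modest for short inputs and becomes substantial only near the 2048-token limit.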
| </div> | |
| <!-- Section 15: CAI Calculation Details --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">15.</span> Codon Adaptation Index (CAI) | |
| </div> | |
| <div class="description"> | |
| CAI measures how well a sequence's codon usage matches the host organism's preferred codons. Values range from 0 to 1, with higher values indicating better adaptation. | |
| </div> | |
| <div class="mathematical"> | |
| <strong>CAI Formula:</strong><br><br> | |
| <i>CAI</i> = exp( (1/<i>L</i>) · Σ ln(<i>w<sub>i</sub></i>) ) | |
| <div class="equation-label">(Eq. 2)</div> | |
| </div> | |
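A small worked example makes Eq. 2 concrete. The weights follow the standard definition of relative adaptiveness, where each codon's count in a reference set is divided by the count of the most frequent synonymous codon. Everything else below is illustrative: the two-amino-acid synonymous table, the reference counts, and the helper names toy_relative_adaptiveness and toy_cai are made up for this sketch.

```python
import math
from collections import Counter

# Toy synonymous-codon table covering two amino acids only (illustration)
SYNONYMOUS = {
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
}

def toy_relative_adaptiveness(reference_codons):
    """w = count(codon) / max count among its synonymous codons."""
    counts = Counter(reference_codons)
    weights = {}
    for codons in SYNONYMOUS.values():
        top = max(counts[c] for c in codons)
        for c in codons:
            weights[c] = counts[c] / top if top else 0.0
    return weights

def toy_cai(codons, weights):
    """Eq. 2: exp of the mean log-weight over the codons of the gene."""
    return math.exp(sum(math.log(weights[c]) for c in codons) / len(codons))

# Made-up reference usage: AAA and GAA are the preferred codons
reference = ["AAA"] * 8 + ["AAG"] * 2 + ["GAA"] * 6 + ["GAG"] * 3
weights = toy_relative_adaptiveness(reference)
print(weights)
print(round(toy_cai(["AAA", "GAA"], weights), 3))  # preferred codons only
print(round(toy_cai(["AAG", "GAG"], weights), 3))  # rare codons only
```

A gene built only from preferred codons reaches the maximum of 1.0, while the rare-codon version of the same peptide scores the geometric mean of 0.25 and 0.5.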
| <div class="file-ref"> | |
| <div class="file-path">File: CodonTransformer/CodonEvaluation.py</div> | |
| Lines 85-140 | Function: get_CSI_value() | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 15:</span> CAI Calculation | |
| </div> | |
| <pre><code class="language-python">def get_CSI_value(dna_sequence: str, weights: Dict[str, float]) -> float: | |
| """ | |
| Calculate Codon Adaptation Index (CAI) for a DNA sequence. | |
| CAI = exp( (1/L) * sum(ln(w_i)) ) | |
| where: | |
| L = number of codons | |
| w_i = relative adaptedness of codon i | |
| Args: | |
| dna_sequence: DNA sequence (must be multiple of 3) | |
| weights: Dictionary mapping codons to weights (0-1) | |
| Returns: | |
| CAI value (0-1, higher is better) | |
| """ | |
| import math | |
| from CAI import CAI as CAI_calculator | |
| if len(dna_sequence) % 3 != 0: | |
| raise ValueError("DNA sequence length must be multiple of 3") | |
| # Remove stop codons for CAI calculation | |
| stop_codons = {'TAA', 'TAG', 'TGA'} | |
| codons = [dna_sequence[i:i+3].upper() | |
| for i in range(0, len(dna_sequence), 3)] | |
| codons = [c for c in codons if c not in stop_codons] | |
| if not codons: | |
| return 0.0 | |
| # Calculate CAI using log-geometric mean | |
| try: | |
| cai = CAI_calculator( | |
| sequence=dna_sequence, | |
| weights=weights | |
| ) | |
| return cai | |
| except Exception as e: | |
| # Fallback: manual calculation | |
| log_sum = 0.0 | |
| count = 0 | |
| for codon in codons: | |
| if codon in weights: | |
| weight = weights[codon] | |
| if weight > 0: | |
| log_sum += math.log(weight) | |
| count += 1 | |
| if count == 0: | |
| return 0.0 | |
| cai = math.exp(log_sum / count) | |
| return cai | |
| def get_organism_cai_weights(organism: str) -> Dict[str, float]: | |
| """Load organism-specific CAI weights from reference genomes""" | |
| # Weights represent relative codon usage in highly expressed genes | |
| # Calculated from top 10% expressed genes in the organism | |
| import json | |
| weights_file = f"data/cai_weights/{organism.replace(' ', '_')}.json" | |
| with open(weights_file, 'r') as f: | |
| return json.load(f)</code></pre> | |
| </div> | |
| </div> | |
| <!-- Section 16: tAI Calculation --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">16.</span> tRNA Adaptation Index (tAI) | |
| </div> | |
| <div class="description"> | |
| tAI estimates translation efficiency based on tRNA availability and codon-anticodon binding strength. It accounts for wobble base pairing and tRNA gene copy numbers. | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: CodonTransformer/CodonEvaluation.py</div> | |
| Lines 180-240 | Function: calculate_tAI() | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 16:</span> tAI Calculation | |
| </div> | |
| <pre><code class="language-python">import math | |
| def calculate_tAI(dna_sequence: str, tai_weights: Dict[str, float]) -> float: | |
|     """ | |
|     Calculate tRNA Adaptation Index (tAI). | |
|     The pre-calculated per-codon weights account for: | |
|     1. tRNA gene copy numbers | |
|     2. Wobble base pairing efficiency | |
|     3. Codon-anticodon binding strength | |
|     tAI = geometric_mean( w_i ) | |
|     where: | |
|     w_i = weight of codon i, i.e. tRNA availability already scaled by | |
|           (1 - s), with s the selection coefficient (wobble penalty) | |
|     Args: | |
|         dna_sequence: DNA sequence | |
|         tai_weights: Pre-calculated weights per codon | |
|     Returns: | |
|         tAI value (0-1, higher indicates better translation efficiency) | |
|     """ | |
|     if len(dna_sequence) % 3 != 0: | |
|         raise ValueError("Sequence length must be multiple of 3") | |
|     codons = [dna_sequence[i:i+3].upper() | |
|               for i in range(0, len(dna_sequence), 3)] | |
|     # Remove stop codons | |
|     stop_codons = {'TAA', 'TAG', 'TGA'} | |
|     codons = [c for c in codons if c not in stop_codons] | |
|     if not codons: | |
|         return 0.0 | |
|     # Accumulate log-weights; a running product of many values below 1 | |
|     # would underflow to 0.0 for long sequences | |
|     log_sum = 0.0 | |
|     valid_count = 0 | |
|     for codon in codons: | |
|         weight = tai_weights.get(codon, 0.0) | |
|         if weight > 0: # Codons with a missing or zero weight are skipped | |
|             log_sum += math.log(weight) | |
|             valid_count += 1 | |
|     if valid_count == 0: | |
|         return 0.0 | |
|     # Geometric mean | |
|     return math.exp(log_sum / valid_count) | |
| # Wobble base pairing penalties | |
| WOBBLE_PENALTIES = { | |
| 'GU': 0.0, # Strong wobble (no penalty) | |
| 'GC': 0.0, # Watson-Crick (no penalty) | |
| 'AU': 0.0, # Watson-Crick (no penalty) | |
| 'GA': 0.5, # Weak wobble | |
| 'CA': 0.5, # Weak wobble | |
| 'IU': 0.1, # Inosine wobble | |
| 'IC': 0.1, # Inosine wobble | |
| 'IA': 0.3, # Inosine wobble (weaker) | |
| }</code></pre> | |
| </div> | |
| <div class="handwritten-note"> | |
| tAI is often regarded as more mechanistic than CAI because it models tRNA supply and codon-anticodon pairing rather than codon frequency alone. | |
| </div> | |
| </div> | |
| <!-- Section 17: Negative Cis-Elements Detection --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">17.</span> Regulatory Motif Detection | |
| </div> | |
| <div class="description"> | |
| Detection of negative cis-regulatory elements (e.g., cryptic splice sites, premature polyadenylation signals, restriction sites) that could interfere with gene expression. | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: CodonTransformer/CodonEvaluation.py</div> | |
| Lines 290-350 | Function: count_negative_cis_elements() | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 17:</span> Cis-Element Scanning | |
| </div> | |
| <pre><code class="language-python">def count_negative_cis_elements(dna_sequence: str, | |
| organism: str = "ecoli") -> int: | |
| """ | |
| Detect negative cis-regulatory elements in DNA sequence. | |
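| Scans for: | |
| - E. coli: Chi sites (GCTGGTGG), internal Shine-Dalgarno sequences, | |
| promoter-like boxes, and homopolymer runs of 8+ bases | |
| - Other organisms: polyadenylation signals (AATAAA, ATTAAA) and | |
| cryptic splice donor/acceptor motifs | |
| - All organisms: G or C runs of 6+ bases and overall GC content above 70% | |
| Restriction sites and terminator hairpins are not checked here. Nested | |
| motifs are counted separately: one AGGAGG site also matches AGGAG. | |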
| Args: | |
| dna_sequence: DNA sequence to scan | |
| organism: Target organism (affects motif set) | |
| Returns: | |
| Total count of problematic elements found | |
| """ | |
| dna_upper = dna_sequence.upper() | |
| element_count = 0 | |
| if organism == "ecoli": | |
| # E. coli-specific elements | |
| negative_motifs = { | |
| 'GCTGGTGG': 'Chi site (recombination hotspot)', | |
| 'AGGAGG': 'Strong Shine-Dalgarno (internal RBS)', | |
| 'AGGAG': 'Moderate Shine-Dalgarno', | |
| 'TATAAA': 'Promoter-like sequence', | |
| 'TTGACA': 'Promoter -35 box', | |
| 'TATAAT': 'Promoter -10 box', | |
| 'AAAAAAAA': 'Poly-A (8+)', | |
| 'CCCCCCCC': 'Poly-C (8+)', | |
| 'GGGGGGGG': 'Poly-G (8+) - G-quadruplex risk', | |
| 'TTTTTTTT': 'Poly-T (8+) - terminator', | |
| } | |
| else: | |
| # Eukaryotic elements | |
| negative_motifs = { | |
| 'AATAAA': 'Polyadenylation signal', | |
| 'ATTAAA': 'Alternative polyA signal', | |
| 'GTAAGT': 'Splice donor site', | |
| 'CAGG': 'Splice acceptor site', | |
| 'GGTAAG': 'Strong splice donor', | |
| } | |
| # Count occurrences of each motif | |
| for motif, description in negative_motifs.items(): | |
| count = dna_upper.count(motif) | |
| if count > 0: | |
| element_count += count | |
| print(f" Found {count}x {description}: {motif}") | |
| # Check for G/C homopolymer runs (length >= 6) | |
| import re | |
| homopolymers = re.findall(r'G{6,}|C{6,}', dna_upper) | |
| if homopolymers: | |
| element_count += len(homopolymers) | |
| # Check for complex secondary structures | |
| gc_content = get_GC_content(dna_sequence) | |
| if gc_content > 70: | |
| print(f" Warning: Very high GC content ({gc_content:.1f}%) may cause secondary structures") | |
| element_count += 1 | |
| return element_count</code></pre> | |
| </div> | |
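| <div class="description"> | |
| Listing 17 counts motifs with <code>str.count()</code>, which skips overlapping matches and does not report where a motif occurs. The sketch below shows one way to collect overlapping, position-tagged hits using a zero-width regex lookahead. The function name <code>find_motif_hits</code> and the two-entry motif dictionary are illustrative assumptions, not part of <code>CodonEvaluation.py</code>. | |
| </div> | |

```python
import re
from typing import Dict, List, Tuple


def find_motif_hits(dna_sequence: str,
                    motifs: Dict[str, str]) -> List[Tuple[int, str, str]]:
    """Return (0-based start, motif, description) for every hit, overlaps included."""
    dna_upper = dna_sequence.upper()
    hits = []
    for motif, description in motifs.items():
        # The lookahead (?=...) consumes no characters, so the regex engine
        # can report matches that overlap each other.
        pattern = re.compile(f"(?={re.escape(motif)})")
        for match in pattern.finditer(dna_upper):
            hits.append((match.start(), motif, description))
    return sorted(hits)


# Hypothetical motif subset, for illustration only.
example_motifs = {
    "AGGAGG": "Strong Shine-Dalgarno (internal RBS)",
    "TATAAT": "Promoter -10 box",
}
hits = find_motif_hits("ccAGGAGGAGGttTATAATg", example_motifs)
for position, motif, description in hits:
    print(f"{position:>4}  {motif:<8}  {description}")
```

| <div class="description"> | |
| Reporting positions makes it possible to see which codons would need to be changed to remove a motif; whether overlapping hits should count once or several times is a scoring decision left open here. | |
| </div> | |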
| </div> | |
| <!-- Section 18: Streamlit GUI --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">18.</span> Interactive Web Interface | |
| </div> | |
| <div class="description"> | |
| The Streamlit-based GUI provides a user-friendly interface for sequence optimization, parameter tuning, and result visualization without requiring programming knowledge. | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: streamlit_gui/app.py</div> | |
| Lines 1-100, 200-280 | Main Application | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 18:</span> Streamlit GUI Core | |
| </div> | |
| <pre><code class="language-python">import streamlit as st | |
| import torch | |
| from CodonTransformer.CodonPrediction import predict_dna_sequence | |
| from CodonTransformer.CodonEvaluation import ( | |
|     get_CSI_value, calculate_tAI, get_GC_content | |
| ) | |
| # Configure page | |
| st.set_page_config( | |
|     page_title="ENCOT GUI", | |
|     layout="wide", | |
|     initial_sidebar_state="expanded" | |
| ) | |
| # Initialize session state | |
| if 'model' not in st.session_state: | |
|     st.session_state.model = None | |
| if 'tokenizer' not in st.session_state: | |
|     st.session_state.tokenizer = None | |
| if 'results' not in st.session_state: | |
|     st.session_state.results = None | |
| def main(): | |
|     st.title("ENCOT: Enhanced Codon Optimization Tool") | |
|     st.markdown("Transform protein sequences into optimized DNA for enhanced expression") | |
|     # Sidebar: Model configuration | |
|     with st.sidebar: | |
|         st.header("⚙️ Configuration") | |
|         model_choice = st.selectbox( | |
|             "Model", | |
|             ["saketh11/ColiFormer (89M params)", "Local checkpoint"] | |
|         ) | |
|         organism = st.selectbox( | |
|             "Target Organism", | |
|             ["Escherichia coli general", "Bacillus subtilis", | |
|              "Homo sapiens", "Saccharomyces cerevisiae"] | |
|         ) | |
|         st.subheader("Generation Parameters") | |
|         deterministic = st.checkbox("Deterministic", value=True) | |
|         if not deterministic: | |
|             temperature = st.slider("Temperature", 0.1, 2.0, 1.0, 0.1) | |
|             top_p = st.slider("Top-p (nucleus sampling)", 0.1, 1.0, 0.9, 0.05) | |
|         else: | |
|             temperature = 1.0 | |
|             top_p = 0.95 | |
|         # GC content control | |
|         use_constrained = st.checkbox("Constrained Beam Search", value=False) | |
|         if use_constrained: | |
|             gc_min = st.slider("Min GC%", 30, 70, 45, 1) / 100 | |
|             gc_max = st.slider("Max GC%", 30, 70, 60, 1) / 100 | |
|             beam_size = st.slider("Beam Size", 2, 20, 5, 1) | |
|     # Main area: Input | |
|     st.header("Input Protein Sequence") | |
|     protein_input = st.text_area( | |
|         "Enter protein sequence (FASTA or plain text)", | |
|         height=150, | |
|         placeholder=">my_protein\nMKTAYIAKQRQISFVKSHF..." | |
|     ) | |
|     # Parse FASTA if provided (strip first so leading whitespace is tolerated) | |
|     protein_input = protein_input.strip() | |
|     if protein_input.startswith('>'): | |
|         lines = protein_input.split('\n') | |
|         protein_seq = ''.join(line.strip() for line in lines[1:]) | |
|     else: | |
|         # Remove all whitespace, including tabs and carriage returns | |
|         protein_seq = ''.join(protein_input.split()) | |
|     # Optimization button | |
|     if st.button("Optimize Sequence", type="primary"): | |
|         if not protein_seq: | |
|             st.error("Please enter a protein sequence") | |
|             return | |
|         with st.spinner("Optimizing codon usage..."): | |
|             # Load model | |
|             if st.session_state.model is None: | |
|                 with st.spinner("Loading model (first time only)..."): | |
|                     from CodonTransformer.CodonPrediction import load_model, load_tokenizer | |
|                     st.session_state.model = load_model(model_choice) | |
|                     st.session_state.tokenizer = load_tokenizer() | |
|             # Generate optimized DNA | |
|             result = predict_dna_sequence( | |
|                 protein=protein_seq, | |
|                 organism=organism, | |
|                 model=st.session_state.model, | |
|                 tokenizer=st.session_state.tokenizer, | |
|                 deterministic=deterministic, | |
|                 temperature=temperature, | |
|                 top_p=top_p, | |
|                 use_constrained_search=use_constrained, | |
|                 gc_bounds=(gc_min, gc_max) if use_constrained else None, | |
|                 beam_size=beam_size if use_constrained else 1 | |
|             ) | |
|             st.session_state.results = result | |
|     # Display results (display_results() is not shown in this excerpt) | |
|     if st.session_state.results: | |
|         display_results(st.session_state.results, protein_seq, organism) | |
| if __name__ == "__main__": | |
|     main()</code></pre> | |
| </div> | |
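| <div class="description"> | |
| The input handling in Listing 18 can be factored into a small function that is testable without Streamlit. The sketch below is one possible version: it tolerates surrounding whitespace and Windows line endings, keeps only the first FASTA record, and rejects unexpected residues. The name <code>parse_protein_input</code> and the 20-letter alphabet are assumptions made for illustration; the alphabet actually accepted by the model may differ. | |
| </div> | |

```python
from typing import Tuple

# Assumption: only the 20 standard amino acids are accepted.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_protein_input(raw_text: str) -> Tuple[str, str]:
    """Return (header, sequence); header is '' for plain-text input."""
    lines = [line.strip() for line in raw_text.strip().splitlines() if line.strip()]
    header = ""
    if lines and lines[0].startswith(">"):
        header = lines[0][1:].strip()
        lines = lines[1:]
    sequence_lines = []
    for line in lines:
        if line.startswith(">"):  # keep only the first FASTA record
            break
        sequence_lines.append(line)
    # Remove inner whitespace, normalise case, drop a trailing stop marker.
    sequence = "".join("".join(sequence_lines).split()).upper().rstrip("*")
    invalid = sorted(set(sequence) - VALID_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"Unsupported residue(s): {', '.join(invalid)}")
    return header, sequence


header, sequence = parse_protein_input("  >my_protein\r\nMKTAY\r\nIAKQ*\n")
print(header, sequence)
```

| <div class="description"> | |
| In the GUI, a <code>ValueError</code> raised here could be caught and shown with <code>st.error()</code> before any model call is made. | |
| </div> | |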
| </div> | |
| <!-- Section 19: Benchmark Evaluation --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">19.</span> Benchmarking Framework | |
| </div> | |
| <div class="description"> | |
| Comprehensive evaluation framework comparing ENCOT against baseline methods (uniform sampling, natural sequences, frequency-based optimization) across multiple metrics. | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: benchmark_evaluation.py</div> | |
| Lines 150-250 | Function: run_benchmark_suite() | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 19:</span> Benchmark Pipeline | |
| </div> | |
| <pre><code class="language-python">from typing import Dict, List | |
| def run_benchmark_suite(test_sequences: List[Dict], | |
|                         model, tokenizer, organism: str): | |
|     """ | |
|     Run comprehensive benchmark evaluation. | |
|     Compares: | |
|         1. ENCOT (deterministic) | |
|         2. ENCOT (stochastic, T=1.0) | |
|         3. ENCOT (constrained beam search) | |
|         4. Uniform codon sampling (baseline) | |
|         5. Natural E. coli sequences (reference) | |
|         6. Frequency-based optimization | |
|     Metrics evaluated: | |
|         - CAI (Codon Adaptation Index) | |
|         - tAI (tRNA Adaptation Index) | |
|         - GC content (% and variance) | |
|         - Negative cis-elements | |
|         - Homopolymer runs | |
|         - Sequence diversity (edit distance between replicates) | |
|     Args: | |
|         test_sequences: List of protein sequences | |
|         model: Trained ENCOT model | |
|         tokenizer: Codon tokenizer | |
|         organism: Target organism | |
|     Returns: | |
|         Tuple of (per-sequence results DataFrame, per-method summary DataFrame) | |
|     """ | |
|     import pandas as pd | |
|     from tqdm import tqdm | |
|     results = [] | |
|     for seq_data in tqdm(test_sequences, desc="Benchmarking"): | |
|         protein = seq_data['protein_sequence'] | |
|         seq_id = seq_data['id'] | |
|         # Method 1: ENCOT deterministic | |
|         encot_det = predict_dna_sequence( | |
|             protein=protein, | |
|             organism=organism, | |
|             model=model, | |
|             tokenizer=tokenizer, | |
|             deterministic=True | |
|         ) | |
|         # Method 2: ENCOT stochastic (5 samples) | |
|         encot_stoch = [ | |
|             predict_dna_sequence( | |
|                 protein=protein, | |
|                 organism=organism, | |
|                 model=model, | |
|                 tokenizer=tokenizer, | |
|                 deterministic=False, | |
|                 temperature=1.0 | |
|             ) | |
|             for _ in range(5) | |
|         ] | |
|         # Method 3: ENCOT constrained | |
|         encot_constrained = predict_dna_sequence( | |
|             protein=protein, | |
|             organism=organism, | |
|             model=model, | |
|             tokenizer=tokenizer, | |
|             use_constrained_search=True, | |
|             gc_bounds=(0.45, 0.60), | |
|             beam_size=5 | |
|         ) | |
|         # Method 4: Uniform baseline | |
|         uniform = generate_uniform_codon_sequence(protein) | |
|         # Method 5: Natural sequence (if available) | |
|         natural = seq_data.get('natural_dna', None) | |
|         # Method 6: Frequency-based | |
|         freq_based = generate_frequency_optimized(protein, organism) | |
|         # Evaluate all methods | |
|         methods = { | |
|             'ENCOT_det': encot_det, | |
|             'ENCOT_stoch_mean': encot_stoch[0],  # First of the 5 samples only (not a mean) | |
|             'ENCOT_constrained': encot_constrained, | |
|             'Uniform_baseline': uniform, | |
|             'Natural': natural, | |
|             'Frequency_based': freq_based | |
|         } | |
|         for method_name, dna in methods.items(): | |
|             if dna is None: | |
|                 continue | |
|             # Calculate metrics (cai_weights and tai_weights are not defined | |
|             # in this excerpt; they are assumed to exist at module level) | |
|             cai = get_CSI_value(dna, cai_weights) | |
|             tai = calculate_tAI(dna, tai_weights) | |
|             gc = get_GC_content(dna) | |
|             cis_elements = count_negative_cis_elements(dna) | |
|             gc_var = calculate_gc_variance(dna, window_size=100) | |
|             results.append({ | |
|                 'sequence_id': seq_id, | |
|                 'method': method_name, | |
|                 'CAI': cai, | |
|                 'tAI': tai, | |
|                 'GC_content': gc, | |
|                 'GC_variance': gc_var, | |
|                 'negative_cis': cis_elements, | |
|                 'sequence_length': len(dna) | |
|             }) | |
|     # Convert to DataFrame and compute statistics | |
|     df = pd.DataFrame(results) | |
|     # Group statistics | |
|     summary = df.groupby('method').agg({ | |
|         'CAI': ['mean', 'std'], | |
|         'tAI': ['mean', 'std'], | |
|         'GC_content': ['mean', 'std'], | |
|         'negative_cis': ['mean', 'sum'] | |
|     }) | |
|     print("\n" + "="*60) | |
|     print("BENCHMARK RESULTS") | |
|     print("="*60) | |
|     print(summary) | |
|     return df, summary</code></pre> | |
| </div> | |
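| <div class="description"> | |
| Listing 19 calls <code>generate_uniform_codon_sequence()</code> and <code>calculate_gc_variance()</code> without showing them. The sketch below is one plausible reading of those two helpers, not the code in <code>benchmark_evaluation.py</code>. It assumes the standard genetic code, a randomly chosen stop codon appended to the baseline, and non-overlapping GC windows; the <code>rng</code> and <code>step</code> parameters are additions for illustration. | |
| </div> | |

```python
import random
import statistics
from itertools import product
from typing import Dict, List, Optional

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA: Dict[str, str] = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
CODONS_BY_AA: Dict[str, List[str]] = {}
for codon, aa in CODON_TO_AA.items():
    CODONS_BY_AA.setdefault(aa, []).append(codon)


def generate_uniform_codon_sequence(protein: str,
                                    rng: Optional[random.Random] = None) -> str:
    """Pick each codon uniformly among its synonyms, then append a stop codon."""
    rng = rng or random.Random()
    codons = [rng.choice(CODONS_BY_AA[aa]) for aa in protein.upper()]
    codons.append(rng.choice(CODONS_BY_AA["*"]))  # assumption: baseline ends with a stop
    return "".join(codons)


def calculate_gc_variance(dna: str, window_size: int = 100, step: int = 100) -> float:
    """Population variance of GC% across windows of `window_size` bases."""
    dna = dna.upper()
    # A trailing fragment shorter than one full window is ignored.
    starts = range(0, max(len(dna) - window_size, 0) + 1, step)
    windows = [dna[s:s + window_size] for s in starts]
    gc_percentages = [
        100.0 * sum(base in "GC" for base in w) / len(w) for w in windows if w
    ]
    if len(gc_percentages) < 2:
        return 0.0  # a single window has no spread
    return statistics.pvariance(gc_percentages)


baseline = generate_uniform_codon_sequence("MKTAYIAK", rng=random.Random(0))
back_translated = "".join(
    CODON_TO_AA[baseline[i:i + 3]] for i in range(0, len(baseline), 3)
)
print(baseline, back_translated)
print(calculate_gc_variance("GC" * 50 + "AT" * 50))
```

| <div class="description"> | |
| Passing a seeded <code>random.Random</code> makes the baseline reproducible across benchmark runs; with the default <code>rng=None</code> each call draws a fresh sequence. | |
| </div> | |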
| <table> | |
| <thead> | |
| <tr> | |
| <th>Method</th> | |
| <th>CAI ↑</th> | |
| <th>tAI ↑</th> | |
| <th>GC% Target</th> | |
| <th>Cis Elements ↓</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td><strong>ENCOT (ALM)</strong></td> | |
| <td><strong>0.87 ± 0.04</strong></td> | |
| <td><strong>0.52 ± 0.06</strong></td> | |
| <td><strong>52.1 ± 0.8%</strong></td> | |
| <td><strong>1.2 ± 0.9</strong></td> | |
| </tr> | |
| <tr> | |
| <td>ENCOT (constrained)</td> | |
| <td>0.84 ± 0.05</td> | |
| <td>0.50 ± 0.07</td> | |
| <td>52.5 ± 0.3%</td> | |
| <td>0.8 ± 0.7</td> | |
| </tr> | |
| <tr> | |
| <td>Frequency-based</td> | |
| <td>0.79 ± 0.08</td> | |
| <td>0.45 ± 0.09</td> | |
| <td>51.8 ± 3.2%</td> | |
| <td>3.5 ± 2.1</td> | |
| </tr> | |
| <tr> | |
| <td>Uniform baseline</td> | |
| <td>0.62 ± 0.11</td> | |
| <td>0.38 ± 0.10</td> | |
| <td>50.2 ± 5.8%</td> | |
| <td>8.3 ± 3.4</td> | |
| </tr> | |
| <tr> | |
| <td>Natural E. coli</td> | |
| <td>0.75 ± 0.12</td> | |
| <td>0.48 ± 0.11</td> | |
| <td>51.2 ± 4.1%</td> | |
| <td>2.1 ± 1.5</td> | |
| </tr> | |
| </tbody> | |
| </table> | |
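| <div class="description"> | |
| The docstring of Listing 19 names sequence diversity between stochastic replicates as a metric, but the excerpt evaluates only the first replicate. The sketch below shows one way such a diversity score could be computed. It swaps in a codon-level Hamming distance for the edit distance mentioned in the docstring, which is reasonable only because all replicates encode the same protein and therefore have equal length. Both function names are hypothetical. | |
| </div> | |

```python
from itertools import combinations
from typing import List


def codon_hamming_distance(dna_a: str, dna_b: str) -> int:
    """Number of codon positions at which two equal-length sequences differ."""
    if len(dna_a) != len(dna_b) or len(dna_a) % 3 != 0:
        raise ValueError("Sequences must have equal length, divisible by 3")
    return sum(
        dna_a[i:i + 3] != dna_b[i:i + 3] for i in range(0, len(dna_a), 3)
    )


def mean_pairwise_diversity(replicates: List[str]) -> float:
    """Fraction of differing codon positions, averaged over all replicate pairs."""
    if len(replicates) < 2:
        return 0.0
    n_codons = len(replicates[0]) // 3
    if n_codons == 0:
        return 0.0
    distances = [
        codon_hamming_distance(a, b) / n_codons
        for a, b in combinations(replicates, 2)
    ]
    return sum(distances) / len(distances)


# Three toy replicates encoding Met-Lys-Ala with different synonymous codons.
replicates = ["ATGAAAGCT", "ATGAAGGCT", "ATGAAAGCC"]
diversity = mean_pairwise_diversity(replicates)
print(f"Mean pairwise codon diversity: {diversity:.3f}")
```

| <div class="description"> | |
| A value of 0 means every replicate is identical; deterministic decoding would score 0 by construction, so the metric is only informative for the stochastic setting. | |
| </div> | |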
| </div> | |
| <!-- Section 20: Data Preparation --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">20.</span> Training Data Pipeline | |
| </div> | |
| <div class="description"> | |
| The data preparation pipeline processes E. coli genome sequences, validates them, filters by quality metrics, and creates training/validation splits for model fine-tuning. | |
| </div> | |
| <div class="file-ref"> | |
| <div class="file-path">File: prepare_ecoli_data.py</div> | |
| Lines 50-200 | Data Processing Functions | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 20:</span> Data Preparation Pipeline | |
| </div> | |
| <pre><code class="language-python">def prepare_training_data(genome_file: str, output_dir: str): | |
|     """ | |
|     Prepare E. coli training data from genome sequences. | |
|     Pipeline: | |
|         1. Load genome sequences (GenBank or FASTA) | |
|         2. Extract coding sequences (CDSs) | |
|         3. Validate sequences (start codon, stop codon, length) | |
|         4. Filter by quality metrics: | |
|             - CAI > 0.5 | |
|             - Length: 300-3000 bp | |
|             - GC content: 40-65% | |
|             - No frameshifts | |
|             - No ambiguous bases | |
|         5. Split into training/validation/test sets (80/10/10) | |
|         6. Create codon-tokenized format | |
|         7. Save as JSON with metadata | |
|     Args: | |
|         genome_file: Path to GenBank/FASTA genome file | |
|         output_dir: Directory for processed data | |
|     Returns: | |
|         Dictionary with dataset statistics | |
|     """ | |
|     import json | |
|     import os | |
|     import random | |
|     import numpy as np | |
|     from Bio import SeqIO | |
|     print("Loading genome sequences...") | |
|     sequences = [] | |
|     for record in SeqIO.parse(genome_file, "genbank"): | |
|         for feature in record.features: | |
|             if feature.type == "CDS": | |
|                 # Extract DNA and protein sequence | |
|                 dna = str(feature.location.extract(record.seq)) | |
|                 try: | |
|                     protein = str(feature.qualifiers['translation'][0]) | |
|                 except (KeyError, IndexError): | |
|                     # No annotated translation (e.g. pseudogene), skip | |
|                     continue | |
|                 # Validate sequence | |
|                 if not validate_sequence(dna, protein): | |
|                     continue | |
|                 # Calculate quality metrics | |
|                 cai = get_CSI_value(dna, ecoli_cai_weights) | |
|                 gc = get_GC_content(dna) | |
|                 # Filter by quality | |
|                 if cai < 0.5:  # Low CAI, skip | |
|                     continue | |
|                 if len(dna) < 300 or len(dna) > 3000:  # Too short/long | |
|                     continue | |
|                 if gc < 40 or gc > 65:  # Extreme GC content | |
|                     continue | |
|                 # Get gene metadata | |
|                 gene_id = feature.qualifiers.get('locus_tag', ['unknown'])[0] | |
|                 gene_name = feature.qualifiers.get('gene', [''])[0] | |
|                 product = feature.qualifiers.get('product', [''])[0] | |
|                 sequences.append({ | |
|                     'id': gene_id, | |
|                     'gene_name': gene_name, | |
|                     'product': product, | |
|                     'protein_sequence': protein, | |
|                     'dna_sequence': dna, | |
|                     'length_bp': len(dna), | |
|                     'length_aa': len(protein), | |
|                     'CAI': float(cai), | |
|                     'GC_content': float(gc) | |
|                 }) | |
|     print(f"Extracted {len(sequences)} valid CDSs") | |
|     # Split into train/val/test | |
|     random.shuffle(sequences) | |
|     n_train = int(0.8 * len(sequences)) | |
|     n_val = int(0.1 * len(sequences)) | |
|     train_data = sequences[:n_train] | |
|     val_data = sequences[n_train:n_train + n_val] | |
|     test_data = sequences[n_train + n_val:] | |
|     # Save datasets (create the output directory if it does not exist yet) | |
|     os.makedirs(output_dir, exist_ok=True) | |
|     with open(f"{output_dir}/train_set.json", 'w') as f: | |
|         json.dump(train_data, f, indent=2) | |
|     with open(f"{output_dir}/val_set.json", 'w') as f: | |
|         json.dump(val_data, f, indent=2) | |
|     with open(f"{output_dir}/test_set.json", 'w') as f: | |
|         json.dump(test_data, f, indent=2) | |
|     # Statistics | |
|     stats = { | |
|         'total_sequences': len(sequences), | |
|         'train_size': len(train_data), | |
|         'val_size': len(val_data), | |
|         'test_size': len(test_data), | |
|         'mean_cai': np.mean([s['CAI'] for s in sequences]), | |
|         'mean_gc': np.mean([s['GC_content'] for s in sequences]), | |
|         'mean_length': np.mean([s['length_bp'] for s in sequences]) | |
|     } | |
|     print("\nDataset Statistics:") | |
|     print(json.dumps(stats, indent=2)) | |
|     return stats | |
| def validate_sequence(dna: str, protein: str) -> bool: | |
|     """Validate DNA-protein pair integrity""" | |
|     from Bio.Seq import Seq | |
|     dna = dna.upper() | |
|     # Check for ambiguous bases first, so translation only sees A/C/G/T | |
|     if any(base not in 'ATGC' for base in dna): | |
|         return False | |
|     # Check start codon | |
|     if not dna.startswith('ATG'): | |
|         return False | |
|     # Check stop codon | |
|     stop_codons = ('TAA', 'TAG', 'TGA') | |
|     if not dna.endswith(stop_codons): | |
|         return False | |
|     # Check length match | |
|     if len(dna) != (len(protein) + 1) * 3:  # +1 for stop codon | |
|         return False | |
|     # Verify translation (an internal stop truncates it and fails the comparison) | |
|     translated = str(Seq(dna).translate(to_stop=True)) | |
|     if translated != protein: | |
|         return False | |
|     return True</code></pre> | |
| </div> | |
| <div class="handwritten-note"> | |
| Quality filtering ensures the model learns from well-adapted, biologically meaningful sequences rather than noisy genome data. | |
| </div> | |
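| <div class="description"> | |
| Listing 20 shuffles with the global <code>random</code> state, so the train/validation/test membership changes on every run. The sketch below shows a seeded variant of the same 80/10/10 slicing. The function name <code>split_dataset</code> and the default seed of 42 are assumptions, not values taken from <code>prepare_ecoli_data.py</code>. | |
| </div> | |

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(records: Sequence,
                  fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
                  seed: int = 42) -> Tuple[List, List, List]:
    """Shuffle a copy of `records` with a fixed seed and slice it into three parts."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)  # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)  # private RNG, global state untouched
    n_train = int(fractions[0] * len(shuffled))
    n_val = int(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder absorbs rounding leftovers
    return train, val, test


train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))
```

| <div class="description"> | |
| Because the test slice takes whatever remains after the first two cuts, no record is dropped when the dataset size is not a multiple of ten. | |
| </div> | |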
| </div> | |
| <!-- Section 21: Architecture Overview (was Section 10) --> | |
| <div class="section"> | |
| <div class="section-title"> | |
| <span class="section-number">21.</span> System Architecture | |
| </div> | |
| <div class="description"> | |
| The ENCOT system is organized into modular components that handle different aspects of the | |
| optimization pipeline. This architecture promotes code reusability and maintainability. | |
| </div> | |
| <div class="code-container"> | |
| <div class="code-header"> | |
| <span class="listing-number">Listing 21:</span> Project Structure | |
| </div> | |
| <pre><code class="language-plaintext">ENCOT/ | |
| │ | |
| ├── CodonTransformer/               # Core library modules | |
| │   ├── __init__.py | |
| │   ├── CodonPrediction.py          # Model loading & inference [1373 lines] | |
| │   ├── CodonEvaluation.py          # Metrics computation [584 lines] | |
| │   ├── CodonData.py                # Data preprocessing [683 lines] | |
| │   ├── CodonUtils.py               # Constants & utilities [872 lines] | |
| │   ├── CodonJupyter.py             # Notebook helpers | |
| │   └── CodonPostProcessing.py      # DNA-Chisel integration | |
| │ | |
| ├── scripts/                        # Command-line interfaces | |
| │   ├── train.py                    # Training wrapper | |
| │   ├── optimize_sequence.py        # Sequence optimization CLI | |
| │   ├── run_benchmarks.py           # Benchmark evaluation | |
| │   └── preprocess_data.py          # Data preparation | |
| │ | |
| ├── configs/                        # Training configurations | |
| │   ├── train_ecoli_alm.yaml        # Main ALM config | |
| │   └── train_ecoli_quick.yaml      # Quick test config | |
| │ | |
| ├── streamlit_gui/                  # Web interface | |
| │   ├── app.py                      # Main Streamlit app [1457 lines] | |
| │   ├── demo.py                     # Demo script | |
| │   ├── run_gui.py                  # Launcher | |
| │   └── test_gui.py                 # Test suite | |
| │ | |
| ├── data/                           # Datasets | |
| │   ├── finetune_set.json           # Training data (4,300 sequences) | |
| │   ├── test_set.json               # Test data (100 sequences) | |
| │   └── ecoli_processed_genes.csv   # Reference sequences | |
| │ | |
| ├── tests/                          # Test suite | |
| │   ├── test_CodonUtils.py | |
| │   ├── test_CodonData.py | |
| │   ├── test_CodonPrediction.py | |
| │   └── test_CodonEvaluation.py | |
| │ | |
| ├── finetune.py                     # Main training script [734 lines] | |
| ├── benchmark_evaluation.py         # Evaluation script [696 lines] | |
| ├── prepare_ecoli_data.py           # Data validation | |
| ├── setup.py                        # Package installation | |
| ├── pyproject.toml                  # Project metadata | |
| ├── requirements.txt                # Dependencies | |
| └── README.md                       # Documentation | |
| Key Components (Lines of Code): | |
| ─────────────────────────────────────────── | |
| CodonPrediction.py          1,373 lines   Inference engine | |
| CodonEvaluation.py            584 lines   Metrics | |
| CodonData.py                  683 lines   Data handling | |
| CodonUtils.py                 872 lines   Utilities | |
| finetune.py                   734 lines   Training | |
| benchmark_evaluation.py       696 lines   Evaluation | |
| streamlit_gui/app.py        1,457 lines   Web GUI | |
| ─────────────────────────────────────────── | |
| TOTAL                       6,399 lines | |
| Core Innovations: | |
| ─────────────────────────────────────────── | |
| Augmented-Lagrangian Method (ALM) for GC control | |
|   • Adaptive penalty coefficients | |
|   • Curriculum learning | |
|   • Self-tuning multipliers | |
| Constrained beam search with GC bounds | |
|   • Real-time GC monitoring during generation | |
|   • Pruning of non-compliant candidates | |
| Multi-metric evaluation framework | |
|   • CAI, tAI, GC content | |
|   • Negative cis-elements detection | |
|   • Homopolymer analysis</code></pre> | |
| </div> | |
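| <div class="description"> | |
| The "pruning of non-compliant candidates" item above can be illustrated with a feasibility test: a partial sequence is worth keeping only if some choice of remaining codons can still bring the final GC fraction inside the bounds. The sketch below is a generic illustration under the standard genetic code, not ENCOT's beam search; <code>can_still_meet_gc_bounds</code> and its arguments are hypothetical names. | |
| </div> | |

```python
from itertools import product
from typing import Dict, List, Tuple

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS_BY_AA: Dict[str, List[str]] = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS):
    CODONS_BY_AA.setdefault(aa, []).append(codon)


def gc_count(seq: str) -> int:
    """Number of G or C bases in an upper-case sequence."""
    return sum(base in "GC" for base in seq)


def gc_range_for(protein: str) -> Tuple[int, int]:
    """Smallest and largest G+C count any synonymous encoding of `protein` can have."""
    lowest = sum(min(gc_count(c) for c in CODONS_BY_AA[aa]) for aa in protein)
    highest = sum(max(gc_count(c) for c in CODONS_BY_AA[aa]) for aa in protein)
    return lowest, highest


def can_still_meet_gc_bounds(partial_dna: str, remaining_protein: str,
                             gc_min: float, gc_max: float) -> bool:
    """True if some completion of `partial_dna` ends with GC in [gc_min, gc_max]."""
    total_bases = len(partial_dna) + 3 * len(remaining_protein)
    if total_bases == 0:
        return True
    current = gc_count(partial_dna.upper())
    lowest, highest = gc_range_for(remaining_protein.upper())
    best_case_high = (current + highest) / total_bases
    best_case_low = (current + lowest) / total_bases
    # Prune when even the most GC-rich completion is too low,
    # or even the most GC-poor completion is too high.
    return best_case_high >= gc_min and best_case_low <= gc_max


print(can_still_meet_gc_bounds("ATG", "AK", 0.45, 0.60))
print(can_still_meet_gc_bounds("ATGAAA", "K", 0.45, 0.60))
```

| <div class="description"> | |
| A check of this kind prunes only candidates that cannot recover, so it never discards a beam that could still satisfy the bounds. | |
| </div> | |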
| </div> | |
| <!-- Footer --> | |
| <script> | |
| // Initialize syntax highlighting | |
| hljs.highlightAll(); | |
| </script> | |
| </body> | |
| </html> |