| <!DOCTYPE html> | |
| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8"> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> | |
| <title>ENCOT - Key Code Sections</title> | |
| <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/styles/github-dark.min.css"> | |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/highlight.min.js"></script> | |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/languages/python.min.js"></script> | |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/languages/yaml.min.js"></script> | |
| <style> | |
| body { | |
| font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; | |
| max-width: 1200px; | |
| margin: 0 auto; | |
| padding: 20px; | |
| background: #0d1117; | |
| color: #c9d1d9; | |
| } | |
| .header { | |
| text-align: center; | |
| padding: 40px 0; | |
| background: linear-gradient(135deg, #1f6feb 0%, #58a6ff 100%); | |
| border-radius: 10px; | |
| margin-bottom: 30px; | |
| } | |
| .header h1 { | |
| margin: 0; | |
| color: white; | |
| font-size: 3em; | |
| text-shadow: 2px 2px 4px rgba(0,0,0,0.3); | |
| } | |
| .header p { | |
| color: rgba(255,255,255,0.9); | |
| font-size: 1.2em; | |
| margin: 10px 0 0 0; | |
| } | |
| .section { | |
| background: #161b22; | |
| border: 1px solid #30363d; | |
| border-radius: 8px; | |
| margin: 30px 0; | |
| padding: 25px; | |
| page-break-inside: avoid; | |
| } | |
| .section-title { | |
| color: #58a6ff; | |
| font-size: 1.8em; | |
| margin: 0 0 10px 0; | |
| padding-bottom: 10px; | |
| border-bottom: 2px solid #21262d; | |
| } | |
| .section-number { | |
| display: inline-block; | |
| background: #1f6feb; | |
| color: white; | |
| padding: 5px 15px; | |
| border-radius: 20px; | |
| font-size: 0.8em; | |
| margin-right: 10px; | |
| } | |
| .description { | |
| color: #8b949e; | |
| margin: 15px 0; | |
| font-size: 1.1em; | |
| line-height: 1.6; | |
| } | |
| .file-info { | |
| background: #0d1117; | |
| padding: 10px 15px; | |
| border-radius: 5px; | |
| margin: 15px 0; | |
| border-left: 4px solid #1f6feb; | |
| } | |
| .file-path { | |
| color: #58a6ff; | |
| font-family: 'Consolas', 'Monaco', monospace; | |
| } | |
| .line-range { | |
| color: #8b949e; | |
| font-size: 0.9em; | |
| } | |
| .highlight-note { | |
| background: #ffd33d; | |
| color: #1f2328; | |
| padding: 3px 8px; | |
| border-radius: 3px; | |
| font-weight: bold; | |
| font-size: 0.9em; | |
| } | |
| pre { | |
| margin: 15px 0; | |
| border-radius: 6px; | |
| overflow-x: auto; | |
| } | |
| pre code { | |
| font-family: 'Consolas', 'Monaco', 'Courier New', monospace; | |
| font-size: 14px; | |
| line-height: 1.5; | |
| } | |
| .key-feature { | |
| background: #1f6feb; | |
| color: white; | |
| padding: 15px; | |
| border-radius: 5px; | |
| margin: 15px 0; | |
| } | |
| .footer { | |
| text-align: center; | |
| margin-top: 50px; | |
| padding: 20px; | |
| color: #8b949e; | |
| border-top: 1px solid #21262d; | |
| } | |
| @media print { | |
| body { | |
| background: white; | |
| color: black; | |
| } | |
| .section { | |
| border: 1px solid #ccc; | |
| page-break-inside: avoid; | |
| } | |
| } | |
| </style> | |
| </head> | |
| <body> | |
| <div class="header"> | |
| <h1>🧬 ENCOT</h1> | |
| <p>Enhanced Codon Optimization Tool - Key Code Sections</p> | |
| </div> | |
| <!-- Section 1: ALM Training Class --> | |
| <div class="section"> | |
| <h2 class="section-title"> | |
| <span class="section-number">1</span> | |
| ALM Training Harness - Core Innovation | |
| </h2> | |
| <div class="description"> | |
| The PyTorch Lightning training harness implementing the Augmented-Lagrangian Method (ALM) | |
| for precise GC content control during fine-tuning. | |
| </div> | |
| <div class="file-info"> | |
| <div class="file-path">📄 finetune.py</div> | |
| <div class="line-range">Lines 73-148 | Class Definition & Initialization</div> | |
| </div> | |
| <div class="key-feature"> | |
| <strong>🎯 Highlight:</strong> Initialization of the ALM parameters, including the Lagrangian multiplier, | |
| adaptive penalty coefficient, and curriculum-learning setup | |
| </div> | |
| <pre><code class="language-python">class plTrainHarness(pl.LightningModule): | |
| """ | |
| PyTorch Lightning training harness for ENCOT with Augmented-Lagrangian Method (ALM) GC control. | |
| This class implements the training loop for fine-tuning CodonTransformer on E. coli sequences | |
| with precise GC content control using an Augmented-Lagrangian Method. The ALM approach allows | |
| the model to learn codon preferences while maintaining GC content within a target range (e.g., 52%). | |
| Key features: | |
| - Masked language modeling (MLM) loss for codon prediction | |
| - ALM-based GC content constraint enforcement | |
| - Curriculum learning: warm-up epochs before enforcing GC constraints | |
| - Adaptive penalty coefficient (rho) adjustment based on constraint violation progress | |
| The ALM method minimizes: L = L_MLM + λ·(GC - μ) + (ρ/2)·(GC - μ)² | |
| where λ is the Lagrangian multiplier, ρ is the penalty coefficient, and μ is the target GC content. | |
| """ | |
| def __init__(self, model, learning_rate, warmup_fraction, gc_penalty_weight, tokenizer, | |
| gc_target=0.52, use_lagrangian=False, lagrangian_rho=10.0, curriculum_epochs=3, | |
| alm_tolerance=1e-5, alm_dual_tolerance=1e-5, alm_penalty_update_factor=10.0, | |
| alm_initial_penalty_factor=20.0, alm_tolerance_update_factor=0.1, | |
| alm_rel_penalty_increase_threshold=0.1, alm_max_penalty=1e6, alm_min_penalty=1e-6): | |
| super().__init__() | |
| self.model = model | |
| self.learning_rate = learning_rate | |
| self.warmup_fraction = warmup_fraction | |
| self.gc_penalty_weight = gc_penalty_weight | |
| self.tokenizer = tokenizer | |
| # Augmented-Lagrangian GC Control parameters | |
| self.gc_target = gc_target | |
| self.use_lagrangian = use_lagrangian | |
| self.lagrangian_rho = lagrangian_rho | |
| self.curriculum_epochs = curriculum_epochs | |
| # Enhanced ALM parameters (inspired by alpaqa research) | |
| self.alm_tolerance = alm_tolerance | |
| self.alm_dual_tolerance = alm_dual_tolerance | |
| self.alm_penalty_update_factor = alm_penalty_update_factor | |
| self.alm_initial_penalty_factor = alm_initial_penalty_factor | |
| self.alm_tolerance_update_factor = alm_tolerance_update_factor | |
| self.alm_rel_penalty_increase_threshold = alm_rel_penalty_increase_threshold | |
| self.alm_max_penalty = alm_max_penalty | |
| self.alm_min_penalty = alm_min_penalty | |
| # Initialize Lagrangian multiplier as buffer (persists across checkpoints) | |
| self.register_buffer("lambda_gc", torch.tensor(0.0)) | |
| # Adaptive penalty coefficient (rho) - starts as parameter, becomes adaptive | |
| self.register_buffer("rho_adaptive", torch.tensor(self.lagrangian_rho)) | |
| # Step counter for periodic lambda updates | |
| self.register_buffer("step_counter", torch.tensor(0)) | |
| # ALM convergence tracking | |
| self.register_buffer("previous_constraint_violation", torch.tensor(float('inf'))) | |
| </code></pre> | |
| </div> | |
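The docstring's objective is easy to sanity-check numerically. A standalone sketch of the penalty term (illustrative names only, not code from finetune.py):

```python
def augmented_lagrangian_term(gc_content, gc_target, lam, rho):
    """lam * (GC - mu) + (rho / 2) * (GC - mu)^2, as in the docstring above."""
    c = gc_content - gc_target  # constraint violation (GC - mu)
    return lam * c + (rho / 2.0) * c ** 2

# GC three points above the 52% target, with lambda = 0.5 and rho = 10:
term = augmented_lagrangian_term(0.55, 0.52, lam=0.5, rho=10.0)
# 0.5 * 0.03 + 5 * 0.03**2 = 0.0195, up to float rounding
```

Note that the linear term can be negative (it rewards moving back toward the target), while the quadratic term always penalizes deviation in either direction.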
| <!-- Section 2: Training Step with ALM Loss --> | |
| <div class="section"> | |
| <h2 class="section-title"> | |
| <span class="section-number">2</span> | |
| Training Step - ALM Loss Calculation | |
| </h2> | |
| <div class="description"> | |
| The training step that combines MLM loss with Lagrangian-based GC constraint enforcement. | |
| </div> | |
| <div class="file-info"> | |
| <div class="file-path">📄 finetune.py</div> | |
| <div class="line-range">Lines 150-230 | training_step method</div> | |
| </div> | |
| <div class="key-feature"> | |
| <strong>🎯 Highlight:</strong> Computation of the gc_constraint violation and the augmented-Lagrangian loss term with adaptive penalties | |
| </div> | |
| <pre><code class="language-python"> def training_step(self, batch, batch_idx): | |
| outputs = self.model(**batch) | |
| mlm_loss = outputs.loss | |
| # Enhanced Lagrangian-based GC penalty | |
| if self.use_lagrangian and self.current_epoch >= self.curriculum_epochs: | |
| # Compute GC content from logits | |
| logits = outputs.logits | |
| predicted_tokens = torch.argmax(logits, dim=-1) | |
| # Calculate GC content per sequence | |
| gc_content_batch = [] | |
| for seq_tokens in predicted_tokens: | |
| valid_tokens = seq_tokens[seq_tokens >= 26] | |
| if len(valid_tokens) == 0: | |
| gc_content_batch.append(self.gc_target) | |
| continue | |
| gc_counts = sum(1 for token in valid_tokens if token.item() in G_indices + C_indices)  # G_indices / C_indices: module-level lists of G/C codon token ids | |
| gc_content = gc_counts / len(valid_tokens) | |
| gc_content_batch.append(gc_content) | |
| gc_content_mean = sum(gc_content_batch) / len(gc_content_batch) | |
| # Compute GC constraint violation | |
| gc_constraint = gc_content_mean - self.gc_target | |
| # Augmented Lagrangian loss term | |
| lagrangian_loss = ( | |
| self.lambda_gc * gc_constraint + | |
| (self.rho_adaptive / 2) * (gc_constraint ** 2) | |
| ) | |
| total_loss = mlm_loss + lagrangian_loss | |
| # Log metrics | |
| self.log("train/mlm_loss", mlm_loss, prog_bar=True) | |
| self.log("train/gc_constraint", gc_constraint, prog_bar=True) | |
| self.log("train/lagrangian_loss", lagrangian_loss, prog_bar=False) | |
| self.log("train/lambda_gc", self.lambda_gc, prog_bar=False) | |
| self.log("train/rho", self.rho_adaptive, prog_bar=False) | |
| self.log("train/gc_content", gc_content_mean, prog_bar=True) | |
| # Update Lagrangian multiplier periodically | |
| self.step_counter += 1 | |
| if self.step_counter % 20 == 0: | |
| self._update_alm_parameters(gc_constraint) | |
| else: | |
| total_loss = mlm_loss | |
| self.log("train/mlm_loss", mlm_loss, prog_bar=True) | |
| self.log("train/total_loss", total_loss, prog_bar=True) | |
| return total_loss | |
| </code></pre> | |
| </div> | |
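The per-sequence GC computation in the loop above can be sketched with plain Python lists. The token-id threshold of 26 mirrors the excerpt; the id set below is a hypothetical stand-in for the real G_indices / C_indices:

```python
def mean_gc_fraction(batch_token_ids, gc_ids, min_codon_id=26, fallback=0.52):
    """Fraction of codon tokens (id >= min_codon_id) whose id is in gc_ids,
    averaged over the batch; empty sequences fall back to the target value."""
    fracs = []
    for seq in batch_token_ids:
        valid = [t for t in seq if t >= min_codon_id]
        if not valid:
            fracs.append(fallback)
            continue
        fracs.append(sum(1 for t in valid if t in gc_ids) / len(valid))
    return sum(fracs) / len(fracs)

# Illustrative ids: pretend 30 and 31 are GC-counting codon tokens.
batch = [[2, 30, 31, 40], [5, 40, 40, 40]]
mean_gc = mean_gc_fraction(batch, gc_ids={30, 31})
# seq1 -> 2/3, seq2 -> 0/3, mean -> 1/3
```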
| <!-- Section 3: Adaptive Penalty Update --> | |
| <div class="section"> | |
| <h2 class="section-title"> | |
| <span class="section-number">3</span> | |
| Adaptive ALM Parameter Updates | |
| </h2> | |
| <div class="description"> | |
| Self-tuning mechanism that adjusts Lagrangian multipliers and penalty coefficients based on constraint violation progress. | |
| </div> | |
| <div class="file-info"> | |
| <div class="file-path">📄 finetune.py</div> | |
| <div class="line-range">Lines 260-320 | _update_alm_parameters method</div> | |
| </div> | |
| <div class="key-feature"> | |
| <strong>🎯 Highlight:</strong> Adaptive penalty adjustment logic: the penalty coefficient is increased when constraint violations stop improving | |
| </div> | |
| <pre><code class="language-python"> def _update_alm_parameters(self, gc_constraint): | |
| """ | |
| Update Lagrangian multiplier and penalty coefficient according to ALM rules. | |
| This implements the adaptive penalty update strategy: | |
| - If constraint violation is decreasing sufficiently, update lambda and keep rho | |
| - If constraint violation is not improving, increase rho (penalty coefficient) | |
| """ | |
| constraint_violation = abs(float(gc_constraint))  # float() accepts both a tensor and the plain float computed in training_step | |
| # Check if we're making sufficient progress | |
| relative_improvement = ( | |
| (self.previous_constraint_violation - constraint_violation) / | |
| max(self.previous_constraint_violation, 1e-8) | |
| ) | |
| if constraint_violation <= self.alm_tolerance: | |
| # Constraint satisfied - update lambda, optionally reduce rho | |
| self.lambda_gc = self.lambda_gc + self.rho_adaptive * gc_constraint | |
| # Could reduce rho here if desired, but keeping it stable works well | |
| elif relative_improvement < self.alm_rel_penalty_increase_threshold: | |
| # Not making enough progress - increase penalty | |
| self.rho_adaptive = torch.clamp( | |
| self.rho_adaptive * self.alm_penalty_update_factor, | |
| min=self.alm_min_penalty, | |
| max=self.alm_max_penalty | |
| ) | |
| # Also update lambda | |
| self.lambda_gc = self.lambda_gc + self.rho_adaptive * gc_constraint | |
| else: | |
| # Making good progress - just update lambda | |
| self.lambda_gc = self.lambda_gc + self.rho_adaptive * gc_constraint | |
| # Update tracking | |
| self.previous_constraint_violation = torch.tensor(constraint_violation) | |
| </code></pre> | |
| </div> | |
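The three update branches can be exercised in isolation. A simplified, float-only sketch of the same rules (defaults mirror the excerpt; this is not the repository's code):

```python
def update_alm(lam, rho, constraint, prev_violation, tol=1e-5,
               improve_thresh=0.1, rho_factor=10.0, rho_max=1e6):
    """One ALM update: escalate rho when the violation stalls,
    then take a dual ascent step on lambda."""
    violation = abs(constraint)
    rel_improvement = (prev_violation - violation) / max(prev_violation, 1e-8)
    if violation > tol and rel_improvement < improve_thresh:
        rho = min(rho * rho_factor, rho_max)  # not improving: raise the penalty
    lam = lam + rho * constraint              # dual ascent step (all branches)
    return lam, rho, violation

# A violation stuck at 0.03 triggers a tenfold penalty escalation.
lam, rho, prev = 0.0, 10.0, 0.03
lam, rho, prev = update_alm(lam, rho, 0.03, prev)
```

The key behavior to notice: lambda accumulates rho-scaled violations over time (dual ascent), while rho only grows when progress stalls, which keeps the constraint pressure proportionate.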
| <!-- Section 4: Main Prediction Function --> | |
| <div class="section"> | |
| <h2 class="section-title"> | |
| <span class="section-number">4</span> | |
| DNA Sequence Prediction Function | |
| </h2> | |
| <div class="description"> | |
| The main inference function that optimizes protein sequences to DNA with support for constrained beam search and GC content bounds. | |
| </div> | |
| <div class="file-info"> | |
| <div class="file-path">📄 CodonTransformer/CodonPrediction.py</div> | |
| <div class="line-range">Lines 38-120 | predict_dna_sequence function signature</div> | |
| </div> | |
| <div class="key-feature"> | |
| <strong>🎯 Highlight:</strong> Function parameters including use_constrained_search and gc_bounds | |
| </div> | |
| <pre><code class="language-python">def predict_dna_sequence( | |
| protein: str, | |
| organism: Union[int, str], | |
| device: torch.device, | |
| tokenizer: Union[str, PreTrainedTokenizerFast] = None, | |
| model: Union[str, torch.nn.Module] = None, | |
| attention_type: str = "original_full", | |
| deterministic: bool = True, | |
| temperature: float = 0.2, | |
| top_p: float = 0.95, | |
| num_sequences: int = 1, | |
| match_protein: bool = False, | |
| use_constrained_search: bool = False, | |
| gc_bounds: Tuple[float, float] = (0.30, 0.70), | |
| beam_size: int = 5, | |
| length_penalty: float = 1.0, | |
| diversity_penalty: float = 0.0, | |
| ) -> Union[DNASequencePrediction, List[DNASequencePrediction]]: | |
| """ | |
| Predict the DNA sequence(s) for a given protein using the ENCOT model. | |
| This function takes a protein sequence and an organism (as ID or name) as input | |
| and returns the predicted DNA sequence(s) using the ENCOT model. It can use | |
| either provided tokenizer and model objects or load them from specified paths. | |
| Args: | |
| protein (str): The input protein sequence for which to predict the DNA sequence. | |
| organism (Union[int, str]): Either the ID of the organism or its name (e.g., | |
| "Escherichia coli general"). | |
| device (torch.device): The device (CPU or GPU) to run the model on. | |
| use_constrained_search (bool, optional): Enable constrained beam search with GC bounds. | |
| gc_bounds (Tuple[float, float], optional): GC content bounds (min, max) for | |
| constrained search. Defaults to (0.30, 0.70). | |
| beam_size (int, optional): Beam size for beam search. Defaults to 5. | |
| Returns: | |
| Union[DNASequencePrediction, List[DNASequencePrediction]]: Predicted DNA sequence(s) | |
| with associated metrics. | |
| """ | |
| </code></pre> | |
| </div> | |
| <!-- Section 5: Evaluation Metrics --> | |
| <div class="section"> | |
| <h2 class="section-title"> | |
| <span class="section-number">5</span> | |
| Evaluation Metrics - CAI & tAI | |
| </h2> | |
| <div class="description"> | |
| Functions for calculating Codon Adaptation Index (CAI) and tRNA Adaptation Index (tAI), | |
| key metrics for evaluating codon optimization quality. | |
| </div> | |
| <div class="file-info"> | |
| <div class="file-path">📄 CodonTransformer/CodonEvaluation.py</div> | |
| <div class="line-range">Lines 23-50, 370-420 | Metrics functions</div> | |
| </div> | |
| <div class="key-feature"> | |
| <strong>🎯 Highlight:</strong> CAI and tAI calculation implementations | |
| </div> | |
| <pre><code class="language-python">def get_CSI_weights(sequences: List[str]) -> Dict[str, float]: | |
| """ | |
| Calculate the Codon Similarity Index (CSI) weights for a list of DNA sequences. | |
| Args: | |
| sequences (List[str]): List of DNA sequences. | |
| Returns: | |
| dict: The CSI weights. | |
| """ | |
| return relative_adaptiveness(sequences=sequences) | |
| def get_CSI_value(dna: str, weights: Dict[str, float]) -> float: | |
| """ | |
| Calculate the Codon Similarity Index (CSI) for a DNA sequence. | |
| Args: | |
| dna (str): The DNA sequence. | |
| weights (dict): The CSI weights from get_CSI_weights. | |
| Returns: | |
| float: The CSI value. | |
| """ | |
| return CAI(dna, weights) | |
| def get_ecoli_tai_weights(): | |
| """ | |
| Returns pre-calculated tAI weights for E. coli K-12 MG1655. | |
| These weights are based on tRNA gene copy numbers and wobble base pairing rules. | |
| """ | |
| return { | |
| 'TTT': 0.58, 'TTC': 0.42, 'TTA': 0.13, 'TTG': 0.13, | |
| 'TCT': 0.15, 'TCC': 0.15, 'TCA': 0.12, 'TCG': 0.15, | |
| # ... full codon table | |
| } | |
| def calculate_tAI(sequence: str, tai_weights: Dict[str, float]) -> float: | |
| """ | |
| Calculate the tRNA Adaptation Index (tAI) for a DNA sequence. | |
| Args: | |
| sequence (str): DNA sequence (must be divisible by 3) | |
| tai_weights (Dict[str, float]): tAI weights for each codon | |
| Returns: | |
| float: Geometric mean of tAI weights for all codons in the sequence | |
| """ | |
| if len(sequence) % 3 != 0: | |
| raise ValueError("Sequence length must be divisible by 3") | |
| codons = [sequence[i:i+3].upper() for i in range(0, len(sequence), 3)] | |
| weights = [tai_weights.get(codon, 0.5) for codon in codons if codon not in ['TAA', 'TAG', 'TGA']] | |
| if not weights: | |
| return 0.0 | |
| # Geometric mean | |
| product = 1.0 | |
| for w in weights: | |
| product *= w | |
| return product ** (1.0 / len(weights)) | |
| </code></pre> | |
| </div> | |
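The geometric-mean calculation can be verified with the two codon weights shown in the table. A condensed, standalone re-implementation for illustration:

```python
import math

# tAI weights for two codons, taken from the table in the excerpt above.
weights = {'TTT': 0.58, 'TTC': 0.42}

def tai(sequence, tai_weights):
    """Geometric mean of per-codon tAI weights, skipping stop codons."""
    codons = [sequence[i:i + 3].upper() for i in range(0, len(sequence), 3)]
    ws = [tai_weights.get(c, 0.5) for c in codons if c not in ('TAA', 'TAG', 'TGA')]
    return math.prod(ws) ** (1.0 / len(ws)) if ws else 0.0

value = tai('TTTTTC', weights)  # sqrt(0.58 * 0.42), roughly 0.49
```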
| <!-- Section 6: Training Configuration --> | |
| <div class="section"> | |
| <h2 class="section-title"> | |
| <span class="section-number">6</span> | |
| Training Configuration - ALM Settings | |
| </h2> | |
| <div class="description"> | |
| YAML configuration file defining all training hyperparameters, including ALM-specific settings for GC content control. | |
| </div> | |
| <div class="file-info"> | |
| <div class="file-path">📄 configs/train_ecoli_alm.yaml</div> | |
| <div class="line-range">Complete file | Training configuration</div> | |
| </div> | |
| <div class="key-feature"> | |
| <strong>🎯 Highlight:</strong> ALM section with gc_target, curriculum_epochs, and penalty parameters | |
| </div> | |
| <pre><code class="language-yaml"># ENCOT ALM Training Configuration | |
| # This configuration reproduces the main training setup from the paper | |
| # using the Augmented-Lagrangian Method (ALM) for GC content control. | |
| model: | |
| base_model: "adibvafa/CodonTransformer-base" | |
| tokenizer: "adibvafa/CodonTransformer" | |
| data: | |
| dataset_dir: "data" | |
| # Expected files: finetune_set.json (created by preprocess_data.py) | |
| training: | |
| batch_size: 6 | |
| max_epochs: 15 | |
| learning_rate: 5e-5 | |
| warmup_fraction: 0.1 | |
| num_workers: 5 | |
| accumulate_grad_batches: 1 | |
| num_gpus: 4 | |
| save_every_n_steps: 512 | |
| seed: 123 | |
| log_every_n_steps: 20 | |
| checkpoint: | |
| checkpoint_dir: "models/alm-enhanced-training" | |
| checkpoint_filename: "balanced_alm_finetune.ckpt" | |
| # Augmented-Lagrangian Method (ALM) for GC content control | |
| alm: | |
| enabled: true | |
| gc_target: 0.52 # Target GC content for E. coli (52%) | |
| curriculum_epochs: 3 # Warm-up epochs before enforcing GC constraint | |
| # ALM penalty parameters | |
| initial_penalty_factor: 20.0 | |
| penalty_update_factor: 10.0 | |
| max_penalty: 1e6 | |
| min_penalty: 1e-6 | |
| # ALM tolerance parameters | |
| tolerance: 1e-5 # Primal tolerance | |
| dual_tolerance: 1e-5 # Dual tolerance for constraint violation | |
| tolerance_update_factor: 0.1 | |
| # Adaptive penalty adjustment | |
| rel_penalty_increase_threshold: 0.1 | |
| # Legacy penalty method (if ALM disabled) | |
| gc_penalty: | |
| weight: 0.0 # Only used if use_lagrangian=false | |
| </code></pre> | |
| </div> | |
| <!-- Section 7: Data Preparation --> | |
| <div class="section"> | |
| <h2 class="section-title"> | |
| <span class="section-number">7</span> | |
| Data Preparation & Validation | |
| </h2> | |
| <div class="description"> | |
| Functions for validating and preparing E. coli gene sequences for training, including sequence validation checks. | |
| </div> | |
| <div class="file-info"> | |
| <div class="file-path">📄 prepare_ecoli_data.py</div> | |
| <div class="line-range">Lines 5-30 | Validation function</div> | |
| </div> | |
| <div class="key-feature"> | |
| <strong>🎯 Highlight:</strong> Sequence validation rules (start/stop codons, frame, no internal stops) | |
| </div> | |
| <pre><code class="language-python">def is_valid_sequence(dna_seq: str) -> bool: | |
| """ | |
| Applies a series of validation checks to a DNA sequence. | |
| Args: | |
| dna_seq (str): The DNA sequence to validate. | |
| Returns: | |
| bool: True if the sequence is valid, False otherwise. | |
| """ | |
| # Check if length is divisible by 3 (valid codon frame) | |
| if len(dna_seq) % 3 != 0: | |
| return False | |
| # Check for valid start codon | |
| if not dna_seq.upper().startswith(('ATG', 'TTG', 'CTG', 'GTG')): | |
| return False | |
| # Check for valid stop codon | |
| if not dna_seq.upper().endswith(('TAA', 'TAG', 'TGA')): | |
| return False | |
| # Check for internal stop codons (excluding the last codon) | |
| codons = [dna_seq[i:i+3].upper() for i in range(0, len(dna_seq) - 3, 3)] | |
| if any(codon in ['TAA', 'TAG', 'TGA'] for codon in codons): | |
| return False | |
| # Check if sequence contains only valid nucleotides | |
| if not all(c in 'ATGC' for c in dna_seq.upper()): | |
| return False | |
| return True | |
| </code></pre> | |
| </div> | |
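A quick usage check of the rules above; the function body is condensed here so the snippet runs standalone:

```python
def is_valid_sequence(dna_seq: str) -> bool:
    """True iff the sequence is an in-frame ORF with a valid start codon,
    a terminal stop codon, no internal stops, and only A/T/G/C bases."""
    seq = dna_seq.upper()
    if len(seq) % 3 != 0:
        return False
    if not seq.startswith(('ATG', 'TTG', 'CTG', 'GTG')):
        return False
    if not seq.endswith(('TAA', 'TAG', 'TGA')):
        return False
    # Internal stop codons (the terminal codon is excluded from the scan).
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 3, 3)]
    if any(c in ('TAA', 'TAG', 'TGA') for c in codons):
        return False
    return all(c in 'ATGC' for c in seq)

ok = is_valid_sequence('ATGAAATAA')      # minimal valid ORF
bad = is_valid_sequence('ATGTAAAAATAA')  # internal TAA at codon 2
```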
| <!-- Section 8: Streamlit GUI --> | |
| <div class="section"> | |
| <h2 class="section-title"> | |
| <span class="section-number">8</span> | |
| Streamlit GUI - Main Interface | |
| </h2> | |
| <div class="description"> | |
| Web-based graphical interface for ENCOT built with Streamlit, providing user-friendly access to optimization features. | |
| </div> | |
| <div class="file-info"> | |
| <div class="file-path">📄 streamlit_gui/app.py</div> | |
| <div class="line-range">Lines 625-640 | Main function</div> | |
| </div> | |
| <div class="key-feature"> | |
| <strong>🎯 Highlight:</strong> Streamlit app structure with tabs and model loading | |
| </div> | |
| <pre><code class="language-python">def main(): | |
| st.title("ENCOT") | |
| st.markdown("E. coli codon optimization with constraint-aware decoding and in silico evaluation metrics.") | |
| # Load model | |
| load_model_and_tokenizer() | |
| # Create the main tabbed interface | |
| tab1, tab2, tab3, tab4 = st.tabs([ | |
| "Single Optimize", | |
| "Batch Process", | |
| "Comparative Analysis", | |
| "Advanced Settings" | |
| ]) | |
| with tab1: | |
| single_sequence_optimization() | |
| with tab2: | |
| batch_processing() | |
| with tab3: | |
| comparative_analysis() | |
| with tab4: | |
| advanced_settings() | |
| # Footer | |
| st.markdown("---") | |
| st.markdown("**ENCOT**") | |
| st.markdown("Open-source codon optimization for E. coli with reproducible evaluation.") | |
| </code></pre> | |
| </div> | |
| <!-- Section 9: Benchmark Evaluation --> | |
| <div class="section"> | |
| <h2 class="section-title"> | |
| <span class="section-number">9</span> | |
| Benchmark Evaluation Pipeline | |
| </h2> | |
| <div class="description"> | |
| Comprehensive benchmarking pipeline for evaluating ENCOT performance on test sequences with multiple metrics. | |
| </div> | |
| <div class="file-info"> | |
| <div class="file-path">📄 benchmark_evaluation.py</div> | |
| <div class="line-range">Lines 300-400 | Benchmark function</div> | |
| </div> | |
| <div class="key-feature"> | |
| <strong>🎯 Highlight:</strong> Multi-metric evaluation (CAI, tAI, GC, cis-elements) | |
| </div> | |
| <pre><code class="language-python">def benchmark_sequences(sequences, model, tokenizer, device, cai_weights, tai_weights): | |
| """ | |
| Run ENCOT on protein sequences and compute metrics for optimized DNA. | |
| Args: | |
| sequences: List of protein sequences to optimize | |
| model: Loaded ENCOT model | |
| tokenizer: Tokenizer for the model | |
| device: PyTorch device (CPU/GPU) | |
| cai_weights: Pre-computed CAI weights | |
| tai_weights: Pre-computed tAI weights | |
| Returns: | |
| DataFrame with optimization results and metrics | |
| """ | |
| results = [] | |
| for name, protein in tqdm(sequences, desc="Optimizing sequences"): | |
| # Optimize the sequence | |
| output = predict_dna_sequence( | |
| protein=protein, | |
| organism="Escherichia coli general", | |
| device=device, | |
| model=model, | |
| tokenizer=tokenizer, | |
| deterministic=True, | |
| use_constrained_search=True, | |
| gc_bounds=(0.45, 0.55) | |
| ) | |
| optimized_dna = output.predicted_dna | |
| # Calculate metrics | |
| cai = get_CSI_value(optimized_dna, cai_weights) | |
| tai = calculate_tAI(optimized_dna, tai_weights) | |
| gc_content = get_GC_content(optimized_dna) | |
| cis_elements = count_negative_cis_elements(optimized_dna) | |
| results.append({ | |
| 'name': name, | |
| 'protein': protein, | |
| 'optimized_dna': optimized_dna, | |
| 'CAI': cai, | |
| 'tAI': tai, | |
| 'GC_content': gc_content, | |
| 'negative_cis_elements': cis_elements | |
| }) | |
| return pd.DataFrame(results) | |
| </code></pre> | |
| </div> | |
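get_GC_content is called in the pipeline above but not shown. A minimal stand-in consistent with the usual definition (an assumption; the real function may differ, e.g. by returning a fraction rather than a percentage):

```python
def get_GC_content(dna: str) -> float:
    """Percentage of G and C nucleotides in a DNA sequence (hypothetical
    stand-in for the library function of the same name)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return 100.0 * sum(1 for base in dna if base in 'GC') / len(dna)

gc = get_GC_content('ATGCGC')  # 4 of 6 bases are G or C
```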
| <!-- Section 10: Project Structure --> | |
| <div class="section"> | |
| <h2 class="section-title"> | |
| <span class="section-number">10</span> | |
| Project Overview & Architecture | |
| </h2> | |
| <div class="description"> | |
| Complete project structure showing the organization of modules, scripts, and configuration files. | |
| </div> | |
| <div class="key-feature"> | |
| <strong>🎯 Key Components:</strong> Training (finetune.py), Inference (CodonPrediction.py), | |
| Evaluation (CodonEvaluation.py), GUI (streamlit_gui/), Configs (configs/) | |
| </div> | |
| <pre><code class="language-plaintext">ENCOT/ | |
| ├── CodonTransformer/              # Core library modules | |
| │   ├── CodonPrediction.py         # Model loading & DNA sequence prediction | |
| │   ├── CodonEvaluation.py         # Metrics (CAI, tAI, GC, CFD, etc.) | |
| │   ├── CodonData.py               # Data preprocessing & preparation | |
| │   ├── CodonUtils.py              # Constants, mappings, utilities | |
| │   └── CodonPostProcessing.py     # DNA-Chisel integration | |
| │ | |
| ├── scripts/                       # Command-line tools | |
| │   ├── train.py                   # Training wrapper | |
| │   ├── optimize_sequence.py       # Sequence optimization CLI | |
| │   ├── run_benchmarks.py          # Benchmark evaluation | |
| │   └── preprocess_data.py         # Data preparation | |
| │ | |
| ├── configs/                       # YAML configurations | |
| │   ├── train_ecoli_alm.yaml       # Main ALM training config ★ | |
| │   └── train_ecoli_quick.yaml     # Quick test config | |
| │ | |
| ├── streamlit_gui/                 # Web interface | |
| │   ├── app.py                     # Main Streamlit GUI ★ | |
| │   ├── demo.py                    # Demo script | |
| │   └── run_gui.py                 # Launcher | |
| │ | |
| ├── data/                          # Datasets | |
| │   ├── finetune_set.json          # Training data | |
| │   └── test_set.json              # Test data | |
| │ | |
| ├── finetune.py                    # Main training script ★★★ | |
| ├── benchmark_evaluation.py        # Evaluation script | |
| ├── setup.py                       # Package setup | |
| ├── pyproject.toml                 # Project configuration | |
| └── README.md                      # Documentation | |
| Key Innovations: | |
| ├── Augmented-Lagrangian Method (ALM) for GC control | |
| ├── Constrained beam search with GC bounds | |
| └── Multi-metric evaluation (CAI, tAI, GC, cis-elements) | |
| </div> | |
| <div class="footer"> | |
| <h3>ENCOT - Enhanced Codon Optimization Tool</h3> | |
| <p>Repository: <a href="https://github.com/geno543/ENCOT" style="color: #58a6ff;">github.com/geno543/ENCOT</a></p> | |
| <p>© 2026 | Apache License 2.0</p> | |
| </div> | |
| <script> | |
| // Initialize syntax highlighting | |
| hljs.highlightAll(); | |
| // Add line numbers | |
| document.querySelectorAll('pre code').forEach((block) => { | |
| const lines = block.innerHTML.split('\n'); | |
| const numberedLines = lines.map((line, index) => { | |
| return `<span class="line-number" style="color: #6e7681; user-select: none; margin-right: 1em;">${String(index + 1).padStart(3, ' ')}</span>${line}`; | |
| }).join('\n'); | |
| block.innerHTML = numberedLines; | |
| }); | |
| </script> | |
| </body> | |
| </html> |