---
language:
- en
tags:
- protein-design
- antimicrobial-peptides
- flow-matching
- esm-2
- pytorch
license: mit
datasets:
- uniprot
- amp-datasets
metrics:
- mic-prediction
- sequence-validity
- diversity
---

# FlowAMP: Flow-based Antimicrobial Peptide Generation

## Model Description

FlowAMP is a flow-based generative model for designing antimicrobial peptides (AMPs) using conditional flow matching and ESM-2 protein language model embeddings. Flow matching provides high-quality peptide generation, while the protein language model grounds generation in biologically relevant sequence representations.

### Architecture

The model consists of several key components:

1. **ESM-2 Encoder**: Uses ESM-2 (esm2_t33_650M_UR50D) to extract 1280-dimensional protein sequence embeddings
2. **Compressor/Decompressor**: Reduces embedding dimensionality by 16x (1280 → 80) for efficient processing
3. **Flow Matcher**: Implements conditional flow matching for generation with time embeddings
4. **CFG Integration**: Classifier-free guidance for controllable generation

### Key Features

- **Flow-based Generation**: Conditional flow matching for high-quality peptide generation
- **ESM-2 Integration**: ESM-2 protein language model embeddings for sequence understanding
- **CFG Training**: Classifier-free guidance for controllable generation
- **Multi-GPU Training**: Optimized for H100 GPUs with mixed-precision training
- **Comprehensive Evaluation**: MIC prediction and antimicrobial activity assessment

## Training

### Training Data

The model was trained on:

- **UniProt Database**: Comprehensive protein sequence database
- **AMP Datasets**: Curated antimicrobial peptide sequences
- **ESM-2 Embeddings**: Pre-computed embeddings for efficient training

### Training Configuration

- **Batch Size**: 96 (optimized for H100)
- **Learning Rate**: 4e-4 with cosine annealing to 2e-4
- **Epochs**: 6000
- **Mixed Precision**: BF16 for H100 optimization
- **CFG Dropout**: 15% of conditioning dropped for unconditional training
- **Gradient Clipping**: Norm = 1.0 for stability

### Training Performance

- **Speed**: 31 steps/second on an H100 GPU
- **Memory Efficiency**: Mixed-precision (BF16) training
- **Stability**: Gradient clipping and weight decay (0.01)

## Usage

### Basic Generation

```python
from final_flow_model import AMPFlowMatcherCFGConcat
from generate_amps import generate_amps

# Load trained model
model = AMPFlowMatcherCFGConcat.load_from_checkpoint('path/to/checkpoint.pth')

# Generate AMPs with different CFG strengths
sequences_no_cfg = generate_amps(model, num_samples=100, cfg_strength=0.0)
sequences_weak_cfg = generate_amps(model, num_samples=100, cfg_strength=1.0)
sequences_strong_cfg = generate_amps(model, num_samples=100, cfg_strength=2.0)
sequences_very_strong_cfg = generate_amps(model, num_samples=100, cfg_strength=3.0)
```

### Evaluation

```python
from test_generated_peptides import evaluate_generated_peptides

# Evaluate generated sequences for antimicrobial activity
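# Optional sanity checks before full evaluation. The helpers below are a
# hypothetical sketch (not part of test_generated_peptides): they compute the
# sequence-validity and diversity metrics this card reports, and can be run on
# any of the sequence lists produced by generate_amps above.
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")

def valid_fraction(seqs):
    # Fraction of sequences built only from the 20 canonical amino acids.
    return sum(set(s) <= CANONICAL_AA for s in seqs) / len(seqs)

def unique_fraction(seqs):
    # Fraction of distinct sequences (a crude diversity proxy).
    return len(set(seqs)) / len(seqs)
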
results = evaluate_generated_peptides(sequences_strong_cfg)
```

## Performance

### Generation Quality

- **Sequence Validity**: High percentage of valid peptide sequences
- **Diversity**: Good sequence diversity across CFG strengths
- **Biological Relevance**: ESM-2 embeddings bias generation toward biologically meaningful sequences

### Antimicrobial Activity

- **MIC Prediction**: Integration with the Apex model for MIC prediction
- **Activity Assessment**: Comprehensive evaluation of antimicrobial potential
- **CFG Effectiveness**: Measured through controlled generation

## Limitations

- **Sequence Length**: Limited to a maximum of 50 amino acids
- **Computational Requirements**: Requires a GPU for efficient generation
- **Training Data**: Dependent on the quality of the UniProt and AMP datasets

## Citation

```bibtex
@article{flowamp2024,
  title={FlowAMP: Flow-based Antimicrobial Peptide Generation with Conditional Flow Matching},
  author={Sun, Edward},
  journal={arXiv preprint},
  year={2024}
}
```

## License

MIT License - see LICENSE file for details.
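## Appendix: CFG Sampling Sketch

The `cfg_strength` argument in `generate_amps` follows the standard classifier-free guidance rule applied to the flow's predicted velocity, `v = v_uncond + s * (v_cond - v_uncond)`, integrated from noise (t=0) to data (t=1). The sketch below is illustrative only: a toy constant velocity field stands in for the trained flow matcher, and the 80-dimensional vectors mirror the compressed embedding dimension.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, s):
    # Classifier-free guidance: s=0 is unconditional, s=1 fully conditional,
    # s>1 extrapolates toward the condition.
    return v_uncond + s * (v_cond - v_uncond)

def euler_sample(velocity_fn, x0, s, n_steps=100):
    # Euler integration of dx/dt = v(x, t) from t=0 (noise) to t=1 (sample).
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        v_cond, v_uncond = velocity_fn(x, i * dt)
        x = x + dt * cfg_velocity(v_cond, v_uncond, s)
    return x

# Toy stand-in for the trained model: constant conditional/unconditional
# velocities in the 80-dim compressed embedding space.
def toy_velocity(x, t):
    return np.ones(80), np.zeros(80)

sample = euler_sample(toy_velocity, np.zeros(80), s=2.0)
```

In the real pipeline, `velocity_fn` would be two forward passes of `AMPFlowMatcherCFGConcat` (with and without conditioning), and the final embedding would be decompressed and decoded back to an amino-acid sequence.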