---
language:
- en
tags:
- protein-design
- antimicrobial-peptides
- flow-matching
- esm-2
- pytorch
license: mit
datasets:
- uniprot
- amp-datasets
metrics:
- mic-prediction
- sequence-validity
- diversity
---
# FlowAMP: Flow-based Antimicrobial Peptide Generation

## Model Description
FlowAMP is a flow-based generative model for designing antimicrobial peptides (AMPs) that combines conditional flow matching with ESM-2 protein language model embeddings. Flow matching drives high-quality peptide generation, while the language-model embeddings ground the outputs in biologically relevant sequence representations.
## Architecture

The model consists of several key components:

- **ESM-2 Encoder**: Uses ESM-2 (`esm2_t33_650M_UR50D`) to extract 1280-dimensional protein sequence embeddings
- **Compressor/Decompressor**: Reduces embedding dimensionality by 16x (1280 → 80) for efficient processing
- **Flow Matcher**: Implements conditional flow matching for generation with time embeddings
- **CFG Integration**: Classifier-free guidance for controllable generation
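The compressor/decompressor pair can be sketched as a small bottleneck MLP. The class name, hidden width, and activation below are illustrative assumptions, not the released implementation — only the 1280 → 80 dimensions come from the description above:

```python
import torch
import torch.nn as nn

class EmbeddingCompressor(nn.Module):
    """Compress 1280-d ESM-2 embeddings to an 80-d latent (16x reduction)."""
    def __init__(self, in_dim=1280, latent_dim=80, hidden=320):
        super().__init__()
        self.down = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, latent_dim)
        )
        self.up = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.GELU(), nn.Linear(hidden, in_dim)
        )

    def forward(self, x):
        z = self.down(x)           # (batch, seq, 80) latent for the flow matcher
        return z, self.up(z)       # latent and its reconstruction

emb = torch.randn(4, 50, 1280)     # 4 sequences, max length 50, ESM-2 dim 1280
z, recon = EmbeddingCompressor()(emb)
print(z.shape, recon.shape)        # torch.Size([4, 50, 80]) torch.Size([4, 50, 1280])
```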
### Key Features

- **Flow-based Generation**: Uses conditional flow matching for high-quality peptide generation
- **ESM-2 Integration**: Leverages ESM-2 protein language model embeddings for sequence understanding
- **CFG Training**: Implements classifier-free guidance for controllable generation
- **Multi-GPU Training**: Optimized for H100 GPUs with mixed-precision training
- **Comprehensive Evaluation**: MIC prediction and antimicrobial activity assessment
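At sampling time, classifier-free guidance combines the model's conditional and unconditional velocity predictions, extrapolating past the conditional one as the guidance strength grows. A minimal sketch (the function name is hypothetical):

```python
import torch

def cfg_velocity(v_cond, v_uncond, cfg_strength):
    """Classifier-free guidance: 0 -> unconditional, 1 -> conditional,
    >1 extrapolates beyond the conditional prediction."""
    return v_uncond + cfg_strength * (v_cond - v_uncond)

v_c = torch.tensor([1.0, 2.0])   # toy conditional velocity
v_u = torch.tensor([0.0, 0.0])   # toy unconditional velocity
print(cfg_velocity(v_c, v_u, 2.0))  # tensor([2., 4.])
```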
## Training

### Training Data

The model was trained on:

- **UniProt Database**: Comprehensive protein sequence database
- **AMP Datasets**: Curated antimicrobial peptide sequences
- **ESM-2 Embeddings**: Pre-computed embeddings for efficient training
### Training Configuration

- **Batch Size**: 96 (optimized for H100)
- **Learning Rate**: 4e-4 with cosine annealing to 2e-4
- **Epochs**: 6000
- **Mixed Precision**: BF16 for H100 optimization
- **CFG Dropout**: 15% for unconditional training
- **Gradient Clipping**: Norm = 1.0 for stability
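The configuration above can be assembled into a training step roughly as follows. The model and loss are stand-ins, not FlowAMP's actual modules; only the hyperparameters (learning-rate schedule, weight decay, CFG dropout rate, clip norm, BF16) come from the list above:

```python
import torch

model = torch.nn.Linear(80, 80)  # stand-in for the flow matcher
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.01)
# Cosine annealing from 4e-4 down to 2e-4 over 6000 epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=6000, eta_min=2e-4
)

def training_step(x, cond):
    # CFG dropout: drop the condition 15% of the time for unconditional training
    if torch.rand(()).item() < 0.15:
        cond = torch.zeros_like(cond)
    device_type = "cuda" if torch.cuda.is_available() else "cpu"
    with torch.autocast(device_type, dtype=torch.bfloat16):  # BF16 mixed precision
        loss = ((model(x) - cond) ** 2).mean()  # placeholder for the flow-matching loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()

loss = training_step(torch.randn(16, 80), torch.randn(16, 80))
scheduler.step()  # advance the cosine schedule once per epoch
```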
### Training Performance

- **Speed**: 31 steps/second on an H100 GPU
- **Memory Efficiency**: Mixed-precision (BF16) training
- **Stability**: Gradient clipping and weight decay (0.01)
## Usage

### Basic Generation
```python
from final_flow_model import AMPFlowMatcherCFGConcat
from generate_amps import generate_amps

# Load trained model
model = AMPFlowMatcherCFGConcat.load_from_checkpoint('path/to/checkpoint.pth')

# Generate AMPs with different CFG strengths
sequences_no_cfg = generate_amps(model, num_samples=100, cfg_strength=0.0)
sequences_weak_cfg = generate_amps(model, num_samples=100, cfg_strength=1.0)
sequences_strong_cfg = generate_amps(model, num_samples=100, cfg_strength=2.0)
sequences_very_strong_cfg = generate_amps(model, num_samples=100, cfg_strength=3.0)
```
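Internally, flow-matching generation integrates a learned velocity field from Gaussian noise (t = 0) toward data (t = 1). A minimal Euler sampler sketch — the function name, step count, and toy velocity field are assumptions standing in for the trained model:

```python
import torch

@torch.no_grad()
def euler_sample(velocity_fn, shape, num_steps=50):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    x = torch.randn(shape)               # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt)  # per-sample time for time embeddings
        x = x + velocity_fn(x, t) * dt       # one Euler step along the flow
    return x

# Toy velocity field pulling samples toward the origin; the real model would
# predict velocities in the 80-d compressed embedding space.
samples = euler_sample(lambda x, t: -x, (8, 50, 80))
```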
### Evaluation
```python
from test_generated_peptides import evaluate_generated_peptides

# Evaluate generated sequences for antimicrobial activity
results = evaluate_generated_peptides(sequences)
```
## Performance

### Generation Quality

- **Sequence Validity**: High percentage of valid peptide sequences
- **Diversity**: Good sequence diversity across different CFG strengths
- **Biological Relevance**: ESM-2 embeddings encourage biologically meaningful sequences
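Sequence validity and diversity can be quantified with simple metrics. A sketch of two such measures — these exact definitions are assumptions for illustration, not necessarily the ones used in the evaluation pipeline:

```python
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def sequence_validity(seqs):
    """Fraction of non-empty sequences containing only standard amino acids."""
    ok = [s for s in seqs if s and set(s) <= VALID_AA]
    return len(ok) / len(seqs)

def diversity(seqs):
    """Fraction of unique sequences in the sample."""
    return len(set(seqs)) / len(seqs)

seqs = ["GIGKFLKK", "GIGKFLKK", "KWKLFKKI", "GXBADSEQ"]  # toy sample
print(sequence_validity(seqs))  # 0.75 ("GXBADSEQ" has non-standard letters)
print(diversity(seqs))          # 0.75 (one duplicate)
```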
### Antimicrobial Activity

- **MIC Prediction**: Integration with the Apex model for MIC prediction
- **Activity Assessment**: Comprehensive evaluation of antimicrobial potential
- **CFG Effectiveness**: Measured through controlled generation
## Limitations

- **Sequence Length**: Limited to a maximum of 50 amino acids
- **Computational Requirements**: Requires a GPU for efficient generation
- **Training Data**: Dependent on the quality of the UniProt and AMP datasets
## Citation

```bibtex
@article{flowamp2024,
  title={FlowAMP: Flow-based Antimicrobial Peptide Generation with Conditional Flow Matching},
  author={Sun, Edward},
  journal={arXiv preprint},
  year={2024}
}
```
## License

MIT License - see the LICENSE file for details.