|
|
--- |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- protein-design |
|
|
- antimicrobial-peptides |
|
|
- flow-matching |
|
|
- esm-2 |
|
|
- pytorch |
|
|
license: mit |
|
|
datasets: |
|
|
- uniprot |
|
|
- amp-datasets |
|
|
metrics: |
|
|
- mic-prediction |
|
|
- sequence-validity |
|
|
- diversity |
|
|
--- |
|
|
|
|
|
# FlowAMP: Flow-based Antimicrobial Peptide Generation |
|
|
|
|
|
## Model Description |
|
|
|
|
|
FlowAMP is a flow-based generative model for designing antimicrobial peptides (AMPs). It combines conditional flow matching with ESM-2 protein language model embeddings: flow matching provides high-quality, controllable generation, while the language model grounds sampling in biologically relevant sequence space. |
|
|
|
|
|
### Architecture |
|
|
|
|
|
The model consists of several key components: |
|
|
|
|
|
1. **ESM-2 Encoder**: Uses ESM-2 (esm2_t33_650M_UR50D) to extract 1280-dimensional protein sequence embeddings |
|
|
2. **Compressor/Decompressor**: Reduces embedding dimensionality by 16x (1280 → 80) for efficient processing |
|
|
3. **Flow Matcher**: Implements conditional flow matching for generation with time embeddings |
|
|
4. **CFG Integration**: Classifier-free guidance for controllable generation |
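
The compressor/decompressor pair can be pictured as a small bottleneck network around the 1280-dimensional ESM-2 embeddings. The sketch below is illustrative only: the 1280 → 80 dimensions come from this card, but the module names, hidden width, and layer choices are assumptions, not the actual FlowAMP code.

```python
import torch
import torch.nn as nn

class Compressor(nn.Module):
    """Illustrative bottleneck: 1280-d ESM-2 embeddings -> 80-d latent (16x) and back."""
    def __init__(self, d_in=1280, d_lat=80, d_hidden=320):
        super().__init__()
        self.down = nn.Sequential(nn.Linear(d_in, d_hidden), nn.GELU(),
                                  nn.Linear(d_hidden, d_lat))
        self.up = nn.Sequential(nn.Linear(d_lat, d_hidden), nn.GELU(),
                                nn.Linear(d_hidden, d_in))

    def forward(self, x):
        z = self.down(x)          # compressed latent used by the flow matcher
        return z, self.up(z)      # latent + reconstruction

comp = Compressor()
x = torch.randn(4, 50, 1280)      # batch of 4 peptides, 50 residues, ESM-2 features
z, x_hat = comp(x)
print(z.shape, x_hat.shape)       # torch.Size([4, 50, 80]) torch.Size([4, 50, 1280])
```

Working in the 80-dimensional latent rather than the raw 1280-dimensional embedding space is what makes flow-matching training and sampling cheap enough to run at scale.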
|
|
|
|
|
### Key Features |
|
|
|
|
|
- **Flow-based Generation**: Uses conditional flow matching for high-quality peptide generation |
|
|
- **ESM-2 Integration**: Leverages ESM-2 protein language model embeddings for sequence understanding |
|
|
- **CFG Training**: Implements Classifier-Free Guidance for controllable generation |
|
|
- **Multi-GPU Training**: Optimized for H100 GPUs with mixed precision training |
|
|
- **Comprehensive Evaluation**: MIC prediction and antimicrobial activity assessment |
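
Two of these features can be made concrete in a few lines: the conditional flow matching objective (regress the velocity of a straight path from noise to data) and classifier-free guidance (train with condition dropout, then blend conditional and unconditional velocities at sampling time). This is a generic sketch of those two standard techniques, not FlowAMP's actual code; the velocity network, the 80-d latent size, and all names are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical velocity network over the 80-d latent; the real model also uses
# learned time embeddings rather than a raw scalar t.
net = nn.Sequential(nn.Linear(80 + 1 + 80, 256), nn.SiLU(), nn.Linear(256, 80))

def velocity(x, t, cond):
    t_feat = t.expand(x.shape[0], 1)
    return net(torch.cat([x, t_feat, cond], dim=-1))

def cfm_loss(x1, cond, p_drop=0.15):
    """Conditional flow matching: x_t = (1-t)x0 + t x1, target velocity x1 - x0."""
    x0 = torch.randn_like(x1)                           # noise endpoint
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1
    # CFG dropout: zero the condition 15% of the time so the net also
    # learns the unconditional velocity field
    keep = (torch.rand(x1.shape[0], 1) > p_drop).float()
    v = velocity(xt, t, cond * keep)
    return ((v - (x1 - x0)) ** 2).mean()

@torch.no_grad()
def sample(cond, steps=50, cfg_strength=2.0):
    """Euler integration with CFG: v = v_uncond + s * (v_cond - v_uncond)."""
    x = torch.randn(cond.shape[0], 80)
    for i in range(steps):
        t = torch.full((1, 1), i / steps)
        v_c = velocity(x, t, cond)
        v_u = velocity(x, t, torch.zeros_like(cond))
        x = x + (v_u + cfg_strength * (v_c - v_u)) / steps
    return x

cond = torch.randn(8, 80)          # e.g. a compressed ESM-2 conditioning embedding
loss = cfm_loss(torch.randn(8, 80), cond)
samples = sample(cond, steps=10)
```

With `cfg_strength=0.0` the sampler follows the unconditional field; larger values push samples harder toward the condition, which is the knob swept in the usage example below.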
|
|
|
|
|
## Training |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on: |
|
|
- **UniProt Database**: Comprehensive protein sequence database |
|
|
- **AMP Datasets**: Curated antimicrobial peptide sequences |
|
|
- **ESM-2 Embeddings**: Pre-computed embeddings for efficient training |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
- **Batch Size**: 96 (optimized for H100) |
|
|
- **Learning Rate**: 4e-4 with cosine annealing to 2e-4 |
|
|
- **Epochs**: 6000 |
|
|
- **Mixed Precision**: BF16 for H100 optimization |
|
|
- **CFG Dropout**: 15% for unconditional training |
|
|
- **Gradient Clipping**: Norm=1.0 for stability |
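
In PyTorch, the hyperparameters listed above map onto a training loop along these lines. The numbers (4e-4 → 2e-4 cosine schedule, weight decay 0.01, clip norm 1.0, batch size 96, BF16) come from this card; the stand-in model, loss, and loop structure are assumptions for illustration.

```python
import torch

model = torch.nn.Linear(80, 80)    # stand-in for the flow matcher
opt = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.01)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=6000, eta_min=2e-4)

for epoch in range(3):             # 6000 in the configuration above
    # BF16 mixed precision; on an H100 this would be device_type="cuda"
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = model(torch.randn(96, 80)).pow(2).mean()   # batch size 96
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    opt.zero_grad()
    sched.step()                   # anneal 4e-4 -> 2e-4 over training
```

Note that BF16 needs no loss scaling (unlike FP16), which keeps the loop simple and is part of why it is the preferred mixed-precision mode on H100-class hardware.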
|
|
|
|
|
### Training Performance |
|
|
|
|
|
- **Speed**: 31 steps/second on H100 GPU |
|
|
- **Memory Efficiency**: Mixed precision training |
|
|
- **Stability**: Gradient clipping and weight decay (0.01) |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Basic Generation |
|
|
|
|
|
```python |
|
|
from final_flow_model import AMPFlowMatcherCFGConcat |
|
|
from generate_amps import generate_amps |
|
|
|
|
|
# Load trained model |
|
|
model = AMPFlowMatcherCFGConcat.load_from_checkpoint('path/to/checkpoint.pth') |
|
|
|
|
|
# Generate AMPs with different CFG strengths |
|
|
sequences_no_cfg = generate_amps(model, num_samples=100, cfg_strength=0.0) |
|
|
sequences_weak_cfg = generate_amps(model, num_samples=100, cfg_strength=1.0) |
|
|
sequences_strong_cfg = generate_amps(model, num_samples=100, cfg_strength=2.0) |
|
|
sequences_very_strong_cfg = generate_amps(model, num_samples=100, cfg_strength=3.0) |
|
|
``` |
|
|
|
|
|
### Evaluation |
|
|
|
|
|
```python |
|
|
from test_generated_peptides import evaluate_generated_peptides |
|
|
|
|
|
# Evaluate generated sequences for antimicrobial activity |
|
|
results = evaluate_generated_peptides(sequences) |
|
|
``` |
|
|
|
|
|
## Performance |
|
|
|
|
|
### Generation Quality |
|
|
|
|
|
- **Sequence Validity**: High percentage of valid peptide sequences |
|
|
- **Diversity**: Good sequence diversity across different CFG strengths |
|
|
- **Biological Relevance**: ESM-2 embeddings bias generation toward biologically plausible sequences |
|
|
|
|
|
### Antimicrobial Activity |
|
|
|
|
|
- **MIC Prediction**: Integration with Apex model for MIC prediction |
|
|
- **Activity Assessment**: Comprehensive evaluation of antimicrobial potential |
|
|
- **CFG Effectiveness**: Measured through controlled generation |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Sequence Length**: Generated peptides are capped at 50 amino acids |
|
|
- **Computational Requirements**: Requires GPU for efficient generation |
|
|
- **Training Data**: Dependent on quality of UniProt and AMP datasets |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@article{flowamp2024, |
|
|
title={FlowAMP: Flow-based Antimicrobial Peptide Generation with Conditional Flow Matching}, |
|
|
author={Sun, Edward}, |
|
|
journal={arXiv preprint}, |
|
|
year={2024} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License - see LICENSE file for details. |
|
|
|