# FlowAMP: Flow-based Antimicrobial Peptide Generation
## Overview
FlowAMP is a flow-based generative model for designing antimicrobial peptides (AMPs). It combines conditional flow matching with ESM-2 protein language model embeddings to enable de novo AMP design with improved generation quality and diversity.
## Key Features
- **Flow-based Generation**: Uses conditional flow matching for high-quality peptide generation
- **ESM-2 Integration**: Leverages ESM-2 protein language model embeddings for sequence understanding
- **CFG Training**: Implements Classifier-Free Guidance for controllable generation
- **Multi-GPU Training**: Optimized for H100 GPUs with mixed precision training
- **Comprehensive Evaluation**: MIC prediction and antimicrobial activity assessment
## Project Structure
```
flow/
├── final_flow_model.py                         # Main FlowAMP model architecture
├── final_sequence_encoder.py                   # ESM-2 sequence encoding
├── final_sequence_decoder.py                   # Sequence decoding and generation
├── compressor_with_embeddings.py               # Embedding compression/decompression
├── cfg_dataset.py                              # CFG dataset and dataloader
├── amp_flow_training_single_gpu_full_data.py   # Single-GPU training
├── amp_flow_training_multi_gpu.py              # Multi-GPU training
├── generate_amps.py                            # AMP generation script
├── test_generated_peptides.py                  # Evaluation and testing
├── apex/                                       # Apex model integration
│   ├── trained_models/                         # Pre-trained Apex models
│   └── AMP_DL_model_twohead.py                 # Apex model architecture
├── normalization_stats.pt                      # Preprocessing statistics
└── requirements.yaml                           # Dependencies
```
## Model Architecture
The FlowAMP model consists of:
1. **ESM-2 Encoder**: Extracts protein sequence embeddings using ESM-2
2. **Compressor/Decompressor**: Reduces embedding dimensionality for efficiency
3. **Flow Matcher**: Conditional flow matching for generation
4. **CFG Integration**: Classifier-free guidance for controllable generation
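The core training objective can be illustrated with a minimal conditional flow matching sketch. This is a generic illustration, not the repository's exact implementation: the linear interpolation path and constant-velocity target are the standard CFM formulation, and `model` stands in for the flow matcher operating on (compressed) ESM-2 embeddings.

```python
import torch

def cfm_loss(model, x1, cond):
    """One conditional flow matching training step (sketch).

    x1   : target peptide embeddings, shape (B, D)
    cond : conditioning embeddings, shape (B, C)
    """
    x0 = torch.randn_like(x1)            # noise sample at t = 0
    t = torch.rand(x1.size(0), 1)        # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1           # linear interpolation path
    target = x1 - x0                     # constant velocity along that path
    pred = model(xt, t, cond)            # predicted velocity field
    return torch.nn.functional.mse_loss(pred, target)
```

At sampling time, the learned velocity field is integrated from noise at `t = 0` to an embedding at `t = 1`, which the decoder then maps back to a peptide sequence.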
## Training
### Single GPU Training
```bash
python amp_flow_training_single_gpu_full_data.py
```
### Multi-GPU Training
```bash
bash launch_multi_gpu_training.sh
```
### Key Training Parameters
- **Batch Size**: 96 (optimized for H100)
- **Learning Rate**: 4e-4 with cosine annealing
- **Epochs**: 6000
- **Mixed Precision**: BF16 for H100 optimization
- **CFG Dropout**: conditioning dropped for 15% of samples so the model also learns an unconditional velocity field
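The CFG dropout setting above amounts to randomly zeroing the conditioning vector during training. A minimal sketch (the function name and zero-vector null conditioning are assumptions, not the repository's exact code):

```python
import torch

def apply_cfg_dropout(cond, p_uncond=0.15):
    """Zero each sample's conditioning with probability p_uncond,
    so the same network learns both conditional and unconditional
    velocity fields (sketch; 15% matches the setting above)."""
    keep = (torch.rand(cond.size(0), 1) >= p_uncond).float()
    return cond * keep
```

Training on this mixture is what later lets generation interpolate between unconditional and conditional predictions via a guidance weight.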
## Generation
Generate AMPs with different CFG strengths:
```bash
python generate_amps.py --cfg_strength 0.0 # No CFG
python generate_amps.py --cfg_strength 1.0 # Weak CFG
python generate_amps.py --cfg_strength 2.0 # Strong CFG
python generate_amps.py --cfg_strength 3.0 # Very Strong CFG
```
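Conceptually, the `--cfg_strength` flag blends conditional and unconditional velocity predictions during sampling. A minimal Euler-integration sketch, assuming the convention `v = v_cond + w * (v_cond - v_uncond)` so that `w = 0` recovers plain conditional sampling (matching "No CFG" above); the function and argument names are illustrative, not the repository's API:

```python
import torch

@torch.no_grad()
def sample_with_cfg(model, cond, dim, steps=100, cfg_strength=1.0):
    """Integrate the learned velocity field from noise (t=0) to
    an embedding (t=1) with classifier-free guidance (sketch)."""
    x = torch.randn(cond.size(0), dim)
    null_cond = torch.zeros_like(cond)       # assumed null conditioning
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.size(0), 1), i * dt)
        v_cond = model(x, t, cond)
        v_uncond = model(x, t, null_cond)
        # guided velocity: push further along the conditional direction
        v = v_cond + cfg_strength * (v_cond - v_uncond)
        x = x + v * dt
    return x
```

Larger `cfg_strength` pushes samples harder toward the conditioning signal, typically trading diversity for conditional fidelity.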
## Evaluation
### MIC Prediction
The model includes integration with Apex for MIC (Minimum Inhibitory Concentration) prediction:
```bash
python test_generated_peptides.py
```
### Performance Metrics
- **Generation Quality**: Evaluated using sequence diversity and validity
- **Antimicrobial Activity**: Predicted using Apex model integration
- **CFG Effectiveness**: Measured through controlled generation
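The validity and diversity metrics above can be computed with simple batch statistics. A sketch of one reasonable definition (the repository's exact metrics may differ): validity as the fraction of sequences using only the 20 canonical amino acids, diversity as the fraction of unique sequences.

```python
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # 20 canonical amino acids

def validity(sequences):
    """Fraction of sequences containing only canonical amino acids."""
    ok = sum(all(c in VALID_AA for c in s) for s in sequences)
    return ok / len(sequences)

def diversity(sequences):
    """Fraction of unique sequences in the generated batch."""
    return len(set(sequences)) / len(sequences)
```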
## Results
### Training Performance
- **Optimized for H100**: 31 steps/second with batch size 96
- **Mixed Precision**: BF16 training for memory efficiency
- **Gradient Clipping**: Stable training with max norm 1.0
### Generation Results
- **Sequence Validity**: High percentage of valid peptide sequences
- **Diversity**: Good sequence diversity across different CFG strengths
- **Antimicrobial Potential**: Predicted MIC values for generated sequences
## Dependencies
Key dependencies include:
- PyTorch 2.0+
- Transformers (for ESM-2)
- Wandb (optional logging)
- Apex (for MIC prediction)
See `requirements.yaml` for complete dependency list.
## Usage Examples
### Basic AMP Generation
```python
from final_flow_model import AMPFlowMatcherCFGConcat
from generate_amps import generate_amps
# Load trained model
model = AMPFlowMatcherCFGConcat.load_from_checkpoint('path/to/checkpoint.pth')
# Generate AMPs
sequences = generate_amps(model, num_samples=100, cfg_strength=1.0)
```
### Evaluation
```python
from test_generated_peptides import evaluate_generated_peptides
# Evaluate generated sequences
results = evaluate_generated_peptides(sequences)
```
## Research Impact
This work contributes to:
- **Flow-based Protein Design**: Novel application of flow matching to peptide generation
- **Conditional Generation**: CFG integration for controllable AMP design
- **ESM-2 Integration**: Leveraging protein language models for sequence understanding
- **Antimicrobial Discovery**: Automated design of potential therapeutic peptides
## Citation
If you use this code in your research, please cite:
```bibtex
@article{flowamp2024,
title={FlowAMP: Flow-based Antimicrobial Peptide Generation with Conditional Flow Matching},
author={Sun, Edward},
journal={arXiv preprint},
year={2024}
}
```
## License
MIT License - see LICENSE file for details.
## Contact
For questions or collaboration, please contact the authors.