# FlowAMP: Flow-based Antimicrobial Peptide Generation
## Overview
FlowAMP is a novel flow-based generative model for designing antimicrobial peptides (AMPs) using conditional flow matching and ESM-2 protein language model embeddings. This project implements a state-of-the-art approach for de novo AMP design with improved generation quality and diversity.
## Key Features
- **Flow-based Generation**: Uses conditional flow matching for high-quality peptide generation
- **ESM-2 Integration**: Leverages ESM-2 protein language model embeddings for sequence understanding
- **CFG Training**: Implements Classifier-Free Guidance for controllable generation
- **Multi-GPU Training**: Optimized for H100 GPUs with mixed precision training
- **Comprehensive Evaluation**: MIC prediction and antimicrobial activity assessment
## Project Structure
```
flow/
├── final_flow_model.py                         # Main FlowAMP model architecture
├── final_sequence_encoder.py                   # ESM-2 sequence encoding
├── final_sequence_decoder.py                   # Sequence decoding and generation
├── compressor_with_embeddings.py               # Embedding compression/decompression
├── cfg_dataset.py                              # CFG dataset and dataloader
├── amp_flow_training_single_gpu_full_data.py   # Single GPU training
├── amp_flow_training_multi_gpu.py              # Multi-GPU training
├── generate_amps.py                            # AMP generation script
├── test_generated_peptides.py                  # Evaluation and testing
├── apex/                                       # Apex model integration
│   ├── trained_models/                         # Pre-trained Apex models
│   └── AMP_DL_model_twohead.py                 # Apex model architecture
├── normalization_stats.pt                      # Preprocessing statistics
└── requirements.yaml                           # Dependencies
```
## Model Architecture
The FlowAMP model consists of:
1. **ESM-2 Encoder**: Extracts protein sequence embeddings using ESM-2
2. **Compressor/Decompressor**: Reduces embedding dimensionality for efficiency
3. **Flow Matcher**: Conditional flow matching for generation
4. **CFG Integration**: Classifier-free guidance for controllable generation
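The flow-matching core (component 3 above) can be illustrated with a minimal, self-contained sketch. The network and function names below are placeholders, not the actual FlowAMP classes: the model regresses the velocity field along a straight path from noise to data, conditioned on a sequence embedding.

```python
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Toy velocity-field model (stand-in for the real FlowAMP architecture)."""
    def __init__(self, dim=16, cond_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 64), nn.SiLU(), nn.Linear(64, dim)
        )

    def forward(self, x_t, t, cond):
        # Concatenate noisy sample, timestep, and condition embedding
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(model, x1, cond):
    """Conditional flow matching: regress the constant velocity x1 - x0
    along a linear interpolation between noise x0 and data x1."""
    x0 = torch.randn_like(x1)          # noise endpoint
    t = torch.rand(x1.shape[0], 1)     # uniform timestep in [0, 1)
    x_t = (1 - t) * x0 + t * x1        # point on the straight path
    target_v = x1 - x0                 # target velocity along that path
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()
```

At sampling time, the learned velocity field is integrated from noise to a data point (e.g. with a simple Euler loop).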
## Training
### Single GPU Training
```bash
python amp_flow_training_single_gpu_full_data.py
```
### Multi-GPU Training
```bash
bash launch_multi_gpu_training.sh
```
### Key Training Parameters
- **Batch Size**: 96 (optimized for H100)
- **Learning Rate**: 4e-4 with cosine annealing
- **Epochs**: 6000
- **Mixed Precision**: BF16 for H100 optimization
- **CFG Dropout**: 15% for unconditional training
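The CFG dropout above means that during training, condition embeddings are randomly replaced (here, zeroed) with probability 0.15 so the model also learns an unconditional velocity field. A minimal sketch, with an assumed helper name and zeroing as the drop mechanism:

```python
import torch

def apply_cfg_dropout(cond, p_drop=0.15):
    """Zero out each sample's condition embedding with probability p_drop
    (illustrative; the actual drop token/mechanism may differ)."""
    keep = (torch.rand(cond.shape[0], 1) > p_drop).float()
    return cond * keep
```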
## Generation
Generate AMPs with different CFG strengths:
```bash
python generate_amps.py --cfg_strength 0.0 # No CFG
python generate_amps.py --cfg_strength 1.0 # Weak CFG
python generate_amps.py --cfg_strength 2.0 # Strong CFG
python generate_amps.py --cfg_strength 3.0 # Very Strong CFG
```
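The exact guidance convention used in `generate_amps.py` is not shown here; a common formulation (assumed, and consistent with the labels above where 0.0 means no guidance) extrapolates the conditional velocity away from the unconditional one at each integration step:

```python
import torch

def cfg_velocity(v_cond, v_uncond, strength):
    """Classifier-free guidance on the velocity field: strength 0.0 returns
    the plain conditional prediction; larger values push further toward
    the condition (a common convention, assumed here)."""
    return v_cond + strength * (v_cond - v_uncond)
```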
## Evaluation
### MIC Prediction
The project integrates the bundled Apex model (an antimicrobial-activity predictor, not NVIDIA Apex) for MIC (Minimum Inhibitory Concentration) prediction:
```bash
python test_generated_peptides.py
```
### Performance Metrics
- **Generation Quality**: Evaluated using sequence diversity and validity
- **Antimicrobial Activity**: Predicted using Apex model integration
- **CFG Effectiveness**: Measured through controlled generation
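The validity and diversity metrics above can be computed with simple proxies. The helpers below are illustrative sketches, not the functions in `test_generated_peptides.py`: validity checks membership in the 20-canonical-amino-acid alphabet, and diversity is the fraction of unique sequences.

```python
def sequence_validity(seqs, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Fraction of non-empty sequences built only from canonical amino acids."""
    valid = [s for s in seqs if s and all(c in alphabet for c in s)]
    return len(valid) / max(len(seqs), 1)

def sequence_diversity(seqs):
    """Fraction of unique sequences (a simple diversity proxy)."""
    return len(set(seqs)) / max(len(seqs), 1)
```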
## Results
### Training Performance
- **Optimized for H100**: 31 steps/second with batch size 96
- **Mixed Precision**: BF16 training for memory efficiency
- **Gradient Clipping**: Stable training with norm=1.0
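A single training step combining the pieces above (BF16 autocast and gradient clipping at norm 1.0) might look like the following sketch; `train_step` and its signature are illustrative, not the repo's actual training loop.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x, y, device_type="cuda"):
    """One optimizer step with BF16 autocast and grad-norm clipping
    (illustrative sketch; pass device_type="cpu" off-GPU)."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        loss = F.mse_loss(model(x), y)   # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```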
### Generation Results
- **Sequence Validity**: High percentage of valid peptide sequences
- **Diversity**: Good sequence diversity across different CFG strengths
- **Antimicrobial Potential**: Predicted MIC values for generated sequences
## Dependencies
Key dependencies include:
- PyTorch 2.0+
- Transformers (for ESM-2)
- Wandb (optional logging)
- Apex (for MIC prediction)
See `requirements.yaml` for complete dependency list.
## Usage Examples
### Basic AMP Generation
```python
from final_flow_model import AMPFlowMatcherCFGConcat
from generate_amps import generate_amps
# Load trained model
model = AMPFlowMatcherCFGConcat.load_from_checkpoint('path/to/checkpoint.pth')
# Generate AMPs
sequences = generate_amps(model, num_samples=100, cfg_strength=1.0)
```
### Evaluation
```python
from test_generated_peptides import evaluate_generated_peptides
# Evaluate generated sequences
results = evaluate_generated_peptides(sequences)
```
## Research Impact
This work contributes to:
- **Flow-based Protein Design**: Novel application of flow matching to peptide generation
- **Conditional Generation**: CFG integration for controllable AMP design
- **ESM-2 Integration**: Leveraging protein language models for sequence understanding
- **Antimicrobial Discovery**: Automated design of potential therapeutic peptides
## Citation
If you use this code in your research, please cite:
```bibtex
@article{flowamp2024,
  title={FlowAMP: Flow-based Antimicrobial Peptide Generation with Conditional Flow Matching},
  author={Sun, Edward},
  journal={arXiv preprint},
  year={2024}
}
```
## License
MIT License - see LICENSE file for details.
## Contact
For questions or collaboration, please contact the authors.