# FlowAMP: Flow-based Antimicrobial Peptide Generation

## Overview

FlowAMP is a flow-based generative model for de novo design of antimicrobial peptides (AMPs). It combines conditional flow matching with ESM-2 protein language model embeddings to generate candidate peptides with improved quality and diversity.

## Key Features

- **Flow-based Generation**: Uses conditional flow matching for high-quality peptide generation
- **ESM-2 Integration**: Leverages ESM-2 protein language model embeddings for sequence understanding
- **CFG Training**: Implements Classifier-Free Guidance for controllable generation
- **Multi-GPU Training**: Optimized for H100 GPUs with mixed precision training
- **Comprehensive Evaluation**: MIC prediction and antimicrobial activity assessment

## Project Structure

```
flow/
β”œβ”€β”€ final_flow_model.py              # Main FlowAMP model architecture
β”œβ”€β”€ final_sequence_encoder.py        # ESM-2 sequence encoding
β”œβ”€β”€ final_sequence_decoder.py        # Sequence decoding and generation
β”œβ”€β”€ compressor_with_embeddings.py    # Embedding compression/decompression
β”œβ”€β”€ cfg_dataset.py                   # CFG dataset and dataloader
β”œβ”€β”€ amp_flow_training_single_gpu_full_data.py  # Single GPU training
β”œβ”€β”€ amp_flow_training_multi_gpu.py   # Multi-GPU training
β”œβ”€β”€ generate_amps.py                 # AMP generation script
β”œβ”€β”€ test_generated_peptides.py       # Evaluation and testing
β”œβ”€β”€ apex/                           # Apex model integration
β”‚   β”œβ”€β”€ trained_models/             # Pre-trained Apex models
β”‚   └── AMP_DL_model_twohead.py     # Apex model architecture
β”œβ”€β”€ normalization_stats.pt          # Preprocessing statistics
└── requirements.yaml               # Dependencies
```

## Model Architecture

The FlowAMP model consists of:

1. **ESM-2 Encoder**: Extracts protein sequence embeddings using ESM-2
2. **Compressor/Decompressor**: Reduces embedding dimensionality for efficiency
3. **Flow Matcher**: Conditional flow matching for generation
4. **CFG Integration**: Classifier-free guidance for controllable generation
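As a rough illustration of step 3: conditional flow matching trains a network to regress the velocity of a straight-line path between a noise sample and a data point (here, a compressed ESM-2 embedding). The helper below is a minimal sketch of that interpolant and its regression target; the name `cfm_pair` is hypothetical and not part of this repository, and the actual training code in `final_flow_model.py` may parameterize the path differently.

```python
import numpy as np

def cfm_pair(x0, x1, t):
    """Straight-line interpolant and target velocity for conditional flow matching.

    x0: noise sample, x1: data point (e.g. a compressed ESM-2 embedding),
    t: scalar in [0, 1]. Hypothetical helper for illustration only.
    """
    xt = (1.0 - t) * x0 + t * x1  # point on the straight path at time t
    ut = x1 - x0                  # constant target velocity along that path
    return xt, ut
```

The model's loss would then be the mean squared error between its predicted velocity at `(xt, t, condition)` and `ut`.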

## Training

### Single GPU Training
```bash
python amp_flow_training_single_gpu_full_data.py
```

### Multi-GPU Training
```bash
bash launch_multi_gpu_training.sh
```

### Key Training Parameters
- **Batch Size**: 96 (optimized for H100)
- **Learning Rate**: 4e-4 with cosine annealing
- **Epochs**: 6000
- **Mixed Precision**: BF16 for H100 optimization
- **CFG Dropout**: 15% for unconditional training
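For reference, a standard cosine-annealing schedule with the hyperparameters above can be written in a few lines. This is a sketch assuming annealing to zero over the full 6000 epochs; the training scripts may use a warmup or a different minimum LR, and the function name `cosine_lr` is hypothetical.

```python
import math

def cosine_lr(epoch, total_epochs=6000, base_lr=4e-4, min_lr=0.0):
    """Cosine-annealed learning rate, matching the 4e-4 / 6000-epoch setup above.

    Hypothetical helper; the repo's scripts may add warmup or a nonzero floor.
    """
    progress = epoch / total_epochs  # fraction of training completed
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```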

## Generation

Generate AMPs with different CFG strengths:

```bash
python generate_amps.py --cfg_strength 0.0    # No CFG
python generate_amps.py --cfg_strength 1.0    # Weak CFG
python generate_amps.py --cfg_strength 2.0    # Strong CFG
python generate_amps.py --cfg_strength 3.0    # Very Strong CFG
```
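Classifier-free guidance combines the model's conditional and unconditional velocity predictions at each sampling step. One common formulation, consistent with `--cfg_strength 0.0` meaning "no guidance", is sketched below; the exact formula used in `generate_amps.py` may differ, and `cfg_velocity` is a hypothetical name.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, cfg_strength):
    """Blend conditional and unconditional velocity fields (one common CFG form).

    cfg_strength 0.0 returns the plain conditional prediction; larger values
    extrapolate further away from the unconditional field.
    """
    return v_cond + cfg_strength * (v_cond - v_uncond)
```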

## Evaluation

### MIC Prediction
The model includes integration with Apex for MIC (Minimum Inhibitory Concentration) prediction:

```bash
python test_generated_peptides.py
```

### Performance Metrics
- **Generation Quality**: Evaluated using sequence diversity and validity
- **Antimicrobial Activity**: Predicted using Apex model integration
- **CFG Effectiveness**: Measured through controlled generation
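The validity and diversity metrics above can be approximated with simple sequence-level checks. The sketch below treats a sequence as valid if it uses only the 20 standard amino-acid letters and measures diversity as the fraction of unique sequences; the actual evaluation in `test_generated_peptides.py` may use stricter criteria (e.g. length limits or pairwise identity), and both function names are hypothetical.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard one-letter codes

def validity(seqs):
    """Fraction of sequences built only from standard amino acids."""
    valid = [s for s in seqs if s and set(s) <= AMINO_ACIDS]
    return len(valid) / len(seqs)

def sequence_diversity(seqs):
    """Fraction of unique sequences — a simple diversity proxy."""
    return len(set(seqs)) / len(seqs)
```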

## Results

### Training Performance
- **Optimized for H100**: 31 steps/second with batch size 96
- **Mixed Precision**: BF16 training for memory efficiency
- **Gradient Clipping**: Stable training with norm=1.0

### Generation Results
- **Sequence Validity**: High percentage of valid peptide sequences
- **Diversity**: Good sequence diversity across different CFG strengths
- **Antimicrobial Potential**: Predicted MIC values for generated sequences

## Dependencies

Key dependencies include:
- PyTorch 2.0+
- Transformers (for ESM-2)
- Wandb (optional logging)
- Apex (for MIC prediction)

See `requirements.yaml` for complete dependency list.

## Usage Examples

### Basic AMP Generation
```python
from final_flow_model import AMPFlowMatcherCFGConcat
from generate_amps import generate_amps

# Load trained model
model = AMPFlowMatcherCFGConcat.load_from_checkpoint('path/to/checkpoint.pth')

# Generate AMPs
sequences = generate_amps(model, num_samples=100, cfg_strength=1.0)
```

### Evaluation
```python
from test_generated_peptides import evaluate_generated_peptides

# Evaluate generated sequences
results = evaluate_generated_peptides(sequences)
```

## Research Impact

This work contributes to:
- **Flow-based Protein Design**: Novel application of flow matching to peptide generation
- **Conditional Generation**: CFG integration for controllable AMP design
- **ESM-2 Integration**: Leveraging protein language models for sequence understanding
- **Antimicrobial Discovery**: Automated design of potential therapeutic peptides

## Citation

If you use this code in your research, please cite:

```bibtex
@article{flowamp2024,
  title={FlowAMP: Flow-based Antimicrobial Peptide Generation with Conditional Flow Matching},
  author={Sun, Edward},
  journal={arXiv preprint},
  year={2024}
}
```

## License

MIT License - see LICENSE file for details.

## Contact

For questions or collaboration, please contact the authors.