---
language:
- en
tags:
- protein-design
- antimicrobial-peptides
- flow-matching
- esm-2
- pytorch
license: mit
datasets:
- uniprot
- amp-datasets
metrics:
- mic-prediction
- sequence-validity
- diversity
---

# FlowAMP: Flow-based Antimicrobial Peptide Generation

## Model Description

FlowAMP is a flow-based generative model for designing antimicrobial peptides (AMPs). It combines conditional flow matching, which drives high-quality peptide generation, with ESM-2 protein language model embeddings, which ground the generated sequences in a biologically meaningful representation space.

### Architecture

The model consists of several key components:

1. **ESM-2 Encoder**: Uses ESM-2 (esm2_t33_650M_UR50D) to extract 1280-dimensional protein sequence embeddings
2. **Compressor/Decompressor**: Reduces embedding dimensionality by 16x (1280 → 80) for efficient processing
3. **Flow Matcher**: Implements conditional flow matching for generation with time embeddings
4. **CFG Integration**: Classifier-free guidance for controllable generation
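
The compressor/decompressor pair can be sketched as a simple MLP bottleneck. Only the 1280 → 80 reduction comes from this card; the layer sizes, activation, and class names below are illustrative assumptions, not the repository's implementation:

```python
import torch
import torch.nn as nn

class Compressor(nn.Module):
    """Projects 1280-dim ESM-2 embeddings to an 80-dim latent (16x reduction)."""
    def __init__(self, in_dim=1280, out_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(),
                                 nn.Linear(512, out_dim))

    def forward(self, x):
        return self.net(x)

class Decompressor(nn.Module):
    """Maps the 80-dim latent back to the 1280-dim ESM-2 embedding space."""
    def __init__(self, in_dim=80, out_dim=1280):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(),
                                 nn.Linear(512, out_dim))

    def forward(self, z):
        return self.net(z)

emb = torch.randn(4, 50, 1280)   # batch of per-residue ESM-2 embeddings
z = Compressor()(emb)            # (4, 50, 80) latent for the flow matcher
rec = Decompressor()(z)          # (4, 50, 1280) back in embedding space
```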

### Key Features

- **Flow-based Generation**: Uses conditional flow matching for high-quality peptide generation
- **ESM-2 Integration**: Leverages ESM-2 protein language model embeddings for sequence understanding
- **CFG Training**: Implements Classifier-Free Guidance for controllable generation
- **Multi-GPU Training**: Optimized for H100 GPUs with mixed precision training
- **Comprehensive Evaluation**: MIC prediction and antimicrobial activity assessment
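
Classifier-free guidance blends the conditional and unconditional velocity predictions at sampling time. A minimal sketch of the standard formulation (the `model` call signature here is an assumption, not taken from the repository):

```python
import torch

def cfg_velocity(model, x_t, t, cond, cfg_strength):
    """v = v_uncond + s * (v_cond - v_uncond): s=0 is purely unconditional,
    s=1 purely conditional, and s>1 over-emphasises the condition."""
    v_cond = model(x_t, t, cond)
    v_uncond = model(x_t, t, None)  # condition dropped, as in CFG training
    return v_uncond + cfg_strength * (v_cond - v_uncond)
```

The 15% condition dropout during training (see the training configuration) is what makes the unconditional branch `model(x_t, t, None)` meaningful at sampling time.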

## Training

### Training Data

The model was trained on:
- **UniProt Database**: Comprehensive protein sequence database
- **AMP Datasets**: Curated antimicrobial peptide sequences
- **ESM-2 Embeddings**: Pre-computed embeddings for efficient training

### Training Configuration

- **Batch Size**: 96 (optimized for H100)
- **Learning Rate**: 4e-4 with cosine annealing to 2e-4
- **Epochs**: 6000
- **Mixed Precision**: BF16 for H100 optimization
- **CFG Dropout**: 15% for unconditional training
- **Gradient Clipping**: Norm=1.0 for stability
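
A single training step under this configuration might look as follows. The loss is the standard conditional-flow-matching objective (straight-line path, constant target velocity); everything except the hyperparameter values taken from this card is an illustrative assumption:

```python
import torch
import torch.nn as nn

def cfm_training_step(model, optimizer, x1, cond, cfg_dropout=0.15):
    """One conditional flow matching step with CFG condition dropout.
    x1: clean latents (B, D); cond: conditioning input or None."""
    x0 = torch.randn_like(x1)             # noise endpoint of the path
    t = torch.rand(x1.size(0), 1)         # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1           # straight-line interpolation
    target = x1 - x0                      # constant target velocity
    if torch.rand(()).item() < cfg_dropout:
        cond = None                       # 15% unconditional, enables CFG
    loss = nn.functional.mse_loss(model(x_t, t, cond), target)
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # stability
    optimizer.step()
    return loss.item()

# Stand-in velocity model (a real one would use the time embedding and cond):
class TinyVelocity(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_t, t, cond):
        return self.proj(x_t)

model = TinyVelocity()
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=6000, eta_min=2e-4)  # 4e-4 annealed toward 2e-4
loss = cfm_training_step(model, optimizer, torch.randn(4, 8), cond=None)
```

On H100s, the BF16 mixed precision mentioned above would typically wrap the forward and loss computation in `torch.autocast(device_type="cuda", dtype=torch.bfloat16)`.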

### Training Performance

- **Speed**: 31 steps/second on H100 GPU
- **Memory Efficiency**: Mixed precision training
- **Stability**: Gradient clipping and weight decay (0.01)

## Usage

### Basic Generation

```python
from final_flow_model import AMPFlowMatcherCFGConcat
from generate_amps import generate_amps

# Load trained model
model = AMPFlowMatcherCFGConcat.load_from_checkpoint('path/to/checkpoint.pth')

# Generate AMPs with different CFG strengths
sequences_no_cfg = generate_amps(model, num_samples=100, cfg_strength=0.0)
sequences_weak_cfg = generate_amps(model, num_samples=100, cfg_strength=1.0)
sequences_strong_cfg = generate_amps(model, num_samples=100, cfg_strength=2.0)
sequences_very_strong_cfg = generate_amps(model, num_samples=100, cfg_strength=3.0)
```
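
Internally, generation with a flow model amounts to integrating the learned velocity field from noise to data. A hedged sketch of what `generate_amps` plausibly does (the Euler integrator, step count, and shapes here are assumptions for illustration):

```python
import torch

@torch.no_grad()
def sample_flow(model, cond, num_samples, dim, steps=100, cfg_strength=1.0):
    """Euler-integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (data),
    applying classifier-free guidance at every step."""
    x = torch.randn(num_samples, dim)        # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((num_samples, 1), i * dt)
        v_cond = model(x, t, cond)
        v_uncond = model(x, t, None)
        v = v_uncond + cfg_strength * (v_cond - v_uncond)
        x = x + dt * v                       # one Euler step toward the data
    return x  # latents; decompress and decode back to peptide sequences
```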

### Evaluation

```python
from test_generated_peptides import evaluate_generated_peptides

# Evaluate generated sequences for antimicrobial activity
results = evaluate_generated_peptides(sequences)
```

## Performance

### Generation Quality

- **Sequence Validity**: High percentage of valid peptide sequences
- **Diversity**: Good sequence diversity across different CFG strengths
- **Biological Relevance**: ESM-2 embeddings ensure biologically meaningful sequences

### Antimicrobial Activity

- **MIC Prediction**: Integration with Apex model for MIC prediction
- **Activity Assessment**: Comprehensive evaluation of antimicrobial potential
- **CFG Effectiveness**: Measured through controlled generation

## Limitations

- **Sequence Length**: Limited to 50 amino acids maximum
- **Computational Requirements**: Requires GPU for efficient generation
- **Training Data**: Dependent on quality of UniProt and AMP datasets

## Citation

```bibtex
@article{flowamp2024,
  title={FlowAMP: Flow-based Antimicrobial Peptide Generation with Conditional Flow Matching},
  author={Sun, Edward},
  journal={arXiv preprint},
  year={2024}
}
```

## License

MIT License - see LICENSE file for details.