---
language:
  - en
tags:
  - protein-design
  - antimicrobial-peptides
  - flow-matching
  - esm-2
  - pytorch
license: mit
datasets:
  - uniprot
  - amp-datasets
metrics:
  - mic-prediction
  - sequence-validity
  - diversity
---

# FlowAMP: Flow-based Antimicrobial Peptide Generation

## Model Description

FlowAMP is a flow-based generative model for designing antimicrobial peptides (AMPs). It combines conditional flow matching, which enables high-quality peptide generation, with ESM-2 protein language model embeddings, which ground the generated sequences in biologically meaningful representations.

## Architecture

The model consists of four key components:

1. **ESM-2 Encoder**: Uses ESM-2 (`esm2_t33_650M_UR50D`) to extract 1280-dimensional protein sequence embeddings
2. **Compressor/Decompressor**: Reduces embedding dimensionality by 16x (1280 → 80) for efficient processing
3. **Flow Matcher**: Implements conditional flow matching with learned time embeddings
4. **CFG Integration**: Classifier-free guidance for controllable generation
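The components above can be sketched as minimal PyTorch modules. Everything beyond the stated 1280 → 80 compression (layer counts, hidden sizes, class names) is an illustrative assumption, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class CompressorDecompressor(nn.Module):
    """Illustrative 16x compression of ESM-2 embeddings (1280 -> 80) and back."""
    def __init__(self, dim=1280, compressed_dim=80):
        super().__init__()
        self.compress = nn.Linear(dim, compressed_dim)
        self.decompress = nn.Linear(compressed_dim, dim)

    def forward(self, x):
        z = self.compress(x)          # (batch, seq, 80) latent
        return self.decompress(z), z  # reconstruction and latent

class FlowVelocityNet(nn.Module):
    """Sketch of a velocity field v(x_t, t) with a learned time embedding."""
    def __init__(self, dim=80, hidden=256):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, dim))
        self.net = nn.Sequential(
            nn.Linear(dim * 2, hidden), nn.SiLU(), nn.Linear(hidden, dim))

    def forward(self, x_t, t):
        # t: (batch,) in [0, 1]; broadcast its embedding onto every position
        temb = self.time_mlp(t[:, None])        # (batch, dim)
        temb = temb[:, None, :].expand_as(x_t)  # (batch, seq, dim)
        return self.net(torch.cat([x_t, temb], dim=-1))
```

In the full model the velocity network would additionally condition on the compressed ESM-2 embedding (e.g. by concatenation, as the class name `AMPFlowMatcherCFGConcat` suggests); the sketch omits that input for brevity.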

## Key Features

- **Flow-based Generation**: Conditional flow matching for high-quality peptide generation
- **ESM-2 Integration**: ESM-2 protein language model embeddings for sequence understanding
- **CFG Training**: Classifier-free guidance for controllable generation
- **Multi-GPU Training**: Optimized for H100 GPUs with mixed-precision training
- **Comprehensive Evaluation**: MIC prediction and antimicrobial activity assessment

## Training

### Training Data

The model was trained on:

- **UniProt Database**: Comprehensive protein sequence database
- **AMP Datasets**: Curated antimicrobial peptide sequences
- **ESM-2 Embeddings**: Pre-computed embeddings for efficient training

### Training Configuration

- **Batch Size**: 96 (tuned for H100)
- **Learning Rate**: 4e-4 with cosine annealing to 2e-4
- **Epochs**: 6000
- **Mixed Precision**: BF16 for H100
- **CFG Dropout**: 15% of samples trained unconditionally
- **Gradient Clipping**: Max norm 1.0 for stability
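Under those hyperparameters, the optimizer and scheduler setup might look like the following sketch. The placeholder model and the CPU autocast device are assumptions for a self-contained example (the actual run uses the flow model on H100 GPUs); `CosineAnnealingLR` with `eta_min=2e-4` matches the stated 4e-4 → 2e-4 schedule:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(80, 80)  # placeholder for the flow model
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=6000, eta_min=2e-4)

for epoch in range(3):  # 6000 epochs in the actual run
    x = torch.randn(96, 80)  # batch size 96
    # BF16 mixed precision (device_type="cuda" on H100)
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    # clip gradients to max norm 1.0 for stability
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```

The 15% CFG dropout would be applied per batch during training by replacing the conditioning embedding with a null token, so the same network learns both conditional and unconditional velocity fields.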

### Training Performance

- **Speed**: 31 steps/second on an H100 GPU
- **Memory Efficiency**: BF16 mixed-precision training
- **Stability**: Gradient clipping (max norm 1.0) and weight decay (0.01)

## Usage

### Basic Generation

```python
from final_flow_model import AMPFlowMatcherCFGConcat
from generate_amps import generate_amps

# Load the trained model
model = AMPFlowMatcherCFGConcat.load_from_checkpoint('path/to/checkpoint.pth')

# Generate AMPs at different CFG strengths
sequences_no_cfg = generate_amps(model, num_samples=100, cfg_strength=0.0)
sequences_weak_cfg = generate_amps(model, num_samples=100, cfg_strength=1.0)
sequences_strong_cfg = generate_amps(model, num_samples=100, cfg_strength=2.0)
sequences_very_strong_cfg = generate_amps(model, num_samples=100, cfg_strength=3.0)
```
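At sampling time, classifier-free guidance blends conditional and unconditional velocity predictions. The sketch below shows the standard CFG rule with a plain Euler integrator; it is a generic illustration under that assumption, not necessarily the exact code inside `generate_amps`:

```python
import torch

def cfg_velocity(model_cond, model_uncond, x_t, t, cfg_strength):
    """Standard CFG blend: v = v_uncond + s * (v_cond - v_uncond).
    s = 0 reduces to the unconditional model; s = 1 is purely conditional."""
    v_c = model_cond(x_t, t)
    v_u = model_uncond(x_t, t)
    return v_u + cfg_strength * (v_c - v_u)

def euler_sample(model_cond, model_uncond, x0, steps=50, cfg_strength=2.0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + dt * cfg_velocity(model_cond, model_uncond, x, t, cfg_strength)
    return x
```

Higher `cfg_strength` pushes samples further toward the conditional distribution, typically trading diversity for conditional fidelity.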

### Evaluation

```python
from test_generated_peptides import evaluate_generated_peptides

# Evaluate generated sequences for antimicrobial activity
results = evaluate_generated_peptides(sequences)
```

## Performance

### Generation Quality

- **Sequence Validity**: High fraction of generated sequences are valid peptides
- **Diversity**: Sequences remain diverse across CFG strengths
- **Biological Relevance**: ESM-2 embeddings encourage biologically meaningful sequences
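Sequence validity can be checked with a simple canonical amino-acid alphabet filter. This is a generic sketch (function names and the length cap are taken from the stated 50-residue limit), not the project's actual evaluation code:

```python
# The 20 canonical amino acids in one-letter code
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_peptide(seq, max_len=50):
    """A sequence is valid if non-empty, within the length cap,
    and composed only of canonical amino acids."""
    return 0 < len(seq) <= max_len and set(seq) <= VALID_AA

def validity_rate(seqs):
    """Fraction of sequences in a batch that are valid peptides."""
    return sum(is_valid_peptide(s) for s in seqs) / len(seqs)
```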

### Antimicrobial Activity

- **MIC Prediction**: Integrates the Apex model for minimum inhibitory concentration (MIC) prediction
- **Activity Assessment**: Comprehensive evaluation of antimicrobial potential
- **CFG Effectiveness**: Measured through controlled generation at varying guidance strengths

## Limitations

- **Sequence Length**: Limited to a maximum of 50 amino acids
- **Computational Requirements**: Requires a GPU for efficient generation
- **Training Data**: Dependent on the quality of the UniProt and AMP datasets

## Citation

```bibtex
@article{flowamp2024,
  title={FlowAMP: Flow-based Antimicrobial Peptide Generation with Conditional Flow Matching},
  author={Sun, Edward},
  journal={arXiv preprint},
  year={2024}
}
```

## License

MIT License - see the LICENSE file for details.