littleworth committed
Commit f17375e · verified · 1 Parent(s): 247b6ba

Initial upload: ESM-2 based stability predictor

Files changed (4)
  1. README.md +213 -0
  2. config.json +26 -0
  3. stability_predictor.pt +3 -0
  4. stability_predictor.py +244 -0
README.md ADDED
@@ -0,0 +1,213 @@
---
license: mit
tags:
- biology
- peptide
- protein
- stability
- esm2
- thermostability
- drug-discovery
- pytorch
language:
- en
library_name: pytorch
pipeline_tag: text-classification
datasets:
- FLIP
metrics:
- r2
---

# Peptide Stability Predictor

Predict the thermal stability of peptide/protein sequences using ESM-2 embeddings.

## Model Description

This model predicts the thermal stability (a melting-temperature proxy) of peptide and protein sequences using frozen ESM-2 embeddings passed through a trained MLP regression head. It was trained on the FLIP Meltome benchmark dataset.

### Architecture

| Component | Details |
|-----------|---------|
| Backbone | ESM-2 (esm2_t6_8M_UR50D, 8M parameters, frozen) |
| Embedding dim | 320 |
| MLP Head | Linear(320→256) → ReLU → Dropout(0.1) → Linear(256→128) → ReLU → Dropout(0.1) → Linear(128→1) |
| Output | Normalized stability score |

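For reference, the MLP head in the table above is a small feed-forward network. A minimal PyTorch sketch of that head (the actual implementation ships in `stability_predictor.py` in this repository):

```python
import torch.nn as nn

# Regression head from the architecture table:
# 320 -> 256 -> 128 -> 1, with ReLU and Dropout(0.1) after each hidden layer.
head = nn.Sequential(
    nn.Linear(320, 256), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(128, 1),
)
```
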
### Training Details

| Property | Value |
|----------|-------|
| Dataset | FLIP Meltome benchmark |
| Validation R² | 0.616 |
| Epochs | 16 (early stopped from 30) |
| Learning rate | 1e-3 |
| Batch size | 8 |
| Dropout | 0.1 |

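The training script is not part of this repository, so the following is only an illustrative sketch of how the head could be fit with the hyperparameters above (frozen ESM-2, Adam at 1e-3, MSE loss, early stopping); the data loaders and the exact stopping criterion are assumptions:

```python
import torch
import torch.nn as nn

def train_head(model, train_loader, val_loader, epochs=30, lr=1e-3, patience=15):
    """Train only the MLP head; the ESM-2 backbone stays frozen."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    optimizer = torch.optim.Adam(model.head.parameters(), lr=lr)
    criterion = nn.MSELoss()

    best_val, bad_epochs = float("inf"), 0
    for epoch in range(epochs):
        model.head.train()
        # hypothetical loader yielding (sequences, normalized stability targets)
        for sequences, targets in train_loader:
            optimizer.zero_grad()
            preds = model(list(sequences))
            loss = criterion(preds, targets.to(device).float())
            loss.backward()
            optimizer.step()

        # Early stopping on validation loss here; the released checkpoint reports validation R².
        model.head.eval()
        with torch.no_grad():
            val_loss = sum(
                criterion(model(list(s)), t.to(device).float()).item()
                for s, t in val_loader
            ) / max(len(val_loader), 1)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
```
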
## Quick Start

### Requirements

```bash
pip install torch fair-esm huggingface_hub
```

### Usage

```python
import torch
from huggingface_hub import hf_hub_download

# Download model checkpoint
checkpoint_path = hf_hub_download(
    repo_id="littleworth/peptide-stability-predictor",
    filename="stability_predictor.pt"
)

# Load checkpoint
checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)

# Download model class
model_file = hf_hub_download(
    repo_id="littleworth/peptide-stability-predictor",
    filename="stability_predictor.py"
)

# Import model class
import importlib.util
spec = importlib.util.spec_from_file_location("stability_predictor", model_file)
sp_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(sp_module)
StabilityPredictor = sp_module.StabilityPredictor

# Initialize model (this will download ESM-2 on first run)
model = StabilityPredictor(esm_model="esm2_t6_8M_UR50D")

# Load trained weights (only the MLP head, ESM-2 is frozen)
# Filter to only load head weights
head_state_dict = {k: v for k, v in checkpoint['model_state_dict'].items()
                   if k.startswith('head.')}
model.head.load_state_dict({k.replace('head.', ''): v for k, v in head_state_dict.items()})
model.eval()

# Predict stability
sequences = [
    "MKTLYFLGASV",
    "AEITVKLSPGMNCF",
    "GFLWKASTDERIPMNCVYH",
]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

with torch.no_grad():
    scores = model(sequences)

print("Stability predictions:")
for seq, score in zip(sequences, scores.tolist()):
    print(f"  {seq}: {score:.4f}")
```

### Alternative: Using predict() method

```python
# Using the convenience method (returns Python list)
scores = model.predict(sequences)
print(scores)  # [0.7234, 0.6521, 0.5892]
```

## Example Output

```
Stability predictions:
  MKTLYFLGASV: 0.7234
  AEITVKLSPGMNCF: 0.6521
  GFLWKASTDERIPMNCVYH: 0.5892
```

## Files in This Repository

| File | Description |
|------|-------------|
| `stability_predictor.pt` | Model checkpoint (MLP head weights) |
| `stability_predictor.py` | Model architecture definition |
| `config.json` | Model configuration |

## Checkpoint Contents

```python
{
    'epoch': 16,
    'model_state_dict': {...},  # MLP head weights
    'optimizer_state_dict': {...},
    'val_r2': 0.616,
    'config': {
        'esm_model': 'esm2_t6_8M_UR50D',
        'hidden_dims': [256, 128],
        'dropout': 0.1
    }
}
```

## Intended Use

- **Primary use**: Scoring peptide/protein stability for drug discovery
- **Secondary uses**:
  - Filtering generated peptide candidates (see the sketch after this list)
  - Research on protein thermostability
  - Feature engineering for downstream ML models

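As a sketch of the candidate-filtering use case listed above, predicted scores can be used to rank or cut a batch of generated peptides; the threshold here is purely illustrative, and `model` is the predictor loaded in the Quick Start:

```python
# Hypothetical filtering step for generated candidates.
candidates = ["MKTLYFLGASV", "AEITVKLSPGMNCF", "GFLWKASTDERIPMNCVYH"]
scores = model.predict(candidates)  # normalized stability scores

threshold = 0.6  # illustrative cutoff, not a validated value
stable_candidates = [seq for seq, s in zip(candidates, scores) if s >= threshold]
print(stable_candidates)
```
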
## Limitations

- Trained on FLIP Meltome data, which may not generalize to all protein families
- Outputs normalized scores, not absolute melting temperatures
- Predictions are computational estimates requiring experimental validation
- Best accuracy for sequences similar to the training distribution

## Performance

| Metric | Value |
|--------|-------|
| Validation R² | 0.616 |
| Training epochs | 16 |
| Early stopping patience | 15 |

## Dependencies

- PyTorch >= 2.0
- fair-esm (Facebook's ESM library)
- huggingface_hub

## Ethical Considerations

This model provides computational predictions of protein stability. Predictions should be validated experimentally before making decisions about therapeutic development. The model does not guarantee accuracy for sequences outside its training distribution.

## Training Data

- **FLIP Meltome benchmark**: a dataset of protein sequences with measured thermal stability values
- Training/validation split following FLIP benchmark protocols

## Citation

```bibtex
@software{peptide_stability_2025,
  author = {Wijaya, Edward},
  title = {Peptide Stability Predictor},
  year = {2025},
  url = {https://huggingface.co/littleworth/peptide-stability-predictor},
  note = {ESM-2 based thermal stability prediction}
}
```

## References

- [FLIP Benchmark](https://github.com/J-SNACKKB/FLIP) - Dallago et al., 2021
- [ESM-2](https://github.com/facebookresearch/esm) - Lin et al., 2022
- [ESM-2 Paper](https://www.science.org/doi/10.1126/science.ade2574) - Lin et al., Science 2023

## License

MIT License
config.json ADDED
@@ -0,0 +1,26 @@
{
  "model_type": "stability_predictor",
  "architecture": "esm2_mlp",
  "esm_model": "esm2_t6_8M_UR50D",
  "esm_params": 8000000,
  "embed_dim": 320,
  "repr_layer": 6,
  "freeze_esm": true,
  "head": {
    "hidden_dims": [256, 128],
    "dropout": 0.1,
    "activation": "relu"
  },
  "training": {
    "dataset": "FLIP_meltome",
    "epochs": 16,
    "learning_rate": 0.001,
    "batch_size": 8,
    "early_stopping_patience": 15,
    "validation_r2": 0.616
  },
  "output": {
    "type": "regression",
    "description": "Normalized thermal stability score"
  }
}
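
The fields above mirror the constructor arguments of `StabilityPredictor`. A minimal sketch of reading this file to rebuild the architecture (it does not load the trained head weights, which the README's Usage section covers; `StabilityPredictor` is imported as shown there):

```python
import json
from huggingface_hub import hf_hub_download

# Fetch and parse the configuration shipped with the repository.
config_path = hf_hub_download(
    repo_id="littleworth/peptide-stability-predictor",
    filename="config.json",
)
with open(config_path) as f:
    config = json.load(f)

# Rebuild the model from the stored hyperparameters.
model = StabilityPredictor(
    esm_model=config["esm_model"],
    hidden_dims=config["head"]["hidden_dims"],
    dropout=config["head"]["dropout"],
    freeze_esm=config["freeze_esm"],
)
```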
stability_predictor.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:67d1eb517578cf141bd113b1491f61176ff0beb63181d5879f2490d220c534d4
size 31483165
stability_predictor.py ADDED
@@ -0,0 +1,244 @@
"""ESM-2 based stability predictor for peptide/protein sequences.

This module implements a stability predictor using ESM-2 embeddings as input
to an MLP regression head. The model predicts thermal stability (melting
temperature) based on sequence information.

Architecture:
    Input: Peptide/protein sequence

    ESM-2 (frozen): Extract mean-pooled embeddings

    MLP: embedding_dim → hidden_dims → 1

    Output: Stability score (normalized)
"""

import logging
from typing import List, Optional, Union

import torch
import torch.nn as nn

logger = logging.getLogger(__name__)


class StabilityPredictor(nn.Module):
    """ESM-2 based stability predictor.

    Uses frozen ESM-2 embeddings as input to an MLP head for predicting
    thermal stability. The model is designed to be trained on datasets
    like the FLIP stability (meltome) task.

    Attributes:
        esm: ESM-2 language model (frozen)
        alphabet: ESM-2 tokenizer
        head: MLP regression head
        embed_dim: Dimension of ESM-2 embeddings
        repr_layer: Which layer to extract representations from
    """

    def __init__(
        self,
        esm_model: str = "esm2_t6_8M_UR50D",
        hidden_dims: Optional[List[int]] = None,
        dropout: float = 0.1,
        freeze_esm: bool = True,
        device: Optional[str] = None,
    ):
        """Initialize stability predictor.

        Args:
            esm_model: Name of ESM-2 model to use. Options:
                - esm2_t6_8M_UR50D (8M params, 320 dim, fastest)
                - esm2_t12_35M_UR50D (35M params, 480 dim)
                - esm2_t33_650M_UR50D (650M params, 1280 dim, most accurate)
            hidden_dims: Hidden layer dimensions for MLP head.
                Default: [256, 128]
            dropout: Dropout rate for MLP layers
            freeze_esm: Whether to freeze ESM-2 parameters
            device: Device to load model on. Auto-detected if None.
        """
        super().__init__()

        if hidden_dims is None:
            hidden_dims = [256, 128]

        self.esm_model_name = esm_model
        self.freeze_esm = freeze_esm
        self.device = device or ('cuda' if torch.cuda.is_available() else 'cpu')

        # Load ESM-2
        self._load_esm(esm_model)

        if freeze_esm:
            for param in self.esm.parameters():
                param.requires_grad = False
            self.esm.eval()

        # Build MLP head
        layers = []
        in_dim = self.embed_dim
        for h_dim in hidden_dims:
            layers.extend([
                nn.Linear(in_dim, h_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
            ])
            in_dim = h_dim
        layers.append(nn.Linear(in_dim, 1))

        self.head = nn.Sequential(*layers)

        logger.info(f"StabilityPredictor initialized with {esm_model}, "
                    f"hidden_dims={hidden_dims}, freeze_esm={freeze_esm}")

    def _load_esm(self, esm_model: str):
        """Load ESM-2 model and set embedding dimensions."""
        import esm

        logger.info(f"Loading ESM-2 model: {esm_model}")

        if esm_model == "esm2_t6_8M_UR50D":
            self.esm, self.alphabet = esm.pretrained.esm2_t6_8M_UR50D()
            self.embed_dim = 320
            self.repr_layer = 6
        elif esm_model == "esm2_t12_35M_UR50D":
            self.esm, self.alphabet = esm.pretrained.esm2_t12_35M_UR50D()
            self.embed_dim = 480
            self.repr_layer = 12
        elif esm_model == "esm2_t33_650M_UR50D":
            self.esm, self.alphabet = esm.pretrained.esm2_t33_650M_UR50D()
            self.embed_dim = 1280
            self.repr_layer = 33
        else:
            raise ValueError(f"Unknown ESM model: {esm_model}")

        self.batch_converter = self.alphabet.get_batch_converter()

    def get_embeddings(self, sequences: List[str]) -> torch.Tensor:
        """Extract ESM-2 embeddings for sequences.

        Args:
            sequences: List of amino acid sequences

        Returns:
            Tensor of shape (batch_size, embed_dim) with mean-pooled embeddings
        """
        # Prepare data for ESM
        data = [(f"seq{i}", seq) for i, seq in enumerate(sequences)]
        _, _, batch_tokens = self.batch_converter(data)
        batch_tokens = batch_tokens.to(next(self.esm.parameters()).device)

        # Forward pass through ESM-2
        with torch.no_grad() if self.freeze_esm else torch.enable_grad():
            results = self.esm(
                batch_tokens,
                repr_layers=[self.repr_layer],
                return_contacts=False
            )

        # Mean pool over sequence positions (excluding BOS and EOS tokens)
        embeddings = []
        for i, seq in enumerate(sequences):
            seq_len = len(seq)
            # Tokens are: [BOS, seq..., EOS, PAD...]
            # We want indices 1 to seq_len+1 (exclusive of EOS)
            emb = results["representations"][self.repr_layer][i, 1:seq_len + 1, :]
            embeddings.append(emb.mean(dim=0))

        return torch.stack(embeddings)

    def forward(self, sequences: Union[str, List[str]]) -> torch.Tensor:
        """Predict stability for sequences.

        Args:
            sequences: Single sequence or list of sequences

        Returns:
            Tensor of shape (batch_size,) with stability predictions
        """
        if isinstance(sequences, str):
            sequences = [sequences]

        embeddings = self.get_embeddings(sequences)
        predictions = self.head(embeddings).squeeze(-1)

        return predictions

    def predict(self, sequences: Union[str, List[str]]) -> List[float]:
        """Predict stability scores (convenience method).

        Args:
            sequences: Single sequence or list of sequences

        Returns:
            List of stability scores
        """
        self.eval()
        with torch.no_grad():
            preds = self.forward(sequences)
        return preds.cpu().tolist()

    def to(self, device: Union[str, torch.device]) -> 'StabilityPredictor':
        """Move model to device."""
        self.device = str(device)
        self.esm = self.esm.to(device)
        self.head = self.head.to(device)
        return super().to(device)


class BindingPredictor(StabilityPredictor):
    """ESM-2 based binding predictor.

    Same architecture as StabilityPredictor but intended for binding
    affinity prediction. Currently only supports binary classification
    (binder vs non-binder) due to Propedia dataset limitations.

    For regression tasks, additional data with continuous binding affinities
    (e.g., from PDBbind) would be needed.
    """

    def __init__(
        self,
        esm_model: str = "esm2_t6_8M_UR50D",
        hidden_dims: Optional[List[int]] = None,
        dropout: float = 0.1,
        freeze_esm: bool = True,
        device: Optional[str] = None,
        use_sigmoid: bool = True,
    ):
        """Initialize binding predictor.

        Args:
            esm_model: Name of ESM-2 model to use
            hidden_dims: Hidden layer dimensions for MLP head
            dropout: Dropout rate
            freeze_esm: Whether to freeze ESM-2
            device: Device to load model on
            use_sigmoid: Whether to apply sigmoid for binary classification
        """
        super().__init__(
            esm_model=esm_model,
            hidden_dims=hidden_dims,
            dropout=dropout,
            freeze_esm=freeze_esm,
            device=device,
        )
        self.use_sigmoid = use_sigmoid
        logger.info(f"BindingPredictor initialized, use_sigmoid={use_sigmoid}")

    def forward(self, sequences: Union[str, List[str]]) -> torch.Tensor:
        """Predict binding score for sequences.

        Args:
            sequences: Single sequence or list of sequences

        Returns:
            Tensor of shape (batch_size,) with binding predictions.
            If use_sigmoid=True, values are in [0, 1].
        """
        preds = super().forward(sequences)
        if self.use_sigmoid:
            preds = torch.sigmoid(preds)
        return preds
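
`BindingPredictor` is defined above but not covered by the README's Quick Start, and no trained binding checkpoint is distributed in this repository. A minimal usage sketch based only on the class definition (outputs from a freshly initialized head are not meaningful predictions):

```python
# Illustrative only: instantiate the binding head and score a few sequences.
# Without trained weights the scores below are effectively random.
binder = BindingPredictor(esm_model="esm2_t6_8M_UR50D", use_sigmoid=True)
binder.eval()

probs = binder.predict(["MKTLYFLGASV", "AEITVKLSPGMNCF"])
print(probs)  # values in [0, 1] when use_sigmoid=True
```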