# GLP-ESM2-650M-Layer17

A Generative Latent Prior (GLP) model trained on ESM2-650M Layer 17 activations from UniRef50 protein sequences. GLP learns the distribution of protein representations via flow matching, enabling on-manifold projection during protein sequence steering.

## Model Details

| Property | Value |
|---|---|
| Base PLM | ESM2-650M (`esm2_t33_650M_UR50D`) |
| Extraction Layer | 17 (of 33) |
| Input Dimension | 1280 |
| Denoiser Architecture | 6-layer SwiGLU MLP, `d_model=2560`, `d_mlp=5120` |
| Denoiser Parameters | ~335M |
| Training Data | UniRef50 (~58M sequences, offline extraction) |
| Training | 1 epoch, `batch_size=16384`, `lr=5e-5`, cosine schedule |
| Framework | Flow matching (`FlowMatchEulerDiscreteScheduler`) |
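
For reference, the denoiser's shape can be sketched as a stack of residual SwiGLU MLP blocks. This is an illustrative reconstruction from the table above, not the repo's code (the real definition lives in `generative_latent_prior/glp/denoiser.py`); timestep conditioning is omitted here, which is part of why this sketch counts fewer than ~335M parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUBlock(nn.Module):
    """Residual SwiGLU feed-forward: x + W_down(SiLU(W_gate x) * W_up x)."""
    def __init__(self, d_model=2560, d_mlp=5120):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Linear(d_model, d_mlp)
        self.up = nn.Linear(d_model, d_mlp)
        self.down = nn.Linear(d_mlp, d_model)

    def forward(self, x):
        h = self.norm(x)
        return x + self.down(F.silu(self.gate(h)) * self.up(h))

class DenoiserSketch(nn.Module):
    """Hypothetical GLP denoiser shape: project 1280-d activations up,
    apply 6 SwiGLU blocks, project back. Timestep conditioning omitted."""
    def __init__(self, d_in=1280, d_model=2560, d_mlp=5120, n_layers=6):
        super().__init__()
        self.proj_in = nn.Linear(d_in, d_model)
        self.blocks = nn.ModuleList(
            [SwiGLUBlock(d_model, d_mlp) for _ in range(n_layers)]
        )
        self.proj_out = nn.Linear(d_model, d_in)

    def forward(self, x, t):
        # t (timestep) is ignored in this sketch
        h = self.proj_in(x)
        for blk in self.blocks:
            h = blk(h)
        return self.proj_out(h)
```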

## Evaluation Metrics

| Metric | Value |
|---|---|
| Loss | 0.949 |
| FID (PCA-128d) | 189.6 |
| MMD (RBF) | 0.070 |
| NLL (Hutchinson) | 805.6 |
| `gen_mean_err` | 0.152 |
| `gen_std_err` | 0.125 |

## Files

| File | Description |
|---|---|
| `final.safetensors` | Model weights (denoiser) |
| `config.yaml` | Full model + training configuration |
| `rep_statistics.pt` | Per-dimension mean/var for z-score normalization |
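
`rep_statistics.pt` feeds a simple z-score normalizer. A minimal sketch of that logic (the constructor signature and tensor layout are assumptions; the repo's `Normalizer` in `glp/denoiser.py` is authoritative):

```python
import torch

class ZScoreNormalizer:
    """Illustrative z-score normalizer built from per-dimension mean/var,
    as stored in rep_statistics.pt (exact field names are assumed)."""
    def __init__(self, mean: torch.Tensor, var: torch.Tensor, eps: float = 1e-6):
        self.mean = mean
        self.std = (var + eps).sqrt()

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.mean) / self.std

    def denormalize(self, z: torch.Tensor) -> torch.Tensor:
        return z * self.std + self.mean
```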

## Usage

### Installation

```bash
git clone https://github.com/zhangshuibai/Steering-PLMs.git
cd Steering-PLMs
pip install torch esm safetensors omegaconf einops diffusers
```

### 1. Load the Model

```python
import sys, os
sys.path.insert(0, "generative_latent_prior")

from glp.denoiser import load_glp

# Load from local path (after downloading)
model = load_glp("generative_latent_prior/runs/glp-esm2-650m-layer17-d6", device="cuda:0")

# Or load from HuggingFace directly
model = load_glp("Shuibai12138/glp-esm2-650m-layer17", device="cuda:0")
```

### 2. Generate Activations from Noise (Unconditional Sampling)

Sample synthetic ESM2-650M Layer 17 activations from the learned distribution:

```python
import torch
from glp import flow_matching

# Sample from noise
noise = torch.randn(100, 1, 1280).to("cuda:0")  # 100 samples
gen_acts = flow_matching.sample(model, noise, num_timesteps=100)

# Denormalize back to ESM2 activation space
gen_acts = model.normalizer.denormalize(gen_acts)  # (100, 1, 1280)
```

### 3. On-Manifold Projection (SDEdit for Protein Steering)

The primary use case: after applying a steering vector to ESM2 activations, project the steered activations back onto the natural protein manifold to maintain sequence naturalness.

```python
import torch
from glp import flow_matching

def project_on_manifold(model, acts, u=0.5, num_timesteps=20):
    """
    SDEdit-style projection.

    Args:
        model: loaded GLP model
        acts: (B, 1, 1280) raw ESM2 activations (possibly steered)
        u: noise level (0=no change, 0.5=moderate, 1.0=full reconstruction)
        num_timesteps: denoising steps

    Returns:
        (B, 1, 1280) on-manifold activations
    """
    model.scheduler.set_timesteps(num_timesteps)

    # Normalize into the denoiser's z-scored latent space
    latents = model.normalizer.normalize(acts)

    # Add noise up to level u
    noise = torch.randn_like(latents)
    noisy_latents, _, timesteps, _ = flow_matching.fm_prepare(
        model.scheduler, latents, noise,
        u=torch.ones(latents.shape[0], device=latents.device) * u,
    )

    # Denoise back to the data manifold (SDEdit)
    latents = flow_matching.sample_on_manifold(
        model, noisy_latents,
        start_timestep=timesteps[0].item(),
        num_timesteps=num_timesteps,
    )

    # Denormalize back to ESM2 activation space
    return model.normalizer.denormalize(latents)
```

### 4. Full Steering + GLP Pipeline

Generate steered protein sequences with on-manifold projection:

```bash
python steering_with_glp.py \
    --glp_path generative_latent_prior/runs/glp-esm2-650m-layer17-d6 \
    --gpu_gen cuda:0 --gpu_ppl 0 1 2 3 \
    --n_gen 100 --u 0.5
```

This script:

  1. Loads ESM2-650M and applies steering vectors at Layer 17
  2. Projects steered activations on-manifold via GLP SDEdit
  3. Generates protein sequences via iterative mask-predict
  4. Evaluates solubility (oracle) and naturalness (pPPL with ESM2-3B)
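
Step 3's iterative mask-predict decoding can be sketched generically: start fully masked, and at each round commit only the most confident predictions while re-masking the rest, unmasking more positions each round. This is an illustrative version, not the script's exact loop (the `predict_fn` interface and linear unmasking schedule are assumptions):

```python
import torch

def mask_predict(predict_fn, length, mask_id, n_iters=8):
    """Generic iterative mask-predict decoding (sketch).

    predict_fn(tokens) -> (length, vocab_size) logits for each position.
    Each iteration keeps the n_keep most confident predictions and
    re-masks everything else; n_keep grows linearly to full length.
    """
    tokens = torch.full((length,), mask_id, dtype=torch.long)
    for it in range(1, n_iters + 1):
        probs = predict_fn(tokens).softmax(-1)
        conf, pred = probs.max(-1)
        n_keep = max(1, (length * it) // n_iters)  # linear unmasking schedule
        keep = conf.topk(n_keep).indices
        tokens = torch.full((length,), mask_id, dtype=torch.long)
        tokens[keep] = pred[keep]
    return tokens
```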

## Key Code Files

| File | Role |
|---|---|
| `steering_with_glp.py` | Main entry: steering + GLP projection + sequence generation + evaluation |
| `generative_latent_prior/glp/denoiser.py` | Model definition: `Normalizer`, `Denoiser`, `GLP`, `load_glp()` |
| `generative_latent_prior/glp/flow_matching.py` | `fm_prepare()`, `sample()`, `sample_on_manifold()` |
| `generative_latent_prior/glp/script_steer.py` | Generic steering utilities: `addition_intervention()`, `postprocess_on_manifold_wrapper()` |
| `generative_latent_prior/glp/script_eval.py` | FID evaluation and PCA visualization |

## Training

This model was trained using the offline pipeline (`glp_train.py`), which:

  1. Pre-extracts ESM2-650M Layer 17 activations from UniRef50 to disk (~1.1TB)
  2. Computes per-dimension mean/var statistics (rep_statistics.pt)
  3. Trains the GLP denoiser via flow matching on the cached activations
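
Step 3's objective, under the rectified-flow convention used by diffusers' `FlowMatchEulerDiscreteScheduler`, can be sketched as: interpolate between data and noise, then regress the denoiser onto the constant velocity `noise - x0`. This is illustrative; the actual loss lives in `glp/flow_matching.py`:

```python
import torch

def flow_matching_loss(denoiser, x0):
    """One flow-matching training step (sketch, rectified-flow convention).

    x0: (B, 1, D) z-score normalized activations.
    Interpolates x_t = (1 - t) * x0 + t * noise and regresses the
    denoiser onto the velocity field noise - x0.
    """
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1)
    x_t = (1 - t) * x0 + t * noise
    target_v = noise - x0
    pred_v = denoiser(x_t, t.flatten())
    return torch.nn.functional.mse_loss(pred_v, target_v)
```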

For full training details, see `config.yaml` and the training code in the GitHub repository.

## How GLP Works

GLP (Generative Latent Prior) learns the distribution of ESM2 internal activations using flow matching:

```
Training:    real ESM2 activations → z-score normalize → flow matching loss (predict velocity field)
Sampling:    Gaussian noise → denoise with learned velocity → denormalize → synthetic activations
SDEdit:      steered activation → normalize → add noise (level u) → denoise → denormalize → on-manifold activation
```

The key insight: steering vectors can push activations off the natural protein manifold, degrading sequence quality. GLP's SDEdit projection pulls them back while preserving the steering direction.
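
The SDEdit projection can be sketched under the same rectified-flow convention (illustrative, not the repo's `sample_on_manifold()`): noise the input to level `u`, then Euler-integrate the learned velocity field back to `t = 0`:

```python
import torch

def sdedit_project(denoiser, x, u=0.5, n_steps=20):
    """SDEdit on a flow-matching model (sketch).

    Noises x to level u via x_u = (1 - u) * x + u * noise, then
    integrates dx/dt = v(x, t) from t = u down to t = 0 with Euler
    steps. Small u stays close to the input (preserving the steering
    direction); large u resamples more of it.
    """
    t = torch.full((x.shape[0],), u, device=x.device)
    x_t = (1 - u) * x + u * torch.randn_like(x)
    dt = u / n_steps
    for _ in range(n_steps):
        v = denoiser(x_t, t)   # predicted velocity (noise - x0 direction)
        x_t = x_t - v * dt     # Euler step toward t = 0
        t = t - dt
    return x_t
```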

## Citation

```bibtex
@misc{steering-plms,
  title={Steering Protein Language Models},
  author={Zhang, Shuibai},
  year={2025},
  url={https://github.com/zhangshuibai/Steering-PLMs}
}
```