# GLP-ESM2-650M-Layer17

A Generative Latent Prior (GLP) model trained on ESM2-650M Layer 17 activations from UniRef50 protein sequences. GLP learns the distribution of protein representations via flow matching, enabling on-manifold projection during protein sequence steering.

## Model Details

| Property | Value |
|---|---|
| Base PLM | ESM2-650M (`esm2_t33_650M_UR50D`) |
| Extraction Layer | 17 (of 33) |
| Input Dimension | 1280 |
| Denoiser Architecture | 6-layer SwiGLU MLP, `d_model=2560`, `d_mlp=5120` |
| Denoiser Parameters | ~335M |
| Training Data | UniRef50 (~58M sequences, offline extraction) |
| Training | 1 epoch, `batch_size=16384`, `lr=5e-5`, cosine schedule |
| Framework | Flow matching (`FlowMatchEulerDiscreteScheduler`) |
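
For reference, the denoiser's shape can be sketched as a stack of residual SwiGLU MLP blocks. This is an illustrative reconstruction from the table above, not the repo's code (the real definition lives in `generative_latent_prior/glp/denoiser.py`); timestep conditioning is omitted here, which is part of why this sketch counts fewer than ~335M parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUBlock(nn.Module):
    """Residual SwiGLU feed-forward: x + W_down(SiLU(W_gate x) * W_up x)."""
    def __init__(self, d_model=2560, d_mlp=5120):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Linear(d_model, d_mlp)
        self.up = nn.Linear(d_model, d_mlp)
        self.down = nn.Linear(d_mlp, d_model)

    def forward(self, x):
        h = self.norm(x)
        return x + self.down(F.silu(self.gate(h)) * self.up(h))

class DenoiserSketch(nn.Module):
    """Hypothetical GLP denoiser shape: project 1280-d activations up,
    apply 6 SwiGLU blocks, project back. Timestep conditioning omitted."""
    def __init__(self, d_in=1280, d_model=2560, d_mlp=5120, n_layers=6):
        super().__init__()
        self.proj_in = nn.Linear(d_in, d_model)
        self.blocks = nn.ModuleList(
            [SwiGLUBlock(d_model, d_mlp) for _ in range(n_layers)]
        )
        self.proj_out = nn.Linear(d_model, d_in)

    def forward(self, x, t):
        # t (timestep) is ignored in this sketch
        h = self.proj_in(x)
        for blk in self.blocks:
            h = blk(h)
        return self.proj_out(h)
```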

## Evaluation Metrics

| Metric | Value |
|---|---|
| Loss | 0.949 |
| FID (PCA-128d) | 189.6 |
| MMD (RBF) | 0.070 |
| NLL (Hutchinson) | 805.6 |
| `gen_mean_err` | 0.152 |
| `gen_std_err` | 0.125 |

## Files

| File | Description |
|---|---|
| `final.safetensors` | Model weights (denoiser) |
| `config.yaml` | Full model + training configuration |
| `rep_statistics.pt` | Per-dimension mean/var for z-score normalization |
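
`rep_statistics.pt` feeds a simple z-score normalizer. A minimal sketch of that logic (the constructor signature and tensor layout are assumptions; the repo's `Normalizer` in `glp/denoiser.py` is authoritative):

```python
import torch

class ZScoreNormalizer:
    """Illustrative z-score normalizer built from per-dimension mean/var,
    as stored in rep_statistics.pt (exact field names are assumed)."""
    def __init__(self, mean: torch.Tensor, var: torch.Tensor, eps: float = 1e-6):
        self.mean = mean
        self.std = (var + eps).sqrt()

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.mean) / self.std

    def denormalize(self, z: torch.Tensor) -> torch.Tensor:
        return z * self.std + self.mean
```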

## Usage

### Installation

```bash
git clone https://github.com/zhangshuibai/Steering-PLMs.git
cd Steering-PLMs
pip install torch esm safetensors omegaconf einops diffusers
```

### 1. Load the Model

```python
import sys, os
sys.path.insert(0, "generative_latent_prior")

from glp.denoiser import load_glp

# Load from local path (after downloading)
model = load_glp("generative_latent_prior/runs/glp-esm2-650m-layer17-d6", device="cuda:0")

# Or load from HuggingFace directly
model = load_glp("Shuibai12138/glp-esm2-650m-layer17", device="cuda:0")
```

### 2. Generate Activations from Noise (Unconditional Sampling)

Sample synthetic ESM2-650M Layer 17 activations from the learned distribution:

```python
import torch
from glp import flow_matching

# Sample from noise
noise = torch.randn(100, 1, 1280).to("cuda:0")  # 100 samples
gen_acts = flow_matching.sample(model, noise, num_timesteps=100)

# Denormalize back to ESM2 activation space
gen_acts = model.normalizer.denormalize(gen_acts)  # (100, 1, 1280)
```

### 3. On-Manifold Projection (SDEdit for Protein Steering)

The primary use case: after applying a steering vector to ESM2 activations, project the steered activations back onto the natural protein manifold to maintain sequence naturalness.

```python
import torch
from glp import flow_matching

def project_on_manifold(model, acts, u=0.5, num_timesteps=20):
    """
    SDEdit-style projection.

    Args:
        model: loaded GLP model
        acts: (B, 1, 1280) raw ESM2 activations (possibly steered)
        u: noise level (0=no change, 0.5=moderate, 1.0=full reconstruction)
        num_timesteps: denoising steps

    Returns:
        (B, 1, 1280) on-manifold activations
    """
    model.scheduler.set_timesteps(num_timesteps)

    # Normalize into the denoiser's z-scored latent space
    latents = model.normalizer.normalize(acts)

    # Add noise up to level u
    noise = torch.randn_like(latents)
    noisy_latents, _, timesteps, _ = flow_matching.fm_prepare(
        model.scheduler, latents, noise,
        u=torch.ones(latents.shape[0], device=latents.device) * u,
    )

    # Denoise back to the data manifold (SDEdit)
    latents = flow_matching.sample_on_manifold(
        model, noisy_latents,
        start_timestep=timesteps[0].item(),
        num_timesteps=num_timesteps,
    )

    # Denormalize back to ESM2 activation space
    return model.normalizer.denormalize(latents)
```

### 4. Full Steering + GLP Pipeline

Generate steered protein sequences with on-manifold projection:

```bash
python steering_with_glp.py \
    --glp_path generative_latent_prior/runs/glp-esm2-650m-layer17-d6 \
    --gpu_gen cuda:0 --gpu_ppl 0 1 2 3 \
    --n_gen 100 --u 0.5
```

This script:

  1. Loads ESM2-650M and applies steering vectors at Layer 17
  2. Projects steered activations on-manifold via GLP SDEdit
  3. Generates protein sequences via iterative mask-predict
  4. Evaluates solubility (oracle) and naturalness (pPPL with ESM2-3B)
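
Step 3's iterative mask-predict decoding can be sketched generically: start fully masked, and at each round commit only the most confident predictions while re-masking the rest, unmasking more positions each round. This is an illustrative version, not the script's exact loop (the `predict_fn` interface and linear unmasking schedule are assumptions):

```python
import torch

def mask_predict(predict_fn, length, mask_id, n_iters=8):
    """Generic iterative mask-predict decoding (sketch).

    predict_fn(tokens) -> (length, vocab_size) logits for each position.
    Each iteration keeps the n_keep most confident predictions and
    re-masks everything else; n_keep grows linearly to full length.
    """
    tokens = torch.full((length,), mask_id, dtype=torch.long)
    for it in range(1, n_iters + 1):
        probs = predict_fn(tokens).softmax(-1)
        conf, pred = probs.max(-1)
        n_keep = max(1, (length * it) // n_iters)  # linear unmasking schedule
        keep = conf.topk(n_keep).indices
        tokens = torch.full((length,), mask_id, dtype=torch.long)
        tokens[keep] = pred[keep]
    return tokens
```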

## Key Code Files

| File | Role |
|---|---|
| `steering_with_glp.py` | Main entry: steering + GLP projection + sequence generation + evaluation |
| `generative_latent_prior/glp/denoiser.py` | Model definition: `Normalizer`, `Denoiser`, `GLP`, `load_glp()` |
| `generative_latent_prior/glp/flow_matching.py` | `fm_prepare()`, `sample()`, `sample_on_manifold()` |
| `generative_latent_prior/glp/script_steer.py` | Generic steering utilities: `addition_intervention()`, `postprocess_on_manifold_wrapper()` |
| `generative_latent_prior/glp/script_eval.py` | FID evaluation and PCA visualization |

## Training

This model was trained using the offline pipeline (`glp_train.py`), which:

  1. Pre-extracts ESM2-650M Layer 17 activations from UniRef50 to disk (~1.1TB)
  2. Computes per-dimension mean/var statistics (rep_statistics.pt)
  3. Trains the GLP denoiser via flow matching on the cached activations
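
Step 3's objective, under the rectified-flow convention used by diffusers' `FlowMatchEulerDiscreteScheduler`, can be sketched as: interpolate between data and noise, then regress the denoiser onto the constant velocity `noise - x0`. This is illustrative; the actual loss lives in `glp/flow_matching.py`:

```python
import torch

def flow_matching_loss(denoiser, x0):
    """One flow-matching training step (sketch, rectified-flow convention).

    x0: (B, 1, D) z-score normalized activations.
    Interpolates x_t = (1 - t) * x0 + t * noise and regresses the
    denoiser onto the velocity field noise - x0.
    """
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1)
    x_t = (1 - t) * x0 + t * noise
    target_v = noise - x0
    pred_v = denoiser(x_t, t.flatten())
    return torch.nn.functional.mse_loss(pred_v, target_v)
```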

For full training details, see `config.yaml` and the training code in the GitHub repository.

## How GLP Works

GLP (Generative Latent Prior) learns the distribution of ESM2 internal activations using flow matching:

```
Training:    real ESM2 activations → z-score normalize → flow matching loss (predict velocity field)
Sampling:    Gaussian noise → denoise with learned velocity → denormalize → synthetic activations
SDEdit:      steered activation → normalize → add noise (level u) → denoise → denormalize → on-manifold activation
```

The key insight: steering vectors can push activations off the natural protein manifold, degrading sequence quality. GLP's SDEdit projection pulls them back while preserving the steering direction.
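
The SDEdit projection can be sketched under the same rectified-flow convention (illustrative, not the repo's `sample_on_manifold()`): noise the input to level `u`, then Euler-integrate the learned velocity field back to `t = 0`:

```python
import torch

def sdedit_project(denoiser, x, u=0.5, n_steps=20):
    """SDEdit on a flow-matching model (sketch).

    Noises x to level u via x_u = (1 - u) * x + u * noise, then
    integrates dx/dt = v(x, t) from t = u down to t = 0 with Euler
    steps. Small u stays close to the input (preserving the steering
    direction); large u resamples more of it.
    """
    t = torch.full((x.shape[0],), u, device=x.device)
    x_t = (1 - u) * x + u * torch.randn_like(x)
    dt = u / n_steps
    for _ in range(n_steps):
        v = denoiser(x_t, t)   # predicted velocity (noise - x0 direction)
        x_t = x_t - v * dt     # Euler step toward t = 0
        t = t - dt
    return x_t
```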

## Citation

```bibtex
@misc{steering-plms,
  title={Steering Protein Language Models},
  author={Zhang, Shuibai},
  year={2025},
  url={https://github.com/zhangshuibai/Steering-PLMs}
}
```