# GLP-ESM2-650M-Layer17
A Generative Latent Prior (GLP) model trained on ESM2-650M Layer 17 activations from UniRef50 protein sequences. GLP learns the distribution of protein representations via flow matching, enabling on-manifold projection during protein sequence steering.
## Model Details
| Property | Value |
|---|---|
| Base PLM | ESM2-650M (esm2_t33_650M_UR50D) |
| Extraction Layer | 17 (of 33) |
| Input Dimension | 1280 |
| Denoiser Architecture | 6-layer SwiGLU MLP, d_model=2560, d_mlp=5120 |
| Denoiser Parameters | ~335M |
| Training Data | UniRef50 (~58M sequences, offline extraction) |
| Training | 1 epoch, batch_size=16384, lr=5e-5, cosine schedule |
| Framework | Flow Matching (FlowMatchEulerDiscreteScheduler) |
## Evaluation Metrics
| Metric | Value |
|---|---|
| Loss | 0.949 |
| FID (PCA-128d) | 189.6 |
| MMD (RBF) | 0.070 |
| NLL (Hutchinson) | 805.6 |
| gen_mean_err | 0.152 |
| gen_std_err | 0.125 |
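One plausible reading of `gen_mean_err` / `gen_std_err` is the average per-dimension discrepancy between the first two moments of generated and real activations. The sketch below illustrates that reading; the repo's exact definition may differ, and `moment_errors` is a hypothetical helper:

```python
import torch

def moment_errors(real: torch.Tensor, gen: torch.Tensor):
    """Mean absolute error between per-dimension means and stds of real
    vs. generated activations (an assumed reading of gen_mean_err /
    gen_std_err; check the repo's eval code for the actual definition)."""
    mean_err = (real.mean(dim=0) - gen.mean(dim=0)).abs().mean().item()
    std_err = (real.std(dim=0) - gen.std(dim=0)).abs().mean().item()
    return mean_err, std_err

torch.manual_seed(0)
real = torch.randn(1000, 1280)
gen = torch.randn(1000, 1280)
m_same, s_same = moment_errors(real, real)  # identical sets -> zero error
m_diff, s_diff = moment_errors(real, gen)   # small positive sampling error
```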
## Files

| File | Description |
|---|---|
| `final.safetensors` | Model weights (denoiser) |
| `config.yaml` | Full model + training configuration |
| `rep_statistics.pt` | Per-dimension mean/var for z-score normalization |
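As a rough sketch of how the statistics in `rep_statistics.pt` could be applied, assuming the file holds per-dimension `"mean"` and `"var"` tensors (an assumption; the repo's `Normalizer` class defines the actual format):

```python
import torch

# Stand-in for torch.load("rep_statistics.pt"); the keys "mean" and "var"
# are assumed, not confirmed by the repo.
stats = {"mean": torch.zeros(1280), "var": torch.ones(1280)}

def zscore(acts: torch.Tensor, stats: dict, eps: float = 1e-6) -> torch.Tensor:
    """Z-score normalize raw ESM2 activations per dimension."""
    return (acts - stats["mean"]) / (stats["var"] + eps).sqrt()

acts = torch.randn(4, 1, 1280)
normed = zscore(acts, stats)  # same shape, normalized per dimension
```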
## Usage
### Installation

```bash
git clone https://github.com/zhangshuibai/Steering-PLMs.git
cd Steering-PLMs
pip install torch esm safetensors omegaconf einops diffusers
```
### 1. Load the Model

```python
import sys, os
sys.path.insert(0, "generative_latent_prior")
from glp.denoiser import load_glp

# Load from local path (after downloading)
model = load_glp("generative_latent_prior/runs/glp-esm2-650m-layer17-d6", device="cuda:0")

# Or load from HuggingFace directly
model = load_glp("Shuibai12138/glp-esm2-650m-layer17", device="cuda:0")
```
### 2. Generate Activations from Noise (Unconditional Sampling)

Sample synthetic ESM2-650M Layer 17 activations from the learned distribution:

```python
import torch
from glp import flow_matching

# Sample from noise
noise = torch.randn(100, 1, 1280).to("cuda:0")  # 100 samples
gen_acts = flow_matching.sample(model, noise, num_timesteps=100)

# Denormalize back to ESM2 activation space
gen_acts = model.normalizer.denormalize(gen_acts)  # (100, 1, 1280)
```
### 3. On-Manifold Projection (SDEdit for Protein Steering)

The primary use case: after applying a steering vector to ESM2 activations, project the steered activations back onto the natural protein manifold to maintain sequence naturalness.

```python
import torch
from glp import flow_matching

def project_on_manifold(model, acts, u=0.5, num_timesteps=20):
    """
    SDEdit-style projection.

    Args:
        model: loaded GLP model
        acts: (B, 1, 1280) raw ESM2 activations (possibly steered)
        u: noise level (0 = no change, 0.5 = moderate, 1.0 = full reconstruction)
        num_timesteps: denoising steps

    Returns:
        (B, 1, 1280) on-manifold activations
    """
    model.scheduler.set_timesteps(num_timesteps)

    # Normalize into the GLP latent space
    latents = model.normalizer.normalize(acts)

    # Add noise up to level u
    noise = torch.randn_like(latents)
    noisy_latents, _, timesteps, _ = flow_matching.fm_prepare(
        model.scheduler, latents, noise,
        u=torch.ones(latents.shape[0], device=latents.device) * u,
    )

    # Denoise (SDEdit)
    latents = flow_matching.sample_on_manifold(
        model, noisy_latents,
        start_timestep=timesteps[0].item(),
        num_timesteps=num_timesteps,
    )

    # Map back to ESM2 activation space
    return model.normalizer.denormalize(latents)
```
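The noise level `u` interpolates between keeping the steered activations untouched and fully regenerating them. A self-contained toy illustration, assuming the standard flow-matching interpolation `x_u = (1 - u) * x + u * noise` (the repo's `fm_prepare` may differ in scheduler details):

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 1, 1280)      # stand-in for steered, normalized latents
noise = torch.randn_like(x)

def add_noise(x, noise, u):
    # Assumed flow-matching interpolation between data and noise
    return (1 - u) * x + u * noise

x_keep = add_noise(x, noise, 0.0)  # u=0: latents unchanged, no projection
x_full = add_noise(x, noise, 1.0)  # u=1: pure noise, full regeneration
x_mid = add_noise(x, noise, 0.5)   # u=0.5: moderate edit strength
```

Higher `u` gives the denoiser more freedom to pull the activation back onto the manifold, at the cost of weakening the steering signal.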
### 4. Full Steering + GLP Pipeline

Generate steered protein sequences with on-manifold projection:

```bash
python steering_with_glp.py \
    --glp_path generative_latent_prior/runs/glp-esm2-650m-layer17-d6 \
    --gpu_gen cuda:0 --gpu_ppl 0 1 2 3 \
    --n_gen 100 --u 0.5
```
This script:
- Loads ESM2-650M and applies steering vectors at Layer 17
- Projects steered activations on-manifold via GLP SDEdit
- Generates protein sequences via iterative mask-predict
- Evaluates solubility (oracle) and naturalness (pPPL with ESM2-3B)
## Key Code Files

| File | Role |
|---|---|
| `steering_with_glp.py` | Main entry: steering + GLP projection + sequence generation + evaluation |
| `generative_latent_prior/glp/denoiser.py` | Model definition: `Normalizer`, `Denoiser`, `GLP`, `load_glp()` |
| `generative_latent_prior/glp/flow_matching.py` | `fm_prepare()`, `sample()`, `sample_on_manifold()` |
| `generative_latent_prior/glp/script_steer.py` | Generic steering utilities: `addition_intervention()`, `postprocess_on_manifold_wrapper()` |
| `generative_latent_prior/glp/script_eval.py` | FID evaluation and PCA visualization |
## Training

This model was trained using the offline pipeline (`glp_train.py`), which:
- Pre-extracts ESM2-650M Layer 17 activations from UniRef50 to disk (~1.1TB)
- Computes per-dimension mean/var statistics (`rep_statistics.pt`)
- Trains the GLP denoiser via flow matching on the cached activations
For training details, see:
- Training script: `generative_latent_prior/glp_train.py`
- Evaluation metrics doc: `generative_latent_prior/docs/eval_metrics.md`
- Config used: `config.yaml` in this repo
## How GLP Works

GLP (Generative Latent Prior) learns the distribution of ESM2 internal activations using flow matching:

- **Training:** real ESM2 activations → z-score normalize → flow matching loss (predict velocity field)
- **Sampling:** Gaussian noise → denoise with learned velocity → denormalize → synthetic activations
- **SDEdit:** steered activation → normalize → add noise (level u) → denoise → denormalize → on-manifold activation

The key insight: steering vectors can push activations off the natural protein manifold, degrading sequence quality. GLP's SDEdit projection pulls them back while preserving the steering direction.
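The training and sampling stages above can be sketched end-to-end with a minimal rectified-flow example in plain PyTorch. This is a toy illustration, not the repo's denoiser: a shifted 2D Gaussian stands in for real activations, and a small MLP learns the velocity field `v(x_t, t)` with target `noise - x0` along the interpolation `x_t = (1 - t) * x0 + t * noise`:

```python
import torch

torch.manual_seed(0)
dim = 2

# Tiny velocity-field model: input is [x_t, t], output is predicted velocity
net = torch.nn.Sequential(
    torch.nn.Linear(dim + 1, 64), torch.nn.SiLU(), torch.nn.Linear(64, dim)
)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

def data(n):
    # "Real activations": a Gaussian shifted to mean [3, -2]
    return torch.randn(n, dim) + torch.tensor([3.0, -2.0])

# Training: regress the velocity field (predict noise - x0)
for _ in range(500):
    x0 = data(256)
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * noise
    v_pred = net(torch.cat([xt, t], dim=1))
    loss = ((v_pred - (noise - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: Euler-integrate from pure noise (t=1) back to data (t=0)
x = torch.randn(512, dim)
steps = 50
for i in range(steps):
    t = torch.full((x.shape[0], 1), 1 - i / steps)
    x = x - net(torch.cat([x, t], dim=1)) / steps

sample_mean = x.mean(dim=0)  # should land near the data mean [3, -2]
```

The real model replaces the toy MLP with the 6-layer SwiGLU denoiser and the hand-rolled Euler loop with `FlowMatchEulerDiscreteScheduler`, but the mechanics are the same.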
## Citation

```bibtex
@misc{steering-plms,
  title={Steering Protein Language Models},
  author={Zhang, Shuibai},
  year={2025},
  url={https://github.com/zhangshuibai/Steering-PLMs}
}
```