No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation
Paper: arXiv:2602.01845
Proust is a 309M-parameter causal protein language model (PLM) introduced in the paper No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation.
The model bridges the divide between masked language models (MLMs), which excel at fitness prediction, and causal models, which enable generation. Proust achieves competitive performance on ProteinGym benchmarks while retaining native generative capabilities.
To use this model, please follow the installation instructions in the official GitHub repository.
```python
from proust_inference import load_model

# Downloads the checkpoint from Hugging Face on first call and
# loads it to CUDA in bfloat16.
model = load_model()
```
To score a sequence, compute its mean per-token log-likelihood under the model:

```python
import torch
from proust_inference import load_model, tokenize

model = load_model()
ids = tokenize("MKTLLILAVLCLGFASSALA", device="cuda")

with torch.no_grad():
    logits = model(ids.unsqueeze(0))  # (1, seq_len, vocab_size)

# Per-token log probabilities
log_probs = logits.float().log_softmax(dim=-1)

# Shift by one: the logits at position t predict token t+1
token_log_probs = log_probs[0, :-1].gather(1, ids[1:].unsqueeze(1)).squeeze(1)
print(f"Mean log-likelihood: {token_log_probs.mean().item():.4f}")
```
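Causal PLMs are typically used for zero-shot fitness estimation by comparing sequence log-likelihoods: a variant scoring higher than the wild type is predicted to be at least as fit. The helpers below are an illustrative sketch of that protocol, not part of `proust_inference`; they wrap the scoring recipe above, with the device made a parameter for flexibility.

```python
import torch


def mean_log_likelihood(model, tokenize, seq: str, device: str = "cuda") -> float:
    """Mean per-token log-likelihood of `seq` under a causal PLM.

    Illustrative helper (not part of proust_inference): `model` maps a
    (1, T) id tensor to (1, T, V) logits, and `tokenize` maps a string
    to a (T,) id tensor, as in the scoring example above.
    """
    ids = tokenize(seq, device=device)
    with torch.no_grad():
        logits = model(ids.unsqueeze(0))
    log_probs = logits.float().log_softmax(dim=-1)
    # Shift by one: logits at position t predict token t+1
    token_lp = log_probs[0, :-1].gather(1, ids[1:].unsqueeze(1)).squeeze(1)
    return token_lp.mean().item()


def fitness_score(model, tokenize, wildtype: str, mutant: str,
                  device: str = "cuda") -> float:
    """Zero-shot fitness score: positive means the model prefers the mutant."""
    return (mean_log_likelihood(model, tokenize, mutant, device)
            - mean_log_likelihood(model, tokenize, wildtype, device))
```

Ranking variants by this score is what ProteinGym-style benchmarks correlate against experimental fitness measurements.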
To extract per-residue embeddings:

```python
import torch
from proust_inference import load_model, tokenize

model = load_model()
ids = tokenize("MKTLLILAVLCLGFASSALA", device="cuda")

with torch.no_grad():
    hidden = model.get_embeddings(ids.unsqueeze(0))  # (1, seq_len, 1024)

# Mean pooling over residues (excluding <cls> and <eos>)
embedding = hidden[0, 1:-1].mean(dim=0)  # (1024,)
```
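Because Proust is causal, it also supports autoregressive generation from the same `model(ids)` interface used above. The loop below is a generic sampling sketch, not a confirmed `proust_inference` API: the prompt ids and the end-of-sequence id would come from the official tokenizer.

```python
import torch


def sample_sequence(model, start_ids: torch.Tensor, max_new_tokens: int = 50,
                    temperature: float = 1.0, eos_id=None) -> torch.Tensor:
    """Sample from a causal LM that maps (1, T) ids to (1, T, V) logits.

    Generic sketch: `start_ids` is a (T,) prompt tensor; `eos_id`, if given,
    stops generation early. Proust's actual special-token ids are assumptions.
    """
    ids = start_ids.unsqueeze(0)  # (1, T)
    for _ in range(max_new_tokens):
        with torch.no_grad():
            next_logits = model(ids)[0, -1]  # next-token logits, (V,)
        probs = torch.softmax(next_logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, 1)  # sample one token id
        ids = torch.cat([ids, next_id.unsqueeze(0)], dim=1)
        if eos_id is not None and next_id.item() == eos_id:
            break
    return ids.squeeze(0)
```

Lowering `temperature` sharpens the distribution toward the model's top predictions; raising it increases sequence diversity.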
If you use this model, please cite:

```bibtex
@article{proust2026,
  title={No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation},
  author={Furkan Eris},
  journal={arXiv preprint arXiv:2602.01845},
  year={2026}
}
```