Add model card and usage information

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +83 -3
README.md CHANGED
@@ -1,3 +1,83 @@
- ---
- license: cc-by-nc-sa-2.0
- ---
---
license: cc-by-nc-sa-2.0
pipeline_tag: text-generation
tags:
- biology
- protein
---

# Proust v0

Proust is a 309M-parameter causal protein language model (PLM) introduced in the paper [No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation](https://huggingface.co/papers/2602.01845).

The model bridges the divide between masked language models (MLMs), which excel at fitness prediction, and causal models, which enable generation. Proust achieves competitive performance on ProteinGym benchmarks while retaining native generative capabilities.

- **Paper:** [No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation](https://huggingface.co/papers/2602.01845)
- **Code:** [Furkan9015/proust-inference](https://github.com/Furkan9015/proust-inference)

## Model Details

- **Architecture:** GQA-S2 Transformer (Grouped-Query Attention with S2 KV-sharing and VO-RoPE)
- **Parameters:** 309M
- **Configuration:** 24 layers, hidden dimension 1024, 16 attention heads, 2 KV heads
- **Vocabulary:** 32 tokens (ESM-style)
- **Innovations:** cross-layer value residuals and depthwise causal convolutions
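For quick reference, the hyperparameters above can be collected in a small Python dataclass. This is purely illustrative: `ProustConfig` and its field names are hypothetical, not the repository's actual config class.

```python
from dataclasses import dataclass

@dataclass
class ProustConfig:
    """Illustrative container for the published Proust v0 hyperparameters.

    NOTE: hypothetical class, not part of proust_inference; it only
    summarizes the numbers listed in the model card.
    """
    n_layers: int = 24
    d_model: int = 1024
    n_heads: int = 16
    n_kv_heads: int = 2
    vocab_size: int = 32  # ESM-style amino-acid vocabulary

cfg = ProustConfig()
# With 16 query heads sharing 2 KV heads, each KV head serves 8 query heads.
print(cfg.n_heads // cfg.n_kv_heads)  # → 8
```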
## Usage

To use this model, follow the installation instructions in the [official GitHub repository](https://github.com/Furkan9015/proust-inference).

### Load Model

```python
from proust_inference import load_model

# Downloads the checkpoint from the Hugging Face Hub on first call,
# then loads it onto the GPU in bfloat16.
model = load_model()
```
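The card mentions the model's native generative capabilities, but this repository's generation API is not shown here. As a sketch only, greedy decoding can be built on any causal LM whose forward pass maps `(1, seq_len)` token ids to `(1, seq_len, vocab_size)` logits; the `dummy` model below is a stand-in used just to show the loop running.

```python
import torch

def greedy_decode(model, ids, max_new_tokens=10):
    """Greedy decoding loop for any causal LM: repeatedly append the
    argmax of the logits at the final position."""
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids)           # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()  # most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return ids

# Stand-in model: always predicts token 3, just to exercise the loop.
dummy = lambda ids: torch.nn.functional.one_hot(
    torch.full((1, ids.shape[1]), 3), num_classes=32
).float()
out = greedy_decode(dummy, torch.tensor([[0, 5, 7]]), max_new_tokens=4)
print(out.shape)  # torch.Size([1, 7])
```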
### Score a protein sequence (log-likelihood)

```python
import torch
from proust_inference import load_model, tokenize

model = load_model()

ids = tokenize("MKTLLILAVLCLGFASSALA", device="cuda")
with torch.no_grad():
    logits = model(ids.unsqueeze(0))  # (1, seq_len, vocab_size)

# Per-token log probabilities
log_probs = logits.float().log_softmax(dim=-1)
# Shift: the logits at position t predict token t+1
token_log_probs = log_probs[0, :-1].gather(1, ids[1:].unsqueeze(1)).squeeze(1)
print(f"Mean log-likelihood: {token_log_probs.mean().item():.4f}")
```
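The shift-and-gather step above can be sanity-checked on CPU with random logits, independent of the model: position `t`'s distribution is scored against the token at `t+1`, yielding one log-probability per predicted token.

```python
import torch

torch.manual_seed(0)
seq_len, vocab_size = 8, 32
ids = torch.randint(vocab_size, (seq_len,))   # fake token ids
logits = torch.randn(1, seq_len, vocab_size)  # fake model output

log_probs = logits.float().log_softmax(dim=-1)
# Drop the last position's prediction; score each next token.
token_log_probs = log_probs[0, :-1].gather(1, ids[1:].unsqueeze(1)).squeeze(1)

# One score per predicted token; log-probabilities are non-positive.
print(token_log_probs.shape)  # torch.Size([7])
print(token_log_probs.mean().item())
```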
### Extract embeddings

```python
import torch
from proust_inference import load_model, tokenize

model = load_model()

ids = tokenize("MKTLLILAVLCLGFASSALA", device="cuda")
with torch.no_grad():
    hidden = model.get_embeddings(ids.unsqueeze(0))  # (1, seq_len, 1024)

# Mean pooling over residue positions (excluding <cls> and <eos>)
embedding = hidden[0, 1:-1].mean(dim=0)  # (1024,)
```
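Mean-pooled embeddings like the one above are commonly compared with cosine similarity, e.g. for retrieval or clustering. The snippet below demonstrates only the shapes and the pooling step, using random tensors in place of real hidden states.

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden):
    """Mean-pool per-residue embeddings, dropping <cls> and <eos>."""
    return hidden[0, 1:-1].mean(dim=0)

# Random stand-ins for two sequences' hidden states, shape (1, seq_len, 1024).
torch.manual_seed(0)
a = mean_pool(torch.randn(1, 22, 1024))
b = mean_pool(torch.randn(1, 18, 1024))

sim = F.cosine_similarity(a, b, dim=0).item()
print(f"cosine similarity: {sim:.3f}")  # always in [-1, 1]
```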
## Citation

```bibtex
@article{proust2025,
  title={No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation},
  author={Nappenstance Authors},
  journal={arXiv preprint arXiv:2602.01845},
  year={2025}
}
```