lhallee committed on
Commit
c9346a6
·
verified ·
1 Parent(s): ee18929

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +260 -260
README.md CHANGED
@@ -1,260 +1,260 @@
1
- ---
2
- library_name: transformers
3
- license: mit
4
- tags:
5
- - biology
6
- - esm
7
- - protein
8
- - protein-language-model
9
- - masked-language-modeling
10
- ---
11
-
12
- # ESM++ 6B
13
-
14
- [ESM++](https://github.com/Synthyra/FastPLMs) is a Hugging Face compatible implementation of [Biohub ESMC](https://biohub.ai/esm/protein) ([license](https://github.com/Biohub/esm/blob/main/LICENSE.md)).
15
- This checkpoint corresponds to the 6 billion parameter ESMC model released as [`biohub/ESMC-6B`](https://huggingface.co/biohub/ESMC-6B).
16
-
17
- This repository includes the Biohub ESM MIT license in `LICENSE`.
18
-
19
- The 6B model has 80 transformer layers, hidden size 2560, and 40 attention heads. It is large enough that `dtype=torch.bfloat16` or `torch.float16` plus `device_map="auto"` is usually the practical loading path.
20
-
21
- ## Attention Backends
22
-
23
- `sdpa` is the default backend. Set `config.attn_backend` before loading if you want a different attention implementation.
24
-
25
- | Backend | Key | Notes |
26
- | :--- | :--- | :--- |
27
- | PyTorch SDPA | `"sdpa"` | Default. Exact numerics and stable on all hardware. |
28
- | Flash Attention | `"kernels_flash"` | Fastest on Ampere/Hopper GPUs when `kernels` is installed. Outputs are not bitwise identical to SDPA. |
29
- | Flex Attention | `"flex"` | Skips padding tokens via block masks. First use compiles a Triton kernel. |
30
- | Auto | `"auto"` | Picks the best available backend: `kernels_flash`, then `flex`, then `sdpa`. |
31
-
32
- ```python
33
- import torch
34
- from transformers import AutoConfig, AutoModelForMaskedLM
35
-
36
- config = AutoConfig.from_pretrained(
37
- "Synthyra/ESMplusplus_6B",
38
- trust_remote_code=True,
39
- )
40
- config.attn_backend = "auto"
41
-
42
- model = AutoModelForMaskedLM.from_pretrained(
43
- "Synthyra/ESMplusplus_6B",
44
- config=config,
45
- trust_remote_code=True,
46
- dtype=torch.bfloat16,
47
- device_map="auto",
48
- )
49
- ```
50
-
51
- ## Masked Language Modeling
52
-
53
- ```python
54
- import torch
55
- from transformers import AutoModelForMaskedLM
56
-
57
- model = AutoModelForMaskedLM.from_pretrained(
58
- "Synthyra/ESMplusplus_6B",
59
- trust_remote_code=True,
60
- dtype=torch.bfloat16,
61
- device_map="auto",
62
- )
63
- tokenizer = model.tokenizer
64
-
65
- sequences = ["MPRTEIN", "MSEQWENCE"]
66
- inputs = tokenizer(sequences, padding=True, return_tensors="pt")
67
- inputs = inputs.to(model.device)
68
-
69
- with torch.no_grad():
70
- output = model(**inputs)
71
-
72
- print(output.logits.shape)
73
- print(output.last_hidden_state.shape)
74
- ```
75
-
76
- Pass `output_hidden_states=True` if you need all intermediate hidden states.
77
-
78
- ## Embed Datasets
79
-
80
- All FastPLMs sequence models include `embed_dataset`, which handles batching, length sorting, pooling, FASTA parsing, optional resume from existing outputs, and `.pth` or SQLite storage.
81
-
82
- ```python
83
- import torch
84
- from transformers import AutoModelForMaskedLM
85
-
86
- model = AutoModelForMaskedLM.from_pretrained(
87
- "Synthyra/ESMplusplus_6B",
88
- trust_remote_code=True,
89
- dtype=torch.bfloat16,
90
- device_map="auto",
91
- )
92
-
93
- embedding_dict = model.embed_dataset(
94
- sequences=[
95
- "MALWMRLLPLLALLALWGPDPAAA",
96
- "MSEQWENCE",
97
- "MPRTEIN",
98
- ],
99
- batch_size=1,
100
- max_len=1024,
101
- full_embeddings=False,
102
- embed_dtype=torch.float32,
103
- pooling_types=["mean", "cls"],
104
- num_workers=0,
105
- save=True,
106
- save_path="esmplusplus_6b_embeddings.pth",
107
- )
108
-
109
- print(embedding_dict["MPRTEIN"].shape)
110
- ```
111
-
112
- For residue-level embeddings, set `full_embeddings=True`:
113
-
114
- ```python
115
- residue_embeddings = model.embed_dataset(
116
- sequences=["MALWMRLLPLLALLALWGPDPAAA"],
117
- batch_size=1,
118
- max_len=1024,
119
- full_embeddings=True,
120
- embed_dtype=torch.float32,
121
- save=False,
122
- )
123
- ```
124
-
125
- For very large datasets, write embeddings directly to SQLite:
126
-
127
- ```python
128
- model.embed_dataset(
129
- fasta_path="proteins.fasta",
130
- batch_size=1,
131
- max_len=1024,
132
- pooling_types=["mean"],
133
- sql=True,
134
- sql_db_path="esmplusplus_6b_embeddings.db",
135
- save=False,
136
- )
137
- ```
138
-
139
- `embed_dataset` returns a dictionary when `sql=False`. With `sql=True`, embeddings are written to the database and loaded as needed.
140
-
141
- ## Classification Heads
142
-
143
- ESM++ supports sequence-level and token-level classification through the standard Transformers auto classes.
144
-
145
- ```python
146
- import torch
147
- from transformers import AutoModelForSequenceClassification
148
-
149
- model = AutoModelForSequenceClassification.from_pretrained(
150
- "Synthyra/ESMplusplus_6B",
151
- num_labels=2,
152
- trust_remote_code=True,
153
- dtype=torch.bfloat16,
154
- device_map="auto",
155
- )
156
-
157
- tokenized = model.tokenizer(
158
- ["MPRTEIN", "MSEQWENCE"],
159
- padding=True,
160
- return_tensors="pt",
161
- ).to(model.device)
162
-
163
- with torch.no_grad():
164
- logits = model(**tokenized).logits
165
-
166
- print(logits.shape)
167
- ```
168
-
169
- ## LoRA Fine-Tuning
170
-
171
- ```python
172
- from peft import LoraConfig, get_peft_model
173
- from transformers import AutoModelForSequenceClassification
174
-
175
- model = AutoModelForSequenceClassification.from_pretrained(
176
- "Synthyra/ESMplusplus_6B",
177
- num_labels=2,
178
- trust_remote_code=True,
179
- dtype=torch.bfloat16,
180
- device_map="auto",
181
- )
182
-
183
- lora_config = LoraConfig(
184
- r=8,
185
- lora_alpha=16,
186
- lora_dropout=0.01,
187
- bias="none",
188
- target_modules=[
189
- "layernorm_qkv.1",
190
- "out_proj",
191
- "query",
192
- "key",
193
- "value",
194
- "dense",
195
- ],
196
- )
197
-
198
- model = get_peft_model(model, lora_config)
199
- ```
200
-
201
- ## Attention Maps
202
-
203
- Optimized attention backends do not return attention maps directly. ESM++ can compute them manually with `output_attentions=True`, but this is much slower and memory-heavy for the 6B model.
204
-
205
- ```python
206
- with torch.no_grad():
207
- output = model(**inputs, output_attentions=True)
208
-
209
- attentions = output.attentions
210
- print(len(attentions))
211
- print(attentions[0].shape)
212
- ```
213
-
214
- ## Load Biohub Source Weights
215
-
216
- You can also load the Biohub source weights directly through FastPLMs:
217
-
218
- ```python
219
- from fastplms.esm_plusplus.modeling_esm_plusplus import ESMplusplusForMaskedLM
220
-
221
- model = ESMplusplusForMaskedLM.from_pretrained_esm("esmc-6b")
222
- ```
223
-
224
- The source repository is [`biohub/ESMC-6B`](https://huggingface.co/biohub/ESMC-6B).
225
- The Biohub ESM license is available at https://github.com/Biohub/esm/blob/main/LICENSE.md.
226
-
227
- ## Citation
228
-
229
- ```bibtex
230
- @misc{FastPLMs,
231
- author={Hallee, Logan and Bichara, David and Gleghorn, Jason P.},
232
- title={FastPLMs: Fast, efficient, protein language model inference from Hugging Face AutoModel.},
233
- year={2024},
234
- url={https://huggingface.co/Synthyra/ESMplusplus_6B},
235
- DOI={10.57967/hf/3726},
236
- publisher={Hugging Face}
237
- }
238
- ```
239
-
240
- ```bibtex
241
- @misc{candido2026language,
242
- title = {Language Modeling Materializes a World Model of Protein Biology},
243
- author = {Candido, Salvatore and Hayes, Thomas and Derry, Alexander and Rao, Roshan
244
- and Lin, Zeming and Verkuil, Robert and Wu, Bryan and Lee, Jin Sub
245
- and Bruguera, Elise S. and Keval, Jehan A. and Kopylov, Mykhailo
246
- and Pak, John E. and Wu, Wesley and Thomas, Neil and Mataraso, Samson
247
- and Hsu, Alvin and Trotman-Grant, Ashton C. and Fatras, Kilian
248
- and dos Santos Costa, Allan and Badkundri, Rohil and Ak{\i}n, Halil
249
- and Oktay, Deniz and Deaton, Jonathan and Montabana, Elizabeth
250
- and Sitwala, Hrishita and Yu, Yue and Wiggert, Marius
251
- and Carlin, Dylan Alexander and Goering, Anthony W. and Blazejewski, Tomasz
252
- and Sandora, McCullen and Hla, Michael and Jia, Tina Z.
253
- and Kloker, Leon H. and Sofroniew, Nicholas J. and Uehara, Masatoshi
254
- and Pannu, Jassi and Bachas, Sharrol and Liu, Daniel S.
255
- and Sercu, Tom and Rives, Alexander},
256
- year = {2026},
257
- url = {https://biohub.ai/papers/esm_protein.pdf},
258
- note = {Preprint}
259
- }
260
- ```
 
1
+ ---
2
+ library_name: transformers
3
+ license: mit
4
+ tags:
5
+ - biology
6
+ - esm
7
+ - protein
8
+ - protein-language-model
9
+ - masked-language-modeling
10
+ ---
11
+
12
+ # ESM++ 6B
13
+
14
+ [ESM++](https://github.com/Synthyra/FastPLMs) is a Hugging Face compatible implementation of [Biohub ESMC](https://biohub.ai/esm/protein) ([license](https://github.com/Biohub/esm/blob/main/LICENSE.md)).
15
+ This checkpoint corresponds to the 6 billion parameter ESMC model released as [`biohub/ESMC-6B`](https://huggingface.co/biohub/ESMC-6B).
16
+
17
+ This repository includes the Biohub ESM MIT license in `LICENSE`.
18
+
19
+ The 6B model has 80 transformer layers, hidden size 2560, and 40 attention heads. It is large enough that `dtype=torch.bfloat16` or `torch.float16` plus `device_map="auto"` is usually the practical loading path.
20
+
21
+ ## Attention Backends
22
+
23
+ `sdpa` is the default backend. Set `config.attn_backend` before loading if you want a different attention implementation.
24
+
25
+ | Backend | Key | Notes |
26
+ | :--- | :--- | :--- |
27
+ | PyTorch SDPA | `"sdpa"` | Default. Exact numerics and stable on all hardware. |
28
+ | Flash Attention | `"kernels_flash"` | Fastest on Ampere/Hopper GPUs when `kernels` is installed. Outputs are not bitwise identical to SDPA. |
29
+ | Flex Attention | `"flex"` | Skips padding tokens via block masks. First use compiles a Triton kernel. |
30
+ | Auto | `"auto"` | Picks the best available backend: `kernels_flash`, then `flex`, then `sdpa`. |
31
+
32
+ ```python
33
+ import torch
34
+ from transformers import AutoConfig, AutoModelForMaskedLM
35
+
36
+ config = AutoConfig.from_pretrained(
37
+ "Synthyra/ESMplusplus_6B",
38
+ trust_remote_code=True,
39
+ )
40
+ config.attn_backend = "auto"
41
+
42
+ model = AutoModelForMaskedLM.from_pretrained(
43
+ "Synthyra/ESMplusplus_6B",
44
+ config=config,
45
+ trust_remote_code=True,
46
+ dtype=torch.bfloat16,
47
+ device_map="auto",
48
+ )
49
+ ```
50
+
51
+ ## Masked Language Modeling
52
+
53
+ ```python
54
+ import torch
55
+ from transformers import AutoModelForMaskedLM
56
+
57
+ model = AutoModelForMaskedLM.from_pretrained(
58
+ "Synthyra/ESMplusplus_6B",
59
+ trust_remote_code=True,
60
+ dtype=torch.bfloat16,
61
+ device_map="auto",
62
+ )
63
+ tokenizer = model.tokenizer
64
+
65
+ sequences = ["MPRTEIN", "MSEQWENCE"]
66
+ inputs = tokenizer(sequences, padding=True, return_tensors="pt")
67
+ inputs = inputs.to(model.device)
68
+
69
+ with torch.no_grad():
70
+ output = model(**inputs)
71
+
72
+ print(output.logits.shape)
73
+ print(output.last_hidden_state.shape)
74
+ ```
75
+
76
+ Pass `output_hidden_states=True` if you need all intermediate hidden states.
77
+
78
+ ## Embed Datasets
79
+
80
+ All FastPLMs sequence models include `embed_dataset`, which handles batching, length sorting, pooling, FASTA parsing, optional resume from existing outputs, and `.pth` or SQLite storage.
81
+
82
+ ```python
83
+ import torch
84
+ from transformers import AutoModelForMaskedLM
85
+
86
+ model = AutoModelForMaskedLM.from_pretrained(
87
+ "Synthyra/ESMplusplus_6B",
88
+ trust_remote_code=True,
89
+ dtype=torch.bfloat16,
90
+ device_map="auto",
91
+ )
92
+
93
+ embedding_dict = model.embed_dataset(
94
+ sequences=[
95
+ "MALWMRLLPLLALLALWGPDPAAA",
96
+ "MSEQWENCE",
97
+ "MPRTEIN",
98
+ ],
99
+ batch_size=1,
100
+ max_len=1024,
101
+ full_embeddings=False,
102
+ embed_dtype=torch.float32,
103
+ pooling_types=["mean", "cls"],
104
+ num_workers=0,
105
+ save=True,
106
+ save_path="esmplusplus_6b_embeddings.pth",
107
+ )
108
+
109
+ print(embedding_dict["MPRTEIN"].shape)
110
+ ```
111
+
112
+ For residue-level embeddings, set `full_embeddings=True`:
113
+
114
+ ```python
115
+ residue_embeddings = model.embed_dataset(
116
+ sequences=["MALWMRLLPLLALLALWGPDPAAA"],
117
+ batch_size=1,
118
+ max_len=1024,
119
+ full_embeddings=True,
120
+ embed_dtype=torch.float32,
121
+ save=False,
122
+ )
123
+ ```
124
+
125
+ For very large datasets, write embeddings directly to SQLite:
126
+
127
+ ```python
128
+ model.embed_dataset(
129
+ fasta_path="proteins.fasta",
130
+ batch_size=1,
131
+ max_len=1024,
132
+ pooling_types=["mean"],
133
+ sql=True,
134
+ sql_db_path="esmplusplus_6b_embeddings.db",
135
+ save=False,
136
+ )
137
+ ```
138
+
139
+ `embed_dataset` returns a dictionary when `sql=False`. With `sql=True`, embeddings are written to the database and loaded as needed.
140
+
141
+ ## Classification Heads
142
+
143
+ ESM++ supports sequence-level and token-level classification through the standard Transformers auto classes.
144
+
145
+ ```python
146
+ import torch
147
+ from transformers import AutoModelForSequenceClassification
148
+
149
+ model = AutoModelForSequenceClassification.from_pretrained(
150
+ "Synthyra/ESMplusplus_6B",
151
+ num_labels=2,
152
+ trust_remote_code=True,
153
+ dtype=torch.bfloat16,
154
+ device_map="auto",
155
+ )
156
+
157
+ tokenized = model.tokenizer(
158
+ ["MPRTEIN", "MSEQWENCE"],
159
+ padding=True,
160
+ return_tensors="pt",
161
+ ).to(model.device)
162
+
163
+ with torch.no_grad():
164
+ logits = model(**tokenized).logits
165
+
166
+ print(logits.shape)
167
+ ```
168
+
169
+ ## LoRA Fine-Tuning
170
+
171
+ ```python
172
+ from peft import LoraConfig, get_peft_model
173
+ from transformers import AutoModelForSequenceClassification
174
+
175
+ model = AutoModelForSequenceClassification.from_pretrained(
176
+ "Synthyra/ESMplusplus_6B",
177
+ num_labels=2,
178
+ trust_remote_code=True,
179
+ dtype=torch.bfloat16,
180
+ device_map="auto",
181
+ )
182
+
183
+ lora_config = LoraConfig(
184
+ r=8,
185
+ lora_alpha=16,
186
+ lora_dropout=0.01,
187
+ bias="none",
188
+ target_modules=[
189
+ "layernorm_qkv.1",
190
+ "out_proj",
191
+ "query",
192
+ "key",
193
+ "value",
194
+ "dense",
195
+ ],
196
+ )
197
+
198
+ model = get_peft_model(model, lora_config)
199
+ ```
200
+
201
+ ## Attention Maps
202
+
203
+ Optimized attention backends do not return attention maps directly. ESM++ can compute them manually with `output_attentions=True`, but this is much slower and memory-heavy for the 6B model.
204
+
205
+ ```python
206
+ with torch.no_grad():
207
+ output = model(**inputs, output_attentions=True)
208
+
209
+ attentions = output.attentions
210
+ print(len(attentions))
211
+ print(attentions[0].shape)
212
+ ```
213
+
214
+ ## Load Biohub Source Weights
215
+
216
+ You can also load the Biohub source weights directly through FastPLMs:
217
+
218
+ ```python
219
+ from fastplms.esm_plusplus.modeling_esm_plusplus import ESMplusplusForMaskedLM
220
+
221
+ model = ESMplusplusForMaskedLM.from_pretrained_esm("esmc-6b")
222
+ ```
223
+
224
+ The source repository is [`biohub/ESMC-6B`](https://huggingface.co/biohub/ESMC-6B).
225
+ The Biohub ESM license is available at https://github.com/Biohub/esm/blob/main/LICENSE.md.
226
+
227
+ ## Citation
228
+
229
+ ```bibtex
230
+ @misc{FastPLMs,
231
+ author={Hallee, Logan and Bichara, David and Gleghorn, Jason P.},
232
+ title={FastPLMs: Fast, efficient, protein language model inference from Hugging Face AutoModel.},
233
+ year={2024},
234
+ url={https://huggingface.co/Synthyra/ESMplusplus_6B},
235
+ DOI={10.57967/hf/3726},
236
+ publisher={Hugging Face}
237
+ }
238
+ ```
239
+
240
+ ```bibtex
241
+ @misc{candido2026language,
242
+ title = {Language Modeling Materializes a World Model of Protein Biology},
243
+ author = {Candido, Salvatore and Hayes, Thomas and Derry, Alexander and Rao, Roshan
244
+ and Lin, Zeming and Verkuil, Robert and Wu, Bryan and Lee, Jin Sub
245
+ and Bruguera, Elise S. and Keval, Jehan A. and Kopylov, Mykhailo
246
+ and Pak, John E. and Wu, Wesley and Thomas, Neil and Mataraso, Samson
247
+ and Hsu, Alvin and Trotman-Grant, Ashton C. and Fatras, Kilian
248
+ and dos Santos Costa, Allan and Badkundri, Rohil and Ak{\i}n, Halil
249
+ and Oktay, Deniz and Deaton, Jonathan and Montabana, Elizabeth
250
+ and Sitwala, Hrishita and Yu, Yue and Wiggert, Marius
251
+ and Carlin, Dylan Alexander and Goering, Anthony W. and Blazejewski, Tomasz
252
+ and Sandora, McCullen and Hla, Michael and Jia, Tina Z.
253
+ and Kloker, Leon H. and Sofroniew, Nicholas J. and Uehara, Masatoshi
254
+ and Pannu, Jassi and Bachas, Sharrol and Liu, Daniel S.
255
+ and Sercu, Tom and Rives, Alexander},
256
+ year = {2026},
257
+ url = {https://biohub.ai/papers/esm_protein.pdf},
258
+ note = {Preprint}
259
+ }
260
+ ```
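
The README in this diff says `embed_dataset` handles FASTA parsing, length sorting, and batching, but does not show how. The following is a minimal standalone sketch of that idea in plain Python. The helper names `parse_fasta` and `length_sorted_batches` are hypothetical and are not part of FastPLMs, and sorting longest-first is an assumption; the actual ordering used by the library is not stated here.

```python
from typing import Dict, Iterator, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}.

    Hypothetical helper for illustration; not the FastPLMs parser.
    """
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; the header is everything after '>'.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may wrap, so append to the current record.
            records[header] += line
    return records


def length_sorted_batches(
    sequences: List[str], batch_size: int, max_len: int
) -> Iterator[List[str]]:
    """Truncate to max_len, sort longest first, and yield fixed-size batches.

    Grouping sequences of similar length keeps padding per batch small.
    Longest-first ordering is an assumption made for this sketch.
    """
    truncated = [seq[:max_len] for seq in sequences]
    ordered = sorted(truncated, key=len, reverse=True)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


fasta = (
    ">insulin_fragment\nMALWMRLLPLLALL\nALWGPDPAAA\n"
    ">toy_a\nMSEQWENCE\n"
    ">toy_b\nMPRTEIN\n"
)
records = parse_fasta(fasta)
batches = list(
    length_sorted_batches(list(records.values()), batch_size=2, max_len=1024)
)
print(batches)
```

With `batch_size=2`, the 24-residue and 9-residue sequences land in the first batch and the 7-residue sequence in the second, so the shortest sequence is never padded up to the longest one.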