Update README.md
README.md
CHANGED

---
library_name: transformers
tags: []
---

# FastESM
FastESM is a Hugging Face compatible plug-in version of ESM2, rewritten with a newer PyTorch attention implementation.

Load any ESM2 model into a FastEsm model to dramatically speed up training and inference without **ANY** cost in performance.

## Use with 🤗 transformers
```python
from transformers import AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification # any of these work

model_dict = {
    'ESM2-8': 'facebook/esm2_t6_8M_UR50D',
    'ESM2-35': 'facebook/esm2_t12_35M_UR50D',
    'ESM2-150': 'facebook/esm2_t30_150M_UR50D',
    'ESM2-650': 'facebook/esm2_t33_650M_UR50D',
    'ESM2-3B': 'facebook/esm2_t36_3B_UR50D',
    'ESM2-15B': 'facebook/esm2_t48_15B_UR50D',
}

model = AutoModelForMaskedLM.from_pretrained(model_dict['ESM2-8'], trust_remote_code=True)
tokenizer = model.tokenizer
```
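
The same loading pattern works for the task-specific heads in the import above; for example, a sequence classification head can be attached in one line. A brief sketch continuing the snippet above, where `num_labels` is only illustrative and not taken from this model card:

```python
# Attach a classification head instead of the MLM head; num_labels is illustrative.
clf_model = AutoModelForSequenceClassification.from_pretrained(
    model_dict['ESM2-8'],
    num_labels=2,
    trust_remote_code=True,
)
```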

Outputting attention maps (or the contact prediction head) is not natively possible with SDPA. You can still pass `output_attentions` to have attention calculated manually and returned.
Various other optimizations also make the base implementation slightly different from the one in transformers.
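
For example, a small sketch that reuses `model` and `tokenizer` from the snippet above and follows the standard 🤗 transformers output convention:

```python
# Attention is computed manually when requested, so this path is slower than plain SDPA.
inputs = tokenizer(['MPRTEIN'], return_tensors='pt')
outputs = model(**inputs, output_attentions=True)
attentions = outputs.attentions  # one (batch, num_heads, seq_len, seq_len) tensor per layer
```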

# FastESM2-650

## A faster half-precision version of ESM2-650 with FlashAttention2 and longer context
To enhance the weights with longer context and better fp16 support, we trained ESM2-650 for 50,000 additional steps with a traditional MLM objective (20% masking) in fp16 mixed precision on [OMGprot50](https://huggingface.co/datasets/tattabio/OMG_prot50) up to a sequence length of **2048**.

## Use with 🤗 transformers

### For working with embeddings
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'Synthyra/FastESM2_650'
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).eval()
tokenizer = model.tokenizer

sequences = ['MPRTEIN', 'MSEQWENCE']
tokenized = tokenizer(sequences, padding=True, return_tensors='pt')
with torch.no_grad():
    embeddings = model(**tokenized).last_hidden_state

print(embeddings.shape) # (2, 11, 1280)
```
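
If you need a single vector per sequence (for example, the pooled hidden states referenced in the probing section below), one common approach is mask-aware mean pooling. A minimal sketch that continues the snippet above and is not part of the original example:

```python
# Average hidden states over real (non-padding) tokens only.
mask = tokenized['attention_mask'].unsqueeze(-1)            # (batch, seq_len, 1)
pooled = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, hidden_size)
print(pooled.shape)  # (2, 1280)
```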

### For working with sequence logits
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_path = 'Synthyra/FastESM2_650'
model = AutoModelForMaskedLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).eval()
tokenizer = model.tokenizer

sequences = ['MPRTEIN', 'MSEQWENCE']
tokenized = tokenizer(sequences, padding=True, return_tensors='pt')
with torch.no_grad():
    logits = model(**tokenized).logits

print(logits.shape) # (2, 11, 33)
```
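
Because this is the masked-language-modeling head, the logits can also be used to predict masked residues directly. A small sketch that reuses `model` and `tokenizer` from the snippet above; the mask handling follows the standard 🤗 tokenizer API:

```python
# Mask one residue and recover the most likely token at that position.
masked_seq = 'MPR<mask>EIN'
inputs = tokenizer(masked_seq, return_tensors='pt')
with torch.no_grad():
    predictions = model(**inputs).logits.argmax(dim=-1)  # (1, seq_len)

mask_positions = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)
print(tokenizer.decode(predictions[mask_positions]))  # predicted residue at the masked position
```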

## Embed entire datasets with no new code
To embed a list of protein sequences **fast**, just call `embed_dataset`. Sequences are sorted to reduce padding tokens, so the initial progress bar estimate is usually much longer than the actual time.

```python
_ = model.embed_dataset(
    ...,  # earlier arguments not shown in this hunk
    sql_db_path='embeddings.db', # path to .db file of choice
)
```
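
For reference, a hedged sketch of what a complete call can look like; the `sequences` keyword is an assumption to verify against the full `embed_dataset` example, and only `sql_db_path` is taken from the snippet above:

```python
sequences = ['MPRTEIN', 'MSEQWENCE']  # your list of protein sequences

_ = model.embed_dataset(
    sequences=sequences,              # assumed keyword name
    sql_db_path='embeddings.db',      # path to .db file of choice
)
```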

## Model probes
We employ linear probing techniques on various PLMs and standard datasets, similar to our previous [paper](https://www.biorxiv.org/content/10.1101/2024.07.30.605924v1), to assess the intrinsic correlation between pooled hidden states and valuable properties. FastESM performs very well.
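
As a rough illustration of what such a probe looks like (a sketch only; the random arrays stand in for mean-pooled FastESM embeddings and real property labels, and the probe settings are not the paper's configuration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(100, 1280).astype(np.float32)  # stand-in: pooled hidden states, one row per sequence
y = np.random.randint(0, 2, size=100)              # stand-in: a binary property label per sequence

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))  # with real data, evaluate on a held-out split instead
```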