lhallee committed (verified) · Commit 535fb49 · Parent(s): 157a612

Upload README.md with huggingface_hub

Files changed (1): README.md (+174 -174)

README.md
---
library_name: transformers
tags: []
---

# NOTE
The GitHub repository with the implementation and `requirements.txt` can be found [here](https://github.com/Synthyra/FastPLMs.git).

# Profluent-E1
[Synthyra's version of Profluent-E1](https://github.com/Synthyra/Profluent-E1-300M) is a faithful implementation of Profluent's [E1](https://www.profluent.bio/showcase/e1) models ([license](https://github.com/Profluent-AI/E1/tree/main?tab=License-1-ov-file)) that integrates Hugging Face AutoModel compatibility and convenient embedding functionality.

## Attention backends

`sdpa` (PyTorch scaled dot-product attention) is the default. The backend is set via `config.attn_backend` before loading.

| Backend | Key | Notes |
| :--- | :--- | :--- |
| PyTorch SDPA | `"sdpa"` | Default. Exact numerics, stable on all hardware. |
| Flash Attention | `"kernels_flash"` | Fastest on Ampere/Hopper GPUs. Requires `pip install kernels` (pre-built, so no hours-long compilation). Outputs are not bitwise identical to SDPA due to online softmax reordering; differences are often small but not guaranteed to be inconsequential, so use `"sdpa"` if exact numerics matter. |
| Flex Attention | `"flex"` | Uses a block-causal mask that skips padding tokens. Near-exact numerics. First use compiles a Triton kernel (30–120 s). Best combined with `torch.compile`. |
| Auto | `"auto"` | Picks the best available backend: `kernels_flash` → `flex` → `sdpa`. |

```python
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("Synthyra/Profluent-E1-150M", trust_remote_code=True)
config.attn_backend = "flex"  # or "kernels_flash", "sdpa", "auto"
model = AutoModelForMaskedLM.from_pretrained("Synthyra/Profluent-E1-150M", config=config, trust_remote_code=True)
```

`torch.compile(model)` is strongly recommended for sustained throughput, especially with Flex Attention.
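
The `auto` fallback order above can be pictured as a simple capability check. This is a hedged sketch of plausible selection logic, not the package's actual code; `pick_backend` is a hypothetical helper:

```python
import importlib.util

import torch

def pick_backend() -> str:
    """Sketch of the 'auto' fallback order: kernels_flash -> flex -> sdpa."""
    # Flash attention via the `kernels` package needs the package plus a CUDA GPU.
    if importlib.util.find_spec("kernels") is not None and torch.cuda.is_available():
        return "kernels_flash"
    # Flex attention ships with recent PyTorch releases.
    try:
        from torch.nn.attention.flex_attention import flex_attention  # noqa: F401
        return "flex"
    except ImportError:
        pass
    # SDPA is always available.
    return "sdpa"

print(pick_backend())
```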

## Use with 🤗 transformers
### Supported models
```python
model_dict = {
    # Synthyra/Profluent-E1-150M
    'Profluent-E1-150M': 'Profluent-Bio/E1-150m',
    # Synthyra/Profluent-E1-300M
    'Profluent-E1-300M': 'Profluent-Bio/E1-300m',
    # Synthyra/Profluent-E1-600M
    'Profluent-E1-600M': 'Profluent-Bio/E1-600m',
}
```

```python
import torch
from transformers import AutoModelForMaskedLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForMaskedLM.from_pretrained('Synthyra/Profluent-E1-150M', trust_remote_code=True, dtype=torch.bfloat16).eval().to(device)

sequences = ['MPRTEIN', 'MSEQWENCE']
batch = model.prep_tokens.get_batch_kwargs(sequences, device=device)

output = model(**batch)  # pass output_hidden_states=True to also get all hidden states
print(output.logits.shape)             # language modeling logits, (batch_size, seq_len, vocab_size), (2, 11, 34)
print(output.last_hidden_state.shape)  # last hidden state of the model, (batch_size, seq_len, hidden_size), (2, 11, 768)
print(output.loss)                     # language modeling loss if you passed labels
#print(output.hidden_states)           # tuple of all hidden states if you passed output_hidden_states=True
#print(output.attentions)              # tuple of all attention matrices if you passed output_attentions=True
```
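
The language-modeling logits can be turned into per-residue token probabilities with a softmax. A minimal sketch on dummy logits with the shapes from the example above, `(2, 11, 34)`; nothing here calls the model:

```python
import torch

# Dummy logits standing in for output.logits: (batch_size, seq_len, vocab_size)
logits = torch.randn(2, 11, 34)

probs = torch.softmax(logits, dim=-1)  # per-residue distribution over the vocabulary
top_ids = probs.argmax(dim=-1)         # most likely token id at each position

print(probs.shape)    # torch.Size([2, 11, 34])
print(top_ids.shape)  # torch.Size([2, 11])
print(torch.allclose(probs.sum(dim=-1), torch.ones(2, 11)))  # True: rows sum to 1
```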

Our E1 implementation also supports sequence- and token-level classification tasks, like ESM2. Simply pass the number of labels during initialization.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoModelForTokenClassification

model = AutoModelForSequenceClassification.from_pretrained('Synthyra/Profluent-E1-150M', num_labels=2, trust_remote_code=True)
labels = torch.tensor([0, 1])  # one label per sequence
logits = model(**batch, labels=labels).logits
print(logits.shape) # (batch_size, num_labels), (2, 2)
```
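
For token-level tasks, logits come out per residue, `(batch_size, seq_len, num_labels)`, and the loss is a cross-entropy that skips ignored positions. A self-contained sketch with dummy tensors; the `-100` ignore index is the standard transformers convention, assumed to apply here:

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, num_labels = 2, 11, 2
token_logits = torch.randn(batch_size, seq_len, num_labels)

# Per-residue labels; -100 marks positions (e.g. padding) excluded from the loss.
token_labels = torch.randint(0, num_labels, (batch_size, seq_len))
token_labels[1, 9:] = -100  # pretend the second sequence is padded at the end

loss = F.cross_entropy(
    token_logits.view(-1, num_labels),  # flatten to (batch * seq_len, num_labels)
    token_labels.view(-1),              # flatten to (batch * seq_len,)
    ignore_index=-100,
)
print(loss)  # scalar loss over the non-ignored positions
```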

E1 weights were trained in bf16 and load in bf16 by default. You can choose another precision via the `dtype` parameter:
```python
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained('Synthyra/Profluent-E1-150M', trust_remote_code=True, dtype=torch.float) # fp32
```

## Embed entire datasets with no new code
To embed a list of protein sequences **fast**, just call `embed_dataset`. Sequences are sorted to reduce padding tokens, so the initial progress-bar estimate is usually much longer than the actual time it will take.

Example:
```python
embedding_dict = model.embed_dataset(
    sequences=[
        'MALWMRLLPLLALLALWGPDPAAA', ... # list of protein sequences
    ],
    batch_size=2, # adjust for your GPU memory
    max_len=512, # adjust for your needs
    full_embeddings=False, # if True, no pooling is performed
    embed_dtype=torch.float32, # cast to what dtype you want
    pooling_types=['mean', 'cls'], # more than one pooling type will be concatenated together
    sql=False, # if True, embeddings will be stored in SQLite database
    sql_db_path='embeddings.db',
    save=True, # if True, embeddings will be saved as a .pth file
    save_path='embeddings.pth',
)
# embedding_dict maps sequences to their embeddings: tensors for .pth, numpy arrays for sql
```
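
Concatenated `['mean', 'cls']` pooling doubles the embedding width. The exact pooling lives inside `embed_dataset`, so treat this as an illustrative assumption: a padding-aware mean plus the first-token embedding, concatenated.

```python
import torch

hidden = torch.randn(2, 11, 768)  # (batch_size, seq_len, hidden_size)
mask = torch.ones(2, 11)
mask[1, 9:] = 0                   # second sequence padded at the end

# Mean over real tokens only (padding positions zeroed out of the sum)
summed = (hidden * mask.unsqueeze(-1)).sum(dim=1)
mean_pool = summed / mask.sum(dim=1, keepdim=True)

cls_pool = hidden[:, 0]           # first-token ("cls") embedding

pooled = torch.cat([mean_pool, cls_pool], dim=-1)
print(pooled.shape)  # torch.Size([2, 1536])
```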

```
model.embed_dataset()
Args:
    sequences: List of protein sequences
    batch_size: Batch size for processing
    max_len: Maximum sequence length
    full_embeddings: Whether to return full residue-wise (True) embeddings or pooled (False)
    pooling_types: List of pooling types ('mean', 'cls'); multiple types are concatenated
    sql: Whether to store embeddings in SQLite database - will be stored in float32
    sql_db_path: Path to SQLite database

Returns:
    Dictionary mapping sequences to embeddings, or None if sql=True

Note:
    - If sql=True, embeddings can only be stored in float32
    - sql is ideal if you need to stream a very large dataset for training in real-time
    - save=True is ideal if you can store the entire embedding dictionary in RAM
    - If both sql=True and save=True, sql takes precedence
    - If your sql database or .pth file already exists, it is scanned first for already embedded sequences
    - Sequences are truncated to max_len and sorted by length in descending order for faster processing
```
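
Reading saved embeddings back is plain `torch.load` for the `.pth` path. A roundtrip sketch with a toy dictionary (the SQLite schema is internal to `embed_dataset`, so only the `.pth` path is shown):

```python
import os
import tempfile

import torch

# Toy stand-in for what embed_dataset(save=True) writes: sequence -> tensor
embedding_dict = {"MPRTEIN": torch.randn(1536), "MSEQWENCE": torch.randn(1536)}

path = os.path.join(tempfile.mkdtemp(), "embeddings.pth")
torch.save(embedding_dict, path)

loaded = torch.load(path)
print(sorted(loaded.keys()))    # ['MPRTEIN', 'MSEQWENCE']
print(loaded["MPRTEIN"].shape)  # torch.Size([1536])
```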

## Fine-tuning with 🤗 peft
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('Synthyra/Profluent-E1-150M', num_labels=2, trust_remote_code=True)
# these modules handle E1 attention layers
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

lora_config = LoraConfig(
    r=8, # choose lora parameters to your liking
    lora_alpha=16,
    lora_dropout=0.01,
    bias="none",
    target_modules=target_modules,
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Unfreeze the classifier head
for param in model.classifier.parameters():
    param.requires_grad = True
```
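
After wrapping with LoRA, it's worth confirming that only the adapters and the classifier head remain trainable. A generic counter that works on any `nn.Module`, demonstrated on a toy model since the real one requires a download:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Toy stand-in: freeze the first layer, leave the "head" trainable.
toy = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
for p in toy[0].parameters():
    p.requires_grad = False

trainable, total = count_parameters(toy)
print(trainable, total)  # 18 90
```

On the real PEFT-wrapped model, `model.print_trainable_parameters()` reports the same breakdown.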

For a more thorough example of fine-tuning, check out our example script [here](https://github.com/Synthyra/FastPLMs/blob/main/fine_tuning_example.py).


### Citation
If you use any of this implementation or work, please cite the following DOI and Profluent's paper.

```
@misc{FastPLMs,
    author = {Hallee, Logan and Bichara, David and Gleghorn, Jason P.},
    title = {FastPLMs: Fast, efficient, protein language model inference from Huggingface AutoModel.},
    year = {2024},
    url = {https://huggingface.co/Synthyra/ESMplusplus_small},
    doi = {10.57967/hf/3726},
    publisher = {Hugging Face}
}
```

```
@article{Jain_Beazer_Ruffolo_Bhatnagar_Madani_2025,
    title = {E1: Retrieval-Augmented Protein Encoder Models},
    url = {https://www.biorxiv.org/content/early/2025/11/13/2025.11.12.688125},
    doi = {10.1101/2025.11.12.688125},
    journal = {bioRxiv},
    publisher = {Cold Spring Harbor Laboratory},
    author = {Jain, Sarthak and Beazer, Joel and Ruffolo, Jeffrey A and Bhatnagar, Aadyot and Madani, Ali},
    year = {2025}
}
```