ebetica committed on
Commit
fabcd3a
·
verified ·
1 Parent(s): c899ee7

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +117 -108
README.md CHANGED
@@ -1,9 +1,9 @@
1
  ---
2
- license:
3
- - mit
4
  - other
5
  license_link: https://github.com/Biohub/esm/blob/main/THIRD_PARTY_NOTICE.md
6
- language: en
7
  tags:
8
  - biology
9
  - esm
@@ -16,162 +16,169 @@ tags:
16
  - molecular-dynamics
17
  - transformers
18
  ---
 
19
  # Model Card for ESMFold2
20
 
21
  ## Model Details
22
 
23
- ESMFold2 is a state-of-the-art model for protein structure prediction and design that defines a new frontier for speed and accuracy. The model achieves unprecedented success rates for protein binder and antibody generation, validated through biophysical and functional characterization.
24
 
25
- The model predicts high-resolution, all-atom 3D protein structures directly from amino acid sequences, with optional multiple sequence alignment (MSA) input for enhanced accuracy on challenging targets. The model outputs comprehensive structural information including all-atom coordinates (backbone and side chains), confidence metrics (pLDDT, pAE, pTM, iPTM), and optional distogram predictions for detailed analysis of predicted structures. Unlike ESMFold, ESMCFold is able to predict structures for all biomolecules, including small molecules, DNA, RNA, and modified amino acids.
26
 
27
- ESMFold2 is capable of either single-sequence or MSA conditioned structure prediction for improved accuracy on difficult targets.
28
 
29
- To run this model with the Biohub Platform API, visit the [Biohub Platform](https://biohub.ai/).
30
 
31
- Read more about ESMFold2 in our paper [here](https://biohub.ai/papers/esmc.pdf).
32
 
33
- ### System Requirements
 
 
 
34
 
35
- * Compute Requirements: GPU
36
- * PyTorch environment with GPU support recommended.
37
 
38
- ### Usage
39
 
40
- Coming soon
41
 
42
- ### Citation
43
 
44
- Coming soon.
45
 
46
- ### Model Architecture
47
 
48
- The architecture consists of an input embedder, which includes the ESMC language model embeddings, a pairwise folding block trunk, an atom diffusion module, an optional MSA encoder, and a confidence head.
 
 
49
 
50
- This hybrid architecture combines the representational power of large-scale protein language models with and efficient structure prediction modules listed in detail below:
51
 
52
- **1\. Language Model Inputs Embedder:**
 
 
 
 
 
 
53
 
54
- - Uses a frozen ESMC 6B parameter language model to generate rich single-sequence embeddings.
55
-
56
- - Provides evolutionary and functional context encoded in high-dimensional representations.
57
-
58
- - Embeddings capture long-range dependencies and sequence patterns learned from billions of protein sequences.
59
 
60
- **2\. Pairformer Trunk:**
 
 
61
 
62
- - Processes language model embeddings through a pairformer architecture that refines pairwise and single representations
63
-
64
- - Architectural improvements: Triangle attention was removed as it did not provide significant benefits, and the single state was removed to improve generalization
65
-
66
- - Evolves pairwise representations through iterative updates, enabling the model to capture inter-residue relationships and spatial constraints
67
 
68
- **3\. Diffusion-Based Structure Module:**
 
 
69
 
70
- - Employs a diffusion process to generate all-atom 3D coordinates (backbone and sidechains)
71
-
72
- - Uses an ODE (Ordinary Differential Equation) solver for efficient sampling
73
-
74
- - Optimized sampling schedule: 32 steps for sidechain prediction and 8 steps for backbone prediction provide optimal performance without additional benefit from more steps
75
-
76
- - This efficient sampling strategy significantly reduces computational cost while maintaining accuracy
77
 
78
- **4\. Optional MSA Encoder (ESMFold2 MSA variant):**
 
79
 
80
- - Processes multiple sequence alignments to extract evolutionary information
81
-
82
- - Distills MSA information into pairwise representations that enhance structure prediction
83
-
84
- - Particularly beneficial for difficult targets with limited sequence information
85
 
86
- **5\. Confidence Head:**
 
87
 
88
- - Additional 4-layer pairformer module that estimates prediction confidence
89
-
90
- - Outputs multiple confidence metrics: pLDDT (per-residue), pAE (pairwise aligned error), pTM (template modeling score), and iPTM (interface template modeling score for complexes)
91
 
92
- **6\. Auxiliary Losses:**
93
 
94
- - A trunk distogram head, a smooth-LDDT loss on the denoised sample, and a polymer-ligand bond loss supplement the diffusion MSE and confidence losses.
 
 
95
 
96
- **Model Variants:**
97
 
98
- | Model | MSA Conditioning | Description | Data Cutoff |
99
- | :---- | :---- | :---- | :---- |
100
- | [esmfold2-fast](https://huggingface.co/biohub/esmfold2-fast) | No | Inference optimized single-sequence structure prediction model | June 2025, older cutoff is Sept 2021 |
101
- | [esmfold2](https://huggingface.co/biohub/esmfold2) | Yes | Large model, capable of either single-sequence or MSA conditioned structure prediction for improved accuracy on difficult targets | June 2025, older cutoff is Sept 2021 |
102
 
103
- **Confidence Head (optional, for inference):**
104
 
105
- - 4 pairformer layers
106
- - Outputs: pLDDT, pAE, pTM, ipTM
 
 
 
107
 
108
- ## Intended Use
109
 
110
- ### Primary Use Cases
 
 
111
 
112
- ESMFold2 is designed to handle a wide range of structural prediction tasks, including:
113
 
114
- #### **Complex biomolecular interactions**
115
 
116
- - **Complex structure folding**: Predict structures of DNA, RNA, and small molecule ligands, including protein–protein interactions and protein-nucleic acid complexes.
 
117
 
118
- - **Single-sequence folding**: Predict 3D structures from amino acid sequences without requiring multiple sequence alignments
 
 
119
 
 
 
 
 
 
 
 
 
120
 
121
- #### **Confidence Estimation and Quality Assessment**
 
 
122
 
123
- - **Per-residue confidence (pLDDT)**: Local confidence scores (0-100) indicating prediction reliability at each residue position
124
-
125
- - **Predicted aligned error (pAE)**: Pairwise error estimates for assessing inter-residue distance accuracy
126
-
127
- - **Template modeling score (pTM)**: Global confidence metric (0-1) for overall structure topology and domain packing
128
-
129
- - **Interface confidence (iPTM)**: Specialized confidence scores for multimeric complex interfaces
130
 
131
- #### **Computational Biology Research**
 
 
132
 
133
- - **Evolutionary analysis**: Study structural conservation and divergence across protein families
134
-
135
- - **Disease variant analysis**: Predict structural impacts of mutations and genetic variants
136
-
137
- - **Drug discovery**: Support structure-based drug design through accurate binding site prediction
138
 
139
- ## Training Details
140
 
141
- The model was trained on sequences from [PDB](https://www.rcsb.org/) and [AlphaFold DB](https://alphafold.ebi.ac.uk/). Refer to the supplement in the [paper](https://biohub.ai/esmc) for details on how data is processed and sampled.
142
 
143
- ## Performance Metrics
144
 
145
- ESMfold2 was evaluated against state-of-the-art single-sequence and MSA-based structure prediction models on the FoldBench benchmark. ESMFold2 meets or exceeds performance by AlphaFold3 on antibody-antigen complex prediction, protein-protein complex prediction and [Runs N’ Poses](https://www.biorxiv.org/content/10.1101/2025.02.03.636309v1) benchmarks. For inference time-scaling on FoldBench, ESMCFold scales gracefully with sample count, comparable to or exceeding AlphaFold3’s scaling behavior on the same targets.
146
 
147
- Refer to the [paper](https://biohub.ai/papers/esmc.pdf) for details on additional performance metrics.
148
 
149
- ## Biases, Risks, and Limitations
150
 
151
  ### Potential Biases
152
 
153
  - The model may reflect biases present in the training data.
154
-
155
- ### Risks
156
-
157
- - Potential misuse: Protein structure predictions may be used for malicious purposes such as designing pathogens or toxins.
158
 
159
  ### Limitations
160
 
161
- - **Static structure prediction**: The model predicts single static conformations and is not designed for modeling protein dynamics, conformational flexibility, or multiple conformations of the same protein. It does not capture functional motions, allosteric transitions, or time-dependent structural changes.
162
-
163
- - **Sequence length constraints**: Maximum sequence length is approximately 1600 residues per chain, with total complex size limited by available GPU memory. Very large complexes may require specialized hardware or memory optimization strategies.
164
-
165
- - **Training data biases**: The model may reflect biases present in the training data (PDB, AFDB, SAbDab), including over-representation of certain protein families, experimental conditions, or structural classes. Performance may vary for underrepresented protein types.
166
-
167
- - **Novel folds and rare structures**: While the model generalizes well, extremely novel folds or rare structural motifs not well-represented in training data may have reduced accuracy.
168
-
169
- - **Post-translational modifications**: The model provides limited support for non-standard amino acids and post-translational modifications. Performance may be reduced for heavily modified proteins.
170
-
171
- - **Membrane proteins**: While the model can predict membrane protein structures, specialized membrane protein prediction tools may provide better results for transmembrane domains and membrane-embedded regions.
172
-
173
- - **Disordered regions**: The model may struggle with intrinsically disordered regions, which lack well-defined structure. Low pLDDT scores (\<50) often indicate disordered or flexible regions.
174
-
175
  - **Experimental validation required**: All predictions should be considered hypotheses requiring experimental validation. The model cannot replace experimental structure determination methods (X-ray crystallography, cryo-EM, NMR) for definitive structural characterization.
176
 
177
  ### Out-of-Scope or Unauthorized Use Cases
@@ -182,16 +189,18 @@ Do not use the model for the following purposes:
182
 
183
  ### Caveats and Recommendations
184
 
185
- - Always review and validate outputs generated by the model.
186
-
187
- - Treat model outputs as machine-generated hypotheses that require further experimental validation, not as established biological facts.
188
-
189
- - We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](https://biohub.org/acceptable-use-policy/) when using the model.
190
 
191
  Should you have any security or privacy issues or questions related to this model, please reach out to our team at [support@biohub.org](mailto:support@biohub.org).
192
 
 
 
 
 
193
  ## Acknowledgements
194
 
195
  Many people on the Biohub AI Research team and prior EvolutionaryScale team contributed to the development of this model. It would not have been possible without them.
196
 
197
-
 
1
  ---
2
+ license:
3
+ - mit
4
  - other
5
  license_link: https://github.com/Biohub/esm/blob/main/THIRD_PARTY_NOTICE.md
6
+ language: en
7
  tags:
8
  - biology
9
  - esm
 
16
  - molecular-dynamics
17
  - transformers
18
  ---
19
+
20
  # Model Card for ESMFold2
21
 
22
  ## Model Details
23
 
24
+ ESMFold2 is a state-of-the-art model for protein structure prediction and design that defines a new frontier for speed and accuracy. The model predicts high-resolution, all-atom 3D protein structures directly from amino acid sequences, with optional multiple sequence alignment (MSA) input for enhanced accuracy on challenging targets. The model outputs comprehensive structural information including all-atom coordinates (backbone and side chains), confidence metrics (pLDDT, pAE, pTM, iPTM), and optional distogram predictions for detailed analysis of predicted structures. Unlike ESMFold, ESMFold2 is able to predict structures for all biomolecules, including small molecules, DNA, RNA, and modified amino acids.
25
 
26
+ ESMFold2 supports both single-sequence and MSA-conditioned structure prediction; MSA conditioning improves accuracy on difficult targets. The ESMFold2-Fast variant is an inference-optimized single-sequence structure prediction model and is not MSA-conditioned.
27
 
28
+ To run this model with the Biohub Platform API, visit the [Biohub Platform](https://biohub.ai/).
29
 
30
+ Read more about ESMFold2 in our paper [here](https://biohub.ai/papers/esm_protein.pdf).
31
 
32
+ ## Model Variants
33
 
34
+ | Model | MSA Conditioning | Description | Data Cutoff |
35
+ | :---- | :---- | :---- | :---- |
36
+ | [esmfold2](https://huggingface.co/biohub/esmfold2) | Yes | Large model, capable of either single-sequence or MSA conditioned structure prediction for improved accuracy on difficult targets | Sept 2021 |
37
+ | [esmfold2-fast](https://huggingface.co/biohub/esmfold2-fast) | No | Inference optimized single-sequence structure prediction model | Sept 2021 |
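
As a rough sketch of how the table above might be used in code, the helper below maps the question "is an MSA available?" to one of the two listed repositories. The repository ids are taken from the table; the function name `pick_variant` and the `prefer_speed` flag are hypothetical and are not part of the `esm` package.

```py
# Illustrative only: repo ids come from the Model Variants table,
# everything else (names, selection rule) is an assumption.
VARIANTS = {
    "esmfold2": {"repo": "biohub/esmfold2", "msa_conditioning": True},
    "esmfold2-fast": {"repo": "biohub/esmfold2-fast", "msa_conditioning": False},
}


def pick_variant(have_msa: bool, prefer_speed: bool = False) -> str:
    """Return a repository id from the table for the given situation."""
    if have_msa:
        # Only esmfold2 is listed as MSA-conditioned.
        return VARIANTS["esmfold2"]["repo"]
    # Without an MSA both variants apply; the fast one trades size for speed.
    name = "esmfold2-fast" if prefer_speed else "esmfold2"
    return VARIANTS[name]["repo"]


print(pick_variant(have_msa=False, prefer_speed=True))
```

The returned string could then be passed to `from_pretrained`, assuming the checkpoint is published under that id.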
38
 
39
+ ## Performance Metrics
 
40
 
41
+ ESMFold2 was evaluated against state-of-the-art single-sequence and MSA-based structure prediction models on the FoldBench benchmark. ESMFold2 matches or exceeds AlphaFold3 on antibody-antigen complex prediction, protein-protein complex prediction, and the [Runs N' Poses](https://www.biorxiv.org/content/10.1101/2025.02.03.636309v1) benchmark. Scaling inference-time compute can dramatically improve ESMFold2's performance, especially on antibody-antigen complexes.
42
 
43
+ ![][image1]
44
 
45
+ Refer to the [paper](https://biohub.ai/papers/esmc.pdf) for details on additional performance metrics.
46
 
47
+ ### Usage
48
 
49
+ Install `esm` from PyPI:
50
 
51
+ ```
52
+ pip install esm
53
+ ```
54
 
55
+ You can fold your first protein with:
56
 
57
+ ```py
58
+ from esm.models.esmfold2 import (
59
+     ESMFold2InputBuilder,
60
+     ProteinInput,
61
+     StructurePredictionInput,
62
+ )
63
+ from transformers.models.esmfold2.modeling_esmfold2 import ESMFold2Model
64
 
65
+ # Ubiquitin (PDB 1UBQ)
66
+ sequence = (
67
+     "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
68
+ )
 
69
 
70
+ model = ESMFold2Model.from_pretrained("biohub/ESMFold2").cuda().eval()
71
+ processor = ESMFold2InputBuilder()
72
+ spi = StructurePredictionInput(sequences=[ProteinInput(id="A", sequence=sequence)])
73
 
74
+ result = processor.fold(
75
+     model, spi, num_loops=3, num_sampling_steps=50, num_diffusion_samples=1, seed=0
76
+ )
 
 
77
 
78
+ print(f"pLDDT mean: {float(result.plddt.mean()):.3f}")
79
+ print(f"pTM: {float(result.ptm):.3f}")
80
+ ```
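
Before handing a sequence to the model, it can help to normalize and sanity-check it. The sketch below is a minimal, standalone check for the single-sequence case; `clean_sequence` is a hypothetical helper, not part of `esm`, and it assumes only the 20 standard one-letter amino acid codes are wanted (modified residues would need a different path).

```py
# Hypothetical pre-flight check; not the library's own validation.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Strip whitespace, upper-case, and reject non-standard residue codes."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("empty sequence")
    bad = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if bad:
        raise ValueError(f"non-standard residue codes: {bad}")
    return seq


# Ubiquitin (PDB 1UBQ), pasted with line breaks as it might appear in a FASTA file.
ubiquitin = clean_sequence("""
    MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDG
    RTLSDYNIQKESTLHLVLRLRGG
""")
print(len(ubiquitin))
```

Failing early on a malformed string is cheaper than discovering the problem after the model has been loaded onto the GPU.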
81
 
82
+ You can also use the model directly through Hugging Face `transformers`:
 
 
 
 
 
 
83
 
84
+ ```py
85
+ from transformers.models.esmfold2.modeling_esmfold2 import ESMFold2Model
86
 
87
+ # Ubiquitin (PDB 1UBQ)
88
+ sequence = (
89
+     "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
90
+ )
 
91
 
92
+ model = ESMFold2Model.from_pretrained("biohub/ESMFold2").cuda().eval()
93
+ output = model.infer_protein(sequence, num_loops=3, num_sampling_steps=50)
94
 
95
+ print(f"pLDDT mean: {float(output['plddt'].mean()):.3f}")
96
+ print(f"pTM: {float(output['ptm'].mean()):.3f}")
97
+ ```
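
The snippets above print only the mean pLDDT. A per-residue view is often more useful, so here is a small sketch that flags low-confidence positions. `summarize_plddt` is a hypothetical helper, the scores are made-up numbers, and the cutoff is an assumption: this card does not state whether `plddt` is returned on a 0-1 or 0-100 scale, so choose the cutoff to match (for example 0.5 or 50).

```py
from statistics import mean


def summarize_plddt(scores, low_cutoff):
    """Summarize per-residue confidence; scores and cutoff must share one scale."""
    scores = [float(s) for s in scores]
    if not scores:
        raise ValueError("no scores given")
    # 0-based positions whose confidence falls below the cutoff.
    low = [i for i, s in enumerate(scores) if s < low_cutoff]
    return {
        "mean": mean(scores),
        "low_fraction": len(low) / len(scores),
        "low_positions": low,
    }


# Made-up values on an assumed 0-1 scale, for illustration only.
summary = summarize_plddt([0.91, 0.88, 0.42, 0.35, 0.79], low_cutoff=0.5)
print(summary["low_positions"])
```

If the model output is a tensor, something like `output["plddt"].flatten().tolist()` would likely be needed first; that conversion is also an assumption about the output type.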
98
 
99
+ And the Biohub API:
100
 
101
+ ```py
102
+ # TODO: minimal Biohub API code snippet
103
+ ```
104
 
105
+ First install the `esm` Python package.
106
 
107
+ ```
108
+ pip install esm
109
+ ```
 
110
 
111
+ Import the necessary libraries.
112
 
113
+ ```py
114
+ from esm.sdk.forge import SequenceStructureForgeInferenceClient
115
+ from esm.sdk import client
116
+ from esm.sdk.api import ESMProtein, ESMProteinError, LogitsConfig, LogitsOutput
117
+ ```
118
 
119
+ Generate an [API key](https://biohub.ai/developer-console/api-keys) and add it to your Biohub account. The API key manages your access to credits and tokens; the terms "API key" and "token" are often used interchangeably in the documentation. Create the inference client with the model of your choice, replacing `<your API token>` with your own token.
120
 
121
+ ```py
122
+ client = SequenceStructureForgeInferenceClient(model="esmfold2-fast-2026-05", url="https://biohub.ai", token="<your API token>")
123
+ ```
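
Rather than pasting the token into source code, one option is to read it from an environment variable. The sketch below is an assumption-laden convenience, not part of the `esm` SDK: `read_api_token` is a hypothetical helper, and `ESM_API_KEY` is simply the variable name used elsewhere in this README.

```py
import os


def read_api_token(env_var: str = "ESM_API_KEY") -> str:
    """Fetch the API token from the environment, failing with a clear message."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {env_var} to your Biohub API token before creating a client."
        )
    return token
```

The result can be passed as the `token=` argument when constructing the client, which keeps the secret out of version control.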
124
 
125
+
126
 
127
+ The Hugging Face implementation directly supports proteins only. For complex biomolecules, we recommend using the internal API. Here's an example of folding ubiquitin together with an ATP ligand using ESMFold2:
128
 
129
+ ```py
130
+ import os
131
 
132
+ from esm.models.esmfold2 import LigandInput, ProteinInput, StructurePredictionInput
133
+ from esm.sdk import esmfold2_client
134
+ from esm.sdk.api import FoldingConfig
135
 
136
+ # Ubiquitin (PDB 1UBQ) + ATP cofactor (illustrative pairing).
137
+ protein = ProteinInput(
138
+     id="A",
139
+     sequence=(
140
+         "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
141
+     ),
142
+ )
143
+ ligand = LigandInput(id="L", ccd=["ATP"])
144
 
145
+ # TODO: replace ``esmc-fold-flash-2604`` with the public ESMFold2 model name
146
+ # once the Biohub Platform inference server announces it.
147
+ client = esmfold2_client(model="esmc-fold-flash-2604", token=os.environ["ESM_API_KEY"])
148
 
149
+ spi = StructurePredictionInput(sequences=[protein, ligand])
150
+ result = client.fold_all_atom(
151
+     spi, config=FoldingConfig(num_loops=3, num_sampling_steps=50)
152
+ )
 
 
 
153
 
154
+ print(f"pLDDT mean: {float(result.plddt.mean()):.3f}")
155
+ print(f"pTM: {float(result.ptm):.3f}")
156
+ ```
157
 
158
+ ## Training Data
 
 
 
 
159
 
160
+ ESMFold2 was trained on sequences from the Protein Data Bank (PDB) and the AlphaFold DB (AFDB).
161
 
162
+ ## Frontier Safety
163
 
164
+ Biohub has established a safety team to assess the benefits and potential risks of our models and tools prior to release, and to develop mitigations where necessary. A risk assessment was conducted for ESMFold2 prior to release. Further details are available in the appendix of our corresponding paper.
165
 
166
+ Informed by our risk assessments, we are releasing the source code and model weights for ESMFold2.
167
 
168
+ [Biohub.ai](http://Biohub.ai) Platform: We implement guardrails that detect and restrict the use of keywords and sequences corresponding to controlled pathogens and toxins on our freely accessible platform. For further details regarding these guardrails, please refer to our Biohub platform Resources page.
169
 
170
+ ## Biases and Limitations
171
 
172
  ### Potential Biases
173
 
174
  - The model may reflect biases present in the training data.
175
+
 
 
 
176
 
177
  ### Limitations
178
 
179
+ - **Dataset biases**: The model may reflect biases present in the training data (PDB, AFDB), including over-representation of certain protein families, experimental conditions, or structural classes. Performance may vary for underrepresented protein types.
180
+ - **Dataset limitations**: The PDB historically lacks comprehensive data on protein conformations, post-translational modifications, disordered regions, etc. As with all other structure prediction models trained on the PDB, performance may degrade on other biomolecules.
181
+ - **Computational demand**: The highest-accuracy structure predictions require scaling inference-time compute. Predictions made with reduced inference parameters may be less accurate.
 
 
 
 
 
 
 
 
 
 
 
182
  - **Experimental validation required**: All predictions should be considered hypotheses requiring experimental validation. The model cannot replace experimental structure determination methods (X-ray crystallography, cryo-EM, NMR) for definitive structural characterization.
183
 
184
  ### Out-of-Scope or Unauthorized Use Cases
 
189
 
190
  ### Caveats and Recommendations
191
 
192
+ - Always review and validate outputs generated by the model.
193
+ - Treat model outputs as machine-generated hypotheses that require further experimental validation, not as established biological facts.
194
+ - We are committed to advancing the responsible development and use of artificial intelligence.
 
 
195
 
196
  Should you have any security or privacy issues or questions related to this model, please reach out to our team at [support@biohub.org](mailto:support@biohub.org).
197
 
198
+ ### Citation
199
+
200
+ Coming soon.
201
+
202
  ## Acknowledgements
203
 
204
  Many people on the Biohub AI Research team and prior EvolutionaryScale team contributed to the development of this model. It would not have been possible without them.
205
 
206
+ [image1]: folding_evals.png