dmccarthy committed on
Commit
c147fc8
·
verified ·
1 Parent(s): 370aab2

Update README.md

Files changed (1)
  1. README.md +173 -0
README.md CHANGED
@@ -1,3 +1,176 @@
1
  ---
2
  license: mit
3
  ---
4
+ # Model Card for ESMCFold-Fast
5
+
6
+ ## Model Details
7
+
8
+ ESMCFold is a state-of-the-art protein structure prediction model that combines representations from the 6B-parameter ESMC language model with a diffusion-based structure prediction architecture inspired by AlphaFold3. The model predicts high-resolution, all-atom 3D protein structures directly from amino acid sequences, with optional multiple sequence alignment (MSA) input for enhanced accuracy on challenging targets. Its outputs include all-atom coordinates (backbone and side chains), confidence metrics (pLDDT, pAE, pTM, iPTM), and optional distogram predictions for detailed analysis of predicted structures. Unlike ESMFold, ESMCFold can also predict structures that include small molecules, DNA, RNA, and modified amino acids.
9
+
10
+ ESMCFold-Fast is an inference-optimized, single-sequence structure prediction model and is not MSA-conditioned.
11
+
12
+ For additional information, visit the [Biohub Platform](https://biohub.ai) for no-code tools, step-by-step tutorial notebooks, and detailed information on the models.
13
+
14
+ ## System Requirements
15
+
16
+ * Compute Requirements: GPU
17
+ * PyTorch environment with GPU support recommended.
18
+
19
+ ## Model Architecture
20
+
21
+ The architecture consists of an input embedder, which includes the ESMC language model embeddings, a pairwise folding block trunk, an atom diffusion module, an optional MSA encoder, and a confidence head.
22
+
23
+ This hybrid architecture combines the representational power of large-scale protein language models with efficient structure prediction modules, described in detail below:
24
+
25
+ **1\. Language Model Input Embedder:**
26
+
27
+ - Uses a frozen ESMC 6B parameter language model to generate rich single-sequence embeddings.
28
+
29
+ - Provides evolutionary and functional context encoded in high-dimensional representations.
30
+
31
+ - Embeddings capture long-range dependencies and sequence patterns learned from billions of protein sequences.
32
+
33
+ **2\. Pairformer Trunk:**
34
+
35
+ - Processes language model embeddings through a pairformer architecture that refines pairwise and single representations
36
+
37
+ - Architectural improvements: Triangle attention was removed as it did not provide significant benefits, and the single state was removed to improve generalization
38
+
39
+ - Evolves pairwise representations through iterative updates, enabling the model to capture inter-residue relationships and spatial constraints
40
+
41
+ **3\. Diffusion-Based Structure Module:**
42
+
43
+ - Employs a diffusion process to generate all-atom 3D coordinates (backbone and sidechains)
44
+
45
+ - Uses an ODE (Ordinary Differential Equation) solver for efficient sampling
46
+
47
+ - Optimized sampling schedule: 32 steps for sidechain prediction and 8 steps for backbone prediction; additional steps provided no further benefit
48
+
49
+ - This efficient sampling strategy significantly reduces computational cost while maintaining accuracy
50
+
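The fixed-step ODE sampling described above can be illustrated with a minimal sketch. This is not the ESMCFold sampler: it assumes a first-order Euler integrator over the probability-flow ODE `dx/dsigma = (x - D(x, sigma)) / sigma`, a geometric noise schedule, and a toy denoiser `D` that always returns a fixed target. The names `geometric_sigmas` and `euler_ode_sample` and the values `sigma_max` and `sigma_min` are hypothetical and chosen only for illustration.

```python
import random


def geometric_sigmas(num_steps, sigma_max=160.0, sigma_min=0.01):
    """Return num_steps decreasing noise levels followed by a final 0.0.

    The geometric spacing and the default bounds are illustrative assumptions.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    ratio = (sigma_min / sigma_max) ** (1.0 / max(num_steps - 1, 1))
    return [sigma_max * ratio**i for i in range(num_steps)] + [0.0]


def euler_ode_sample(denoise, x_init, sigmas):
    """Integrate dx/dsigma = (x - D(x, sigma)) / sigma with Euler steps."""
    x = list(x_init)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x_hat = denoise(x, s_cur)  # denoiser's estimate of the clean sample
        slope = [(xi - di) / s_cur for xi, di in zip(x, x_hat)]
        x = [xi + (s_next - s_cur) * gi for xi, gi in zip(x, slope)]
    return x


# Toy usage: a denoiser that always predicts the same three "coordinates".
target = [1.0, -2.0, 0.5]
sigmas = geometric_sigmas(num_steps=8)
start = [random.gauss(0.0, sigmas[0]) for _ in target]
sample = euler_ode_sample(lambda x, s: target, start, sigmas)
```

Because each step costs one denoiser call, the number of steps directly sets the sampling cost, which is why a short schedule matters for inference speed.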
51
+ **4\. Optional MSA Encoder (ESMCFold MSA variant):**
52
+
53
+ - Processes multiple sequence alignments to extract evolutionary information
54
+
55
+ - Distills MSA information into pairwise representations that enhance structure prediction
56
+
57
+ - Particularly beneficial for difficult targets with limited sequence information
58
+
59
+ **5\. Confidence Head:**
60
+
61
+ - Additional 4-layer pairformer module that estimates prediction confidence
62
+
63
+ - Outputs multiple confidence metrics: pLDDT (per-residue confidence), pAE (predicted aligned error, pairwise), pTM (predicted template modeling score), and iPTM (interface predicted template modeling score for complexes)
64
+
65
+ **6\. Auxiliary Losses:**
66
+
67
+ - A trunk distogram head, a smooth-LDDT loss on the denoised sample, and a polymer-ligand bond loss supplement the diffusion MSE and confidence losses.
68
+
69
+ **Model Variants:**
70
+
71
+ | Model | MSA Conditioning | Description | Data Cutoff |
72
+ | :---- | :---- | :---- | :---- |
73
+ | [esmcfold-fast-2605](https://huggingface.co/biohub/esmcfold-fast) | No | Inference-optimized single-sequence structure prediction model | June 2025, older cutoff is Sept 2021 |
74
+ | [esmcfold-2605](https://huggingface.co/biohub/esmcfold) | Yes | Large model, capable of either single-sequence or MSA-conditioned structure prediction for improved accuracy on difficult targets | June 2025, older cutoff is Sept 2021 |
75
+
76
+ **Confidence Head (optional, for inference):**
77
+
78
+ - 4 pairformer layers
79
+ - Outputs: pLDDT, pAE, pTM, iPTM
80
+
81
+ ## Intended Use
82
+
83
+ ### Primary Use Cases
84
+
85
+ ESMCFold-Fast is designed to handle a wide range of structural prediction tasks, including:
86
+
87
+ #### **Complex biomolecular interactions**
88
+
89
+ - **Complex structure folding**: Predict structures of complexes involving proteins, DNA, RNA, and small-molecule ligands, including protein-protein and protein-nucleic acid complexes.
90
+
91
+ - **Single-sequence folding**: Predict 3D structures from amino acid sequences without requiring multiple sequence alignments
92
+
93
+
94
+ #### **Confidence Estimation and Quality Assessment**
95
+
96
+ - **Per-residue confidence (pLDDT)**: Local confidence scores (0-100) indicating prediction reliability at each residue position
97
+
98
+ - **Predicted aligned error (pAE)**: Pairwise error estimates for assessing inter-residue distance accuracy
99
+
100
+ - **Template modeling score (pTM)**: Global confidence metric (0-1) for overall structure topology and domain packing
101
+
102
+ - **Interface confidence (iPTM)**: Specialized confidence scores for multimeric complex interfaces
103
+
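As a small illustration of how the per-residue scores might be used, the sketch below groups residues whose pLDDT falls below a cutoff into contiguous ranges. The default cutoff of 50 follows the note in the Limitations section that scores below 50 often indicate disordered or flexible regions. The function name and the plain-list input are assumptions, since the model's output format is not documented here.

```python
def low_confidence_regions(plddt, threshold=50.0):
    """Return 1-based inclusive (start, end) ranges where pLDDT < threshold."""
    regions = []
    start = None
    for i, score in enumerate(plddt, start=1):
        if score < threshold:
            if start is None:
                start = i  # open a new low-confidence run
        elif start is not None:
            regions.append((start, i - 1))  # close the current run
            start = None
    if start is not None:
        regions.append((start, len(plddt)))  # run extends to the last residue
    return regions


print(low_confidence_regions([90, 45, 40, 80, 30]))  # [(2, 3), (5, 5)]
```

Ranges flagged this way are candidates for disorder or flexibility, not proof of it; they are best read alongside pAE and the global scores.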
104
+ #### **Computational Biology Research**
105
+
106
+ - **Evolutionary analysis**: Study structural conservation and divergence across protein families
107
+
108
+ - **Disease variant analysis**: Predict structural impacts of mutations and genetic variants
109
+
110
+ - **Drug discovery**: Support structure-based drug design through accurate binding site prediction
111
+
112
+ ## Training Details
113
+
114
+ The model was trained on sequences from [PDB](https://www.rcsb.org/) and [AlphaFold DB](https://alphafold.ebi.ac.uk/). Refer to the supplement in the [paper](https://biohub.ai/esmc) for details on how data is processed and sampled.
115
+
116
+ ## Performance Metrics
117
+
118
+ ESMCFold was evaluated against state-of-the-art single-sequence and MSA-based structure prediction models on the FoldBench benchmark. ESMCFold meets or exceeds the performance of AlphaFold3 on the antibody-antigen complex prediction, protein-protein complex prediction, and [Runs N’ Poses](https://www.biorxiv.org/content/10.1101/2025.02.03.636309v1) benchmarks. For inference-time scaling on FoldBench, ESMCFold scales gracefully with sample count, comparable to or exceeding AlphaFold3’s scaling behavior on the same targets.
119
+
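One common way to benefit from drawing more samples is best-of-N selection: generate several structures and keep the one with the highest confidence. The card does not say how ESMCFold ranks samples, so the sketch below is an assumption. It borrows the 0.8 × ipTM + 0.2 × pTM weighting used by AlphaFold-Multimer for complexes and falls back to pTM alone when no interface score is present. The dictionary keys `"ptm"` and `"iptm"` are hypothetical.

```python
def ranking_score(conf, iptm_weight=0.8):
    """Combine confidence metrics into one score (assumed weighting)."""
    iptm = conf.get("iptm")
    if iptm is None:
        return conf["ptm"]  # single chain: no interface score available
    return iptm_weight * iptm + (1.0 - iptm_weight) * conf["ptm"]


def best_sample_index(confidences, iptm_weight=0.8):
    """Return the index of the highest-scoring sample."""
    if not confidences:
        raise ValueError("need at least one sample")
    scores = [ranking_score(c, iptm_weight) for c in confidences]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy usage with made-up confidence values for two samples of a complex.
samples = [{"ptm": 0.70, "iptm": 0.40}, {"ptm": 0.65, "iptm": 0.80}]
best = best_sample_index(samples)
```

Selection of this kind only helps to the extent that the confidence head ranks samples well, so accuracy gains from more samples should be checked against the benchmark results in the paper.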
120
+ Refer to the [paper](https://biohub.ai/papers/esmc.pdf) for details on additional performance metrics.
121
+
122
+ ## Usage
123
+
124
+ Coming soon
125
+
126
+ ## Biases, Risks, and Limitations
127
+
128
+ ### Potential Biases
129
+
130
+ - The model may reflect biases present in the training data.
131
+
132
+ ### Risks
133
+
134
+ - **Potential misuse**: Protein structure predictions may be used for malicious purposes such as designing pathogens or toxins.
135
+
136
+ ### Limitations
137
+
138
+ - **Static structure prediction**: The model predicts single static conformations and is not designed for modeling protein dynamics, conformational flexibility, or multiple conformations of the same protein. It does not capture functional motions, allosteric transitions, or time-dependent structural changes.
139
+
140
+ - **Sequence length constraints**: Maximum sequence length is approximately 1600 residues per chain, with total complex size limited by available GPU memory. Very large complexes may require specialized hardware or memory optimization strategies.
141
+
142
+ - **Training data biases**: The model may reflect biases present in the training data (PDB, AFDB, SAbDab), including over-representation of certain protein families, experimental conditions, or structural classes. Performance may vary for underrepresented protein types.
143
+
144
+ - **Novel folds and rare structures**: While the model generalizes well, extremely novel folds or rare structural motifs not well-represented in training data may have reduced accuracy.
145
+
146
+ - **Post-translational modifications**: The model provides limited support for non-standard amino acids and post-translational modifications. Performance may be reduced for heavily modified proteins.
147
+
148
+ - **Membrane proteins**: While the model can predict membrane protein structures, specialized membrane protein prediction tools may provide better results for transmembrane domains and membrane-embedded regions.
149
+
150
+ - **Disordered regions**: The model may struggle with intrinsically disordered regions, which lack well-defined structure. Low pLDDT scores (\<50) often indicate disordered or flexible regions.
151
+
152
+ - **Experimental validation required**: All predictions should be considered hypotheses requiring experimental validation. The model cannot replace experimental structure determination methods (X-ray crystallography, cryo-EM, NMR) for definitive structural characterization.
153
+
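Several of the limitations above can be checked before running a prediction. The following is a minimal pre-flight sketch, assuming protein chains are supplied as a mapping from chain name to one-letter sequence. The function name, the input format, and treating 1600 as a hard threshold are illustrative assumptions; the card gives the limit only as approximate.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def check_chains(chains, max_len=1600):
    """Return a list of warnings; an empty list means nothing was flagged."""
    warnings = []
    for name, seq in chains.items():
        seq = seq.strip().upper()
        if len(seq) > max_len:
            warnings.append(f"{name}: {len(seq)} residues exceeds ~{max_len}")
        unknown = sorted(set(seq) - STANDARD_AA)
        if unknown:
            warnings.append(f"{name}: non-standard residue codes {''.join(unknown)}")
    return warnings


for message in check_chains({"A": "MKTAYIAKQR", "B": "ACDX"}):
    print(message)
```

A warning here does not mean the prediction will fail. It marks inputs where the card says accuracy or memory use may be a concern.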
154
+ ### Out-of-Scope or Unauthorized Use Cases
155
+
156
+ Do not use the model for the following purposes:
157
+
158
+ - Clinical diagnosis or treatment recommendations.
159
+ - Any use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights.
160
+ - Any use that is prohibited by the [model license](https://github.com/Biohub/esm/blob/main/LICENSE.md).
161
+ - Any use that is prohibited by the [Acceptable Use Policy](https://biohub.org/acceptable-use-policy/).
162
+
163
+ ### Caveats and Recommendations
164
+
165
+ - Always review and validate outputs generated by the model.
166
+
167
+ - Treat model outputs as machine-generated hypotheses that require further experimental validation, not as established biological facts.
168
+
169
+ - We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](https://biohub.org/acceptable-use-policy/) when using the model.
170
+
171
+ Should you have any security or privacy issues or questions related to this model, please reach out to our team at [support@biohub.org](mailto:support@biohub.org).
172
+
173
+ ## Acknowledgements
174
+
175
+ Many people on the Biohub AI Research team and prior EvolutionaryScale team contributed to the development of this model. It would not have been possible without them.
176
+