Update readme with bioemu 1.1 and 1.2 model information
#3
by
yuuuxie
- opened
README.md
CHANGED
|
@@ -1,8 +1,5 @@
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
-
---
|
| 4 |
-
---
|
| 5 |
-
license: mit
|
| 6 |
license_link: https://opensource.org/license/mit
|
| 7 |
|
| 8 |
doi: https://doi.org/10.1101/2024.12.05.626885
|
|
@@ -23,9 +20,9 @@ The model is being released together with its companion BioEmu Benchmark (github
|
|
| 23 |
|
| 24 |
### Model Description
|
| 25 |
|
| 26 |
-
Biomolecular Emulator (BioEmu) is a deep learning model that
|
| 27 |
|
| 28 |
-
Please refer to the [
|
| 29 |
|
| 30 |
- **Developed by:** Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew Y. K. Foong, Victor García Satorras, Osama Abdin, Bastiaan S. Veeling, Iryna Zaporozhets, Yaoyi Chen, Soojung Yang, Arne Schneuing, Jigyasa Nigam, Federico Barbero, Vincent Stimper, Andrew Campbell, Jason Yim, Marten Lienen, Yu Shi, Shuxin Zheng, Hannes Schulz, Usman Munir, Cecilia Clementi, Frank Noé
|
| 31 |
- **Funded by:** Microsoft Research AI for Science
|
|
@@ -35,14 +32,14 @@ Please refer to the [BioEmu](https://www.biorxiv.org/content/10.1101/2024.12.05.
|
|
| 35 |
### Model Sources
|
| 36 |
|
| 37 |
- **Repository:** https://github.com/microsoft/bioemu
|
| 38 |
-
- **Paper:** https://www.
|
| 39 |
|
| 40 |
### Available Models
|
| 41 |
|
| 42 |
-
| | bioemu-v1.0 |
|
| 43 |
-
| ------------------ | --------------------- |
|
| 44 |
-
| Training Data Size | 161k structures (AFDB), 216 ms MD simulations, 19k dG measurements |
|
| 45 |
-
| Model Parameters |
|
| 46 |
|
| 47 |
|
| 48 |
## Uses
|
|
@@ -67,42 +64,44 @@ We evaluated model performance on the following tasks:
|
|
| 67 |
- emulation of molecular dynamics (MD) equilibrium distributions
|
| 68 |
- prediction of protein stabilities
|
| 69 |
|
| 70 |
-
For each task we developed a specific combination of testing data and metric, which will be described in the following. For additional details, please refer to the [
|
| 71 |
|
| 72 |
### Testing Data, Factors & Metrics
|
| 73 |
|
| 74 |
#### Testing Data
|
| 75 |
|
| 76 |
For testing **conformational changes**, sets of structures exhibiting different phenomena (local unfolding, domain motion and formation of cryptic pockets) were curated based on PDB reports and published literature. In addition,
|
| 77 |
-
regions affected by the changes were annotated manually ([section 4 of
|
| 78 |
-
To test **emulation of MD equilibrium distributions**, an in-house dataset of molecular dynamics simulation based on the [CATH classification](https://doi.org/10.1093/nar/gkaa1079) of proteins was generated ([SI 6]
|
| 79 |
-
**Protein stability predictions** were evaluated using a combination of published experimental folding free energies (https://www.nature.com/articles/s41586-023-06328-6) ([SI 5]
|
| 80 |
|
| 81 |
-
Details on how the different benchmark datasets were generated can be found in the [
|
| 82 |
|
| 83 |
#### Metrics
|
| 84 |
|
| 85 |
Each task was evaluated using specific metrics:
|
| 86 |
-
- Conformational change tasks were evaluated based on the coverage of reference states. A reference state was counted as covered if at least 0.1 percent of model samples were within a predefined threshold distance of the state, using an appropriate distance measure. The coverage was first averaged over all the reference states corresponding to each sequence, and then averaged over sequences. Coverages for local unfolding and cryptic pockets were further classified into folded / unfolded and apo / holo state contributions ([SI 4.3]
|
| 87 |
-
- MD emulation performance was evaluated by computing time-lagged independent component analysis (TICA) projections of the generated MD data and identifying metastable states by hidden Markov model (HMM) analysis. Model samples were then projected into the same 2D space and assigned to metastable states based on the HMM. Finally, the mean absolute error between the free energies of these states was computed relative to the values obtained from the base MD simulations ([SI 6]
|
| 88 |
-
- Protein stability prediction was evaluated based on the mean absolute errors and correlation coefficients between experimentally measured folding free energies and model predictions ([SI 5.2]
|
| 89 |
|
| 90 |
-
In all cases, please refer to the [
|
| 91 |
|
| 92 |
### Results
|
| 93 |
|
| 94 |
-
For tasks investigating **conformational changes**, BioEmu model achieves overall coverages of
|
| 95 |
-
On the **emulation of MD equilibrium distributions**, BioEmu achieves a mean absolute error of 0.
|
| 96 |
Variants of BioEmu trained and tested on a dataset of fast folding proteins reported previously (https://doi.org/10.1126/science.1208351) achieved a mean absolute error of 0.74 kcal/mol.
|
| 97 |
-
In the **protein stability prediction** tasks, we obtain free energy mean absolute errors of 0.
|
|
|
|
| 98 |
|
| 99 |
-
All test datasets and code necessary to reproduce these results
|
| 100 |
|
| 101 |
## Technical Specifications
|
| 102 |
|
| 103 |
### Model Architecture and Objective
|
| 104 |
|
| 105 |
-
BioEmu-v1 model is **DiG** architecture (https://www.nature.com/articles/s42256-024-00837-3) trained on a variety of datasets to sample systematically diverse structure ensembles. In the pretraining phase, we use denoising score matching to match the distribution of flexible protein structures curated from AFDB. In the fine-tuning phase, we use a combination of denoising score matching objective for molecular dynamics data and property prediction fine-tuning (PPFT) for matching the experimental folding free energies. For more details of PPFT, please see our [
|
|
|
|
| 106 |
|
| 107 |
#### Software
|
| 108 |
|
|
@@ -112,13 +111,14 @@ BioEmu-v1 model is **DiG** architecture (https://www.nature.com/articles/s42256-
|
|
| 112 |
|
| 113 |
**BibTeX:**
|
| 114 |
```
|
| 115 |
-
@
|
| 116 |
-
title
|
| 117 |
-
author
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
|
|
|
| 122 |
}
|
| 123 |
```
|
| 124 |
|
|
@@ -126,7 +126,7 @@ BioEmu-v1 model is **DiG** architecture (https://www.nature.com/articles/s42256-
|
|
| 126 |
|
| 127 |
We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected behavior in our technology, please contact us at:
|
| 128 |
- Frank Noé (franknoe@microsoft.com)
|
| 129 |
-
|
| 130 |
If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.
|
| 131 |
|
| 132 |
### Out-of-Scope Use
|
|
@@ -136,12 +136,14 @@ The model does not support generation of new protein sequences as it is designed
|
|
| 136 |
The model is intended for research and experimental purposes. Further testing and development are needed before considering its application in real-world scenarios.
|
| 137 |
|
| 138 |
## Bias, Risks, and Limitations
|
| 139 |
-
Our model has been trained on a large variety of structurally resolved proteins, so it inherits the biases of this data (see [
|
| 140 |
The current model has low prediction quality for protein-protein interactions, including multi-chain proteins, and does not feature explicit interactions with other chemical entities like small molecules.
|
| 141 |
Besides experimental data, the model is trained on synthetic data, namely AlphaFold2 predictions and molecular dynamics simulations.
|
| 142 |
We expect that the approximations of these models are propagated to BioEmu.
|
| 143 |
|
| 144 |
|
| 145 |
### Recommendations
|
| 146 |
-
We recommend using this model only for the purposes specified here or described in the [
|
| 147 |
-
In particular, we
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
license: mit
|
|
|
|
|
|
|
|
|
|
| 3 |
license_link: https://opensource.org/license/mit
|
| 4 |
|
| 5 |
doi: https://doi.org/10.1101/2024.12.05.626885
|
|
|
|
| 20 |
|
| 21 |
### Model Description
|
| 22 |
|
| 23 |
+
Biomolecular Emulator (BioEmu) is a deep learning model that emulates protein structural ensembles orders of magnitude faster than traditional molecular dynamics simulations. By leveraging novel training methods and large amounts of data (protein structures, over 200 milliseconds of MD simulation, and experimental protein stabilities), BioEmu’s protein structural ensembles represent equilibrium across a range of challenging and practically relevant metrics. Qualitatively, BioEmu samples many functionally relevant conformational changes, ranging from the formation of cryptic pockets and the unfolding of specific protein regions to large-scale domain rearrangements. Quantitatively, BioEmu samples protein conformations with relative free energy errors around 1 kcal/mol, as validated against millisecond-timescale MD simulation and experimentally measured protein stabilities. By simultaneously emulating structural ensembles and thermodynamic properties, BioEmu reveals mechanistic insights, such as the causes of fold destabilization in mutants, and can efficiently provide experimentally testable hypotheses.
|
| 24 |
|
| 25 |
+
Please refer to the BioEmu [paper] for more details on the model.
|
| 26 |
|
| 27 |
- **Developed by:** Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew Y. K. Foong, Victor García Satorras, Osama Abdin, Bastiaan S. Veeling, Iryna Zaporozhets, Yaoyi Chen, Soojung Yang, Arne Schneuing, Jigyasa Nigam, Federico Barbero, Vincent Stimper, Andrew Campbell, Jason Yim, Marten Lienen, Yu Shi, Shuxin Zheng, Hannes Schulz, Usman Munir, Cecilia Clementi, Frank Noé
|
| 28 |
- **Funded by:** Microsoft Research AI for Science
|
|
|
|
| 32 |
### Model Sources
|
| 33 |
|
| 34 |
- **Repository:** https://github.com/microsoft/bioemu
|
| 35 |
+
- **Paper:** https://www.science.org/doi/10.1126/science.adv9817
|
| 36 |
|
| 37 |
### Available Models
|
| 38 |
|
| 39 |
+
| | bioemu-v1.0 | bioemu-v1.1 | bioemu-v1.2 |
|
| 40 |
+
| ------------------ | --------------------- | ------------------ | ----------------------|
|
| 41 |
+
| Training Data Size | 161k structures (AFDB), 216 ms MD simulations, 19k dG measurements | AFDB and MD same as v1.0, 502k dG measurements | AFDB same as v1.0, 145.4 ms MD simulations, 1.3M dG measurements |
|
| 42 |
+
| Model Parameters | 31.4M | 31.4M | 35.7M |
|
| 43 |
|
| 44 |
|
| 45 |
## Uses
|
|
|
|
| 64 |
- emulation of molecular dynamics (MD) equilibrium distributions
|
| 65 |
- prediction of protein stabilities
|
| 66 |
|
| 67 |
+
For each task we developed a specific combination of testing data and metric, both described below. For additional details, please refer to the [paper].
|
| 68 |
|
| 69 |
### Testing Data, Factors & Metrics
|
| 70 |
|
| 71 |
#### Testing Data
|
| 72 |
|
| 73 |
For testing **conformational changes**, sets of structures exhibiting different phenomena (local unfolding, domain motion and formation of cryptic pockets) were curated based on PDB reports and published literature. In addition,
|
| 74 |
+
regions affected by the changes were annotated manually ([section 4 of SI][paper]).
|
| 75 |
+
To test **emulation of MD equilibrium distributions**, an in-house dataset of molecular dynamics simulations based on the [CATH classification](https://doi.org/10.1093/nar/gkaa1079) of proteins was generated ([SI 6][paper]).
|
| 76 |
+
**Protein stability predictions** were evaluated using a combination of published experimental folding free energies (https://www.nature.com/articles/s41586-023-06328-6) ([SI 5][paper]).
|
| 77 |
|
| 78 |
+
Details on how the different benchmark datasets were generated can be found in the [paper].
|
| 79 |
|
| 80 |
#### Metrics
|
| 81 |
|
| 82 |
Each task was evaluated using specific metrics:
|
| 83 |
+
- Conformational change tasks were evaluated based on the coverage of reference states. A reference state was counted as covered if at least 0.1 percent of model samples were within a predefined threshold distance of the state, using an appropriate distance measure. The coverage was first averaged over all the reference states corresponding to each sequence, and then averaged over sequences. Coverages for local unfolding and cryptic pockets were further classified into folded / unfolded and apo / holo state contributions ([SI 4.3][paper]).
|
| 84 |
+
- MD emulation performance was evaluated by computing time-lagged independent component analysis (TICA) projections of the generated MD data and identifying metastable states by hidden Markov model (HMM) analysis. Model samples were then projected into the same 2D space and assigned to metastable states based on the HMM. Finally, the mean absolute error between the free energies of these states was computed relative to the values obtained from the base MD simulations ([SI 6][paper]).
|
| 85 |
+
- Protein stability prediction was evaluated based on the mean absolute errors and correlation coefficients between experimentally measured folding free energies and model predictions ([SI 5.2][paper]).
|
| 86 |
|
| 87 |
+
In all cases, please refer to the [paper] for details.
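The coverage metric for conformational change tasks can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark implementation: the input layout (`distances_by_sequence`), the function names, and the use of `<=` for "within the threshold" are hypothetical, and the distance measure itself is assumed to be computed elsewhere.

```python
from statistics import mean


def state_covered(distances, threshold, min_fraction=0.001):
    """Return True if at least `min_fraction` (0.1 percent) of the model
    samples lie within `threshold` of the reference state."""
    if not distances:
        return False
    hits = sum(1 for d in distances if d <= threshold)
    return hits / len(distances) >= min_fraction


def overall_coverage(distances_by_sequence, threshold):
    """Average coverage over the reference states of each sequence,
    then average those per-sequence values over all sequences.

    `distances_by_sequence` (hypothetical layout) maps a sequence id to a
    list with one entry per reference state; each entry holds the
    sample-to-reference distances for that state.
    """
    per_sequence = [
        mean(1.0 if state_covered(d, threshold) else 0.0 for d in states)
        for states in distances_by_sequence.values()
    ]
    return mean(per_sequence)


# Toy data: distances in arbitrary units, not real benchmark values.
example = {
    "seq_a": [[0.5, 4.0, 5.0], [6.0, 7.0, 8.0]],  # one of two states covered
    "seq_b": [[1.0, 1.5, 9.0]],                   # the only state covered
}
coverage = overall_coverage(example, threshold=2.0)
```

Averaging per sequence first keeps a sequence with many reference states from dominating the overall score.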
|
| 88 |
|
| 89 |
### Results
|
| 90 |
|
| 91 |
+
For tasks investigating **conformational changes**, the BioEmu model achieves an overall coverage of 83% for domain motion. Coverage for local unfolding events is 70% for locally folded and 82% for locally unfolded states, respectively. For cryptic pockets, we observe coverages of 55% for apo (unbound) and 88% for holo (bound) states.
|
| 92 |
+
On the **emulation of MD equilibrium distributions**, BioEmu achieves a mean absolute error of 0.9 kcal/mol using the above metric for the in-house dataset.
|
| 93 |
Variants of BioEmu trained and tested on a dataset of fast folding proteins reported previously (https://doi.org/10.1126/science.1208351) achieved a mean absolute error of 0.74 kcal/mol.
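To make the free energy error concrete, the sketch below converts metastable state populations into free energies via F = -kT ln(p) and compares model samples against MD. It is a simplified illustration: the temperature of 300 K, the function names, and the input format (per-state sample counts, already assigned by the HMM) are assumptions, and the TICA projection and HMM analysis are not reproduced here.

```python
import math

KCAL_PER_MOL_K = 0.0019872041  # gas constant R in kcal/(mol*K)


def free_energies(counts, temperature=300.0):
    """Free energy of each state from its population: F = -kT * ln(p).

    Raises ValueError for a state with zero samples (log of zero);
    how empty states are handled in the benchmark is not specified here.
    """
    total = sum(counts)
    kt = KCAL_PER_MOL_K * temperature
    return [-kt * math.log(c / total) for c in counts]


def free_energy_mae(model_counts, md_counts, temperature=300.0):
    """Mean absolute error between model and MD state free energies."""
    f_model = free_energies(model_counts, temperature)
    f_md = free_energies(md_counts, temperature)
    return sum(abs(a - b) for a, b in zip(f_model, f_md)) / len(f_md)


# Same relative populations give zero error, regardless of sample count.
mae_identical = free_energy_mae([50, 30, 20], [500, 300, 200])
# A model that splits two states evenly while MD favors one of them.
mae_shifted = free_energy_mae([50, 50], [90, 10])
```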
|
| 94 |
+
In the **protein stability prediction** tasks, we obtain a free energy mean absolute error of 0.9 kcal/mol relative to experimental measurements. The associated Spearman correlation coefficient is 0.6.
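The two stability metrics can be computed as in the following sketch, which uses only the Python standard library. The function names and the toy values are hypothetical; Spearman correlation is implemented here as the Pearson correlation of average ranks, which is the standard definition, and the benchmark code may compute it differently.

```python
import math


def mean_absolute_error(predicted, measured):
    """Mean absolute difference between predictions and measurements."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy folding free energies in kcal/mol, not real measurements.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.0, 3.5, 5.0]
mae = mean_absolute_error(predicted, measured)
rho = spearman(predicted, measured)
```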
|
| 95 |
+
These results are reported in the [paper].
|
| 96 |
|
| 97 |
+
All test datasets and code necessary to reproduce these results are released in a separate code package: https://github.com/microsoft/bioemu-benchmarks/tree/main. Results for all released model checkpoints are also included there.
|
| 98 |
|
| 99 |
## Technical Specifications
|
| 100 |
|
| 101 |
### Model Architecture and Objective
|
| 102 |
|
| 103 |
+
The BioEmu-v1 model uses the **DiG** architecture (https://www.nature.com/articles/s42256-024-00837-3) and is trained on a variety of datasets to sample systematically diverse structure ensembles. In the pretraining phase, we use denoising score matching to match the distribution of flexible protein structures curated from AFDB. In the fine-tuning phase, we use a combination of a denoising score matching objective for molecular dynamics data and property prediction fine-tuning (PPFT) for matching the experimental folding free energies. For more details on PPFT, please see our [paper].
|
| 104 |
+
The BioEmu-v1.2 model adds extra embeddings for residue types and residue pairs.
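As a rough illustration of the denoising score matching objective mentioned above, the toy sketch below estimates the loss for one-dimensional Gaussian data. It is not BioEmu's training code: the real model operates on protein structures with the DiG architecture, while here the data, the two candidate score functions, and the sigma-squared loss weighting (a common choice) are all assumptions made for the example.

```python
import random


def dsm_loss(score_fn, data, sigma, rng):
    """Monte Carlo estimate of the denoising score matching loss at a
    single noise level `sigma` (weighted by sigma**2, an assumed choice)."""
    total = 0.0
    for x in data:
        eps = rng.gauss(0.0, 1.0)
        noisy = x + sigma * eps
        target = -eps / sigma  # score of the Gaussian noise kernel at `noisy`
        total += (sigma ** 2) * (score_fn(noisy, sigma) - target) ** 2
    return total / len(data)


def exact_score(x, sigma):
    # Standard normal data plus noise is N(0, 1 + sigma^2),
    # whose score is -x / (1 + sigma^2).
    return -x / (1.0 + sigma ** 2)


def zero_score(x, sigma):
    # A deliberately poor baseline that ignores its input.
    return 0.0


data_rng = random.Random(0)
data = [data_rng.gauss(0.0, 1.0) for _ in range(20000)]

# Use the same noise seed for both evaluations so they are comparable.
loss_exact = dsm_loss(exact_score, data, 0.5, random.Random(1))
loss_zero = dsm_loss(zero_score, data, 0.5, random.Random(1))
```

The exact score of the noised distribution yields a lower loss than the zero baseline, which is the property that training a score model exploits.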
|
| 105 |
|
| 106 |
#### Software
|
| 107 |
|
|
|
|
| 111 |
|
| 112 |
**BibTeX:**
|
| 113 |
```
|
| 114 |
+
@article{bioemu2025,
|
| 115 |
+
title={Scalable emulation of protein equilibrium ensembles with generative deep learning},
|
| 116 |
+
author={Lewis, Sarah and Hempel, Tim and Jim{\'e}nez-Luna, Jos{\'e} and Gastegger, Michael and Xie, Yu and Foong, Andrew YK and Satorras, Victor Garc{\'\i}a and Abdin, Osama and Veeling, Bastiaan S and Zaporozhets, Iryna and Chen, Yaoyi and Yang, Soojung and Foster, Adam E. and Schneuing, Arne and Nigam, Jigyasa and Barbero, Federico and Stimper, Vincent and Campbell, Andrew and Yim, Jason and Lienen, Marten and Shi, Yu and Zheng, Shuxin and Schulz, Hannes and Munir, Usman and Sordillo, Roberto and Tomioka, Ryota and Clementi, Cecilia and No{\'e}, Frank},
|
| 117 |
+
journal={Science},
|
| 118 |
+
pages={eadv9817},
|
| 119 |
+
year={2025},
|
| 120 |
+
publisher={American Association for the Advancement of Science},
|
| 121 |
+
doi={10.1126/science.adv9817}
|
| 122 |
}
|
| 123 |
```
|
| 124 |
|
|
|
|
| 126 |
|
| 127 |
We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected behavior in our technology, please contact us at:
|
| 128 |
- Frank Noé (franknoe@microsoft.com)
|
| 129 |
+
|
| 130 |
If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.
|
| 131 |
|
| 132 |
### Out-of-Scope Use
|
|
|
|
| 136 |
The model is intended for research and experimental purposes. Further testing and development are needed before considering its application in real-world scenarios.
|
| 137 |
|
| 138 |
## Bias, Risks, and Limitations
|
| 139 |
+
Our model has been trained on a large variety of structurally resolved proteins, so it inherits the biases of this data (see [paper] for details).
|
| 140 |
The current model has low prediction quality for protein-protein interactions, including multi-chain proteins, and does not feature explicit interactions with other chemical entities like small molecules.
|
| 141 |
Besides experimental data, the model is trained on synthetic data, namely AlphaFold2 predictions and molecular dynamics simulations.
|
| 142 |
We expect that the approximations of these models are propagated to BioEmu.
|
| 143 |
|
| 144 |
|
| 145 |
### Recommendations
|
| 146 |
+
We recommend using this model only for the purposes specified here or described in the [paper].
|
| 147 |
+
In particular, we advise against predicting entities that are not covered by the embeddings used or not represented in the training data, including but not limited to multi-chain proteins.
|
| 148 |
+
|
| 149 |
+
[paper]: https://www.science.org/doi/10.1126/science.adv9817
|