Update readme with bioemu 1.1 and 1.2 model information

#3
by yuuuxie - opened
Files changed (1)
  1. README.md +37 -35
README.md CHANGED
@@ -1,8 +1,5 @@
1
  ---
2
  license: mit
3
- ---
4
- ---
5
- license: mit
6
  license_link: https://opensource.org/license/mit
7
 
8
  doi: https://doi.org/10.1101/2024.12.05.626885
@@ -23,9 +20,9 @@ The model is being released together with its companion BioEmu Benchmark (github
23
 
24
  ### Model Description
25
 
26
- Biomolecular Emulator (BioEmu) is a deep learning model that, given a protein sequence, can sample thousands of statistically independent structures from the protein structure ensemble per hour on a single graphics processing unit. By leveraging novel training methods and vast data of protein structures, over 200 milliseconds of MD simulation, and experimental protein stabilities, BioEmu’s protein ensembles represent equilibrium in a range of challenging and practically relevant metrics. Qualitatively, BioEmu samples many functionally relevant conformational changes, ranging from formation of cryptic pockets, over unfolding of specific protein regions, to large-scale domain rearrangements. Quantitatively, BioEmu samples protein conformations with relative free energy errors around 1 kcal/mol, as validated against millisecond-timescale MD simulation and experimentally-measured protein stabilities. By simultaneously emulating structural ensembles and thermodynamic properties, BioEmu reveals mechanistic insights, such as the causes for fold destabilization of mutants, and can efficiently provide experimentally-testable hypotheses.
27
 
28
- Please refer to the [BioEmu](https://www.biorxiv.org/content/10.1101/2024.12.05.626885) manuscript for more details on the model.
29
 
30
  - **Developed by:** Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew Y. K. Foong, Victor García Satorras, Osama Abdin, Bastiaan S. Veeling, Iryna Zaporozhets, Yaoyi Chen, Soojung Yang, Arne Schneuing, Jigyasa Nigam, Federico Barbero, Vincent Stimper, Andrew Campbell, Jason Yim, Marten Lienen, Yu Shi, Shuxin Zheng, Hannes Schulz, Usman Munir, Cecilia Clementi, Frank Noé
31
  - **Funded by:** Microsoft Research AI for Science
@@ -35,14 +32,14 @@ Please refer to the [BioEmu](https://www.biorxiv.org/content/10.1101/2024.12.05.
35
  ### Model Sources
36
 
37
  - **Repository:** https://github.com/microsoft/bioemu
38
- - **Paper:** https://www.biorxiv.org/content/10.1101/2024.12.05.626885
39
 
40
  ### Available Models
41
 
42
- | | bioemu-v1.0 |
43
- | ------------------ | --------------------- |
44
- | Training Data Size | 161k structures (AFDB), 216 ms MD simulations, 19k dG measurements |
45
- | Model Parameters | 31M |
46
 
47
 
48
  ## Uses
@@ -67,42 +64,44 @@ We evaluated model performance on the following tasks:
67
  - emulation of molecular dynamics (MD) equilibrium distributions
68
  - prediction of protein stabilities
69
 
70
- For each task we developed a specific combination of testing data and metric, which will be described in the following. For additional details, please refer to the [manuscript](https://www.biorxiv.org/content/10.1101/2024.12.05.626885).
71
 
72
  ### Testing Data, Factors & Metrics
73
 
74
  #### Testing Data
75
 
76
  For testing **conformational changes**, sets of structures exhibiting different phenomena (local unfolding, domain motion and formation of cryptic pockets) were curated based on PDB reports and published literature. In addition,
77
- regions affected by the changes were annotated manually ([section 4 of the manuscript SI](https://www.biorxiv.org/content/10.1101/2024.12.05.626885)).
78
- To test **emulation of MD equilibrium distributions**, an in-house dataset of molecular dynamics simulation based on the [CATH classification](https://doi.org/10.1093/nar/gkaa1079) of proteins was generated ([SI 6](https://www.biorxiv.org/content/10.1101/2024.12.05.626885)).
79
- **Protein stability predictions** were evaluated using a combination of published experimental folding free energies (https://www.nature.com/articles/s41586-023-06328-6) ([SI 5](https://www.biorxiv.org/content/10.1101/2024.12.05.626885)).
80
 
81
- Details on how the different benchmark datasets were generated can be found in the [manuscript](https://www.biorxiv.org/content/10.1101/2024.12.05.626885).
82
 
83
  #### Metrics
84
 
85
  Each task was evaluated using specific metrics:
86
- - Conformational change tasks were evaluated based on the coverage of reference states. A reference state was counted as covered if at least 0.1 percent of model samples were within a predefined threshold distance of the state, using an appropriate distance measure. The coverage was first averaged over all the reference states corresponding to each sequence, and then averaged over sequences. Coverages for local unfolding and crypic pockets were further classified into folded / unfolded and apo / holo state contributions ([SI 4.3](https://www.biorxiv.org/content/10.1101/2024.12.05.626885)).
87
- - MD emulation performance was evaluated by computing time-lagged independent component analysis (TICA) projections of the generated MD data and identifying metastable states by hidden Markov model (HMM) analysis. Model samples were then projected into the same 2D space and assigned to metastable states based on the HMM. Finally, the mean absolute error between the free energies of these states was computed relative to the values obtained from the base MD simulations ([SI 6](https://www.biorxiv.org/content/10.1101/2024.12.05.626885)).
88
- - Protein stability prediction was evaluated based on the mean absolute errors and correlation coefficients between experimentally measured folding free energies and model predictions ([SI 5.2](https://www.biorxiv.org/content/10.1101/2024.12.05.626885)).
89
 
90
- In all cases, please refer to the [manuscript](https://www.biorxiv.org/content/10.1101/2024.12.05.626885) for details.
91
 
92
  ### Results
93
 
94
- For tasks investigating **conformational changes**, BioEmu model achieves overall coverages of 85 % for domain motion. Coverage for local unfolding events is 72% for locally folded and 74% for locally unfolded states respectively. For cryptic pockets, we observe coverages of 49 % for apo (unbound) and 85 % for holo (bound) states.
95
- On the **emulation of MD equilibrium distributions**, BioEmu achieves a mean absolute error of 0.91 kcal/mol using the above metric for the in-house dataset.
96
  Variants of BioEmu trained and tested on a dataset of fast folding proteins reported previously (https://doi.org/10.1126/science.1208351) achieved a mean absolute error of 0.74 kcal/mol.
97
- In the **protein stability prediction** tasks, we obtain free energy mean absolute errors of 0.76 kcal/mol relative to experimental measurements. The associated Pearson correlation coefficient is 0.66, and the Spearman's correlation coefficient is 0.64.
 
98
 
99
- All test datasets and code necessary to reproduce these results will be released in a separate code package.
100
 
101
  ## Technical Specifications
102
 
103
  ### Model Architecture and Objective
104
 
105
- BioEmu-v1 model is **DiG** architecture (https://www.nature.com/articles/s42256-024-00837-3) trained on a variety of datasets to sample systematically diverse structure ensembles. In the pretraining phase, we use denoising score matching to match the distribution of flexible protein structures curated from AFDB. In the fine-tuning phase, we use a combination of denoising score matching objective for molecular dynamics data and property prediction fine-tuning (PPFT) for matching the experimental folding free energies. For more details of PPFT, please see our [manuscript](https://www.biorxiv.org/content/10.1101/2024.12.05.626885).
 
106
 
107
  #### Software
108
 
@@ -112,13 +111,14 @@ BioEmu-v1 model is **DiG** architecture (https://www.nature.com/articles/s42256-
112
 
113
  **BibTeX:**
114
  ```
115
- @misc{lewis_scalable_2024,
116
- title = {Scalable Emulation of Protein Equilibrium Ensembles with Generative Deep Learning},
117
- author = {Lewis, Sarah and Hempel, Tim and Jim{\'e}nez Luna, Jos{\'e} and Gastegger, Michael and Xie, Yu and Foong, Andrew Y. K. and Garc{\'i}a Satorras, Victor and Abdin, Osama and Veeling, Bastiaan S. and Zaporozhets, Iryna and Chen, Yaoyi and Yang, Soojung and Schneuing, Arne and Nigam, Jigyasa and Barbero, Federico and Stimper, Vincent and Campbell, Andrew and Yim, Jason and Lienen, Marten and Shi, Yu and Zheng, Shuxin and Schulz, Hannes and Munir, Usman and Clementi, Cecilia and No{\'e}, Frank},
118
- year = {2024},
119
- doi = {10.1101/2024.12.05.626885},
120
- archiveprefix = {BioRXiv},
121
- url = {https://www.biorxiv.org/content/10.1101/2024.12.05.626885}
 
122
  }
123
  ```
124
 
@@ -126,7 +126,7 @@ BioEmu-v1 model is **DiG** architecture (https://www.nature.com/articles/s42256-
126
 
127
  We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected behavior in our technology, please contact us at:
128
  - Frank Noe (franknoe@microsoft.com)
129
- - Ryota Tomioka (ryoto@microsoft.com)
130
  If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.
131
 
132
  ### Out-of-Scope Use
@@ -136,12 +136,14 @@ The model does not support generation of new protein sequences as it is designed
136
  The model is intended for research and experimental purposes. Further testing/development are needed before considering its application in real-world scenarios.
137
 
138
  ## Bias, Risks, and Limitations
139
- Our model has been trained on a large variety of structurally resolved proteins, so it inherits the biases of this data (see [manuscript](https://www.biorxiv.org/content/10.1101/2024.12.05.626885) for details).
140
  The current model has low prediction quality for protein-protein interactions, including multi-chain proteins, and does not feature explicit interactions with other chemical entities like small molecules.
141
  Besides experimental data, the model is trained on synthetic data, namely AlphaFold2 predictions and molecular dynamics simulations.
142
  We expect that the approximations of these models are propagated to BioEmu.
143
 
144
 
145
  ### Recommendations
146
- We recommend using this model only for the purposes specified here or described in the [manuscript](https://www.biorxiv.org/content/10.1101/2024.12.05.626885).
147
- In particular, we advice against predicting entities that are not considered by the used embeddings or represented in the training data, including but not limited to multi-chain proteins.
 
 
 
1
  ---
2
  license: mit
 
 
 
3
  license_link: https://opensource.org/license/mit
4
 
5
  doi: https://doi.org/10.1101/2024.12.05.626885
 
20
 
21
  ### Model Description
22
 
23
+ Biomolecular Emulator (BioEmu) is a deep learning model that emulates protein structural ensembles at a speed that is orders of magnitude faster than traditional molecular dynamics simulations. By leveraging novel training methods and vast data of protein structures, over 200 milliseconds of MD simulation, and experimental protein stabilities, BioEmu’s protein structural ensembles represent equilibrium in a range of challenging and practically relevant metrics. Qualitatively, BioEmu samples many functionally relevant conformational changes, ranging from formation of cryptic pockets, over unfolding of specific protein regions, to large-scale domain rearrangements. Quantitatively, BioEmu samples protein conformations with relative free energy errors around 1 kcal/mol, as validated against millisecond-timescale MD simulation and experimentally-measured protein stabilities. By simultaneously emulating structural ensembles and thermodynamic properties, BioEmu reveals mechanistic insights, such as the causes for fold destabilization of mutants, and can efficiently provide experimentally-testable hypotheses.
24
 
25
+ Please refer to the BioEmu [paper] for more details on the model.
26
 
27
  - **Developed by:** Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew Y. K. Foong, Victor García Satorras, Osama Abdin, Bastiaan S. Veeling, Iryna Zaporozhets, Yaoyi Chen, Soojung Yang, Arne Schneuing, Jigyasa Nigam, Federico Barbero, Vincent Stimper, Andrew Campbell, Jason Yim, Marten Lienen, Yu Shi, Shuxin Zheng, Hannes Schulz, Usman Munir, Cecilia Clementi, Frank Noé
28
  - **Funded by:** Microsoft Research AI for Science
 
32
  ### Model Sources
33
 
34
  - **Repository:** https://github.com/microsoft/bioemu
35
+ - **Paper:** https://www.science.org/doi/10.1126/science.adv9817
36
 
37
  ### Available Models
38
 
39
+ | | bioemu-v1.0 | bioemu-v1.1 | bioemu-v1.2 |
40
+ | ------------------ | --------------------- | ------------------ | ----------------------|
41
+ | Training Data Size | 161k structures (AFDB), 216 ms MD simulations, 19k dG measurements | AFDB and MD same as v1.0, 502k dG measurements | AFDB same as v1.0, 145.4 ms MD simulations, 1.3M dG measurements |
42
+ | Model Parameters | 31.4M | 31.4M | 35.7M |
43
 
44
 
45
  ## Uses
 
64
  - emulation of molecular dynamics (MD) equilibrium distributions
65
  - prediction of protein stabilities
66
 
67
+ For each task we developed a specific combination of testing data and metric, described below. For additional details, please refer to the [paper].
68
 
69
  ### Testing Data, Factors & Metrics
70
 
71
  #### Testing Data
72
 
73
  For testing **conformational changes**, sets of structures exhibiting different phenomena (local unfolding, domain motion and formation of cryptic pockets) were curated based on PDB reports and published literature. In addition,
74
+ regions affected by the changes were annotated manually ([section 4 of SI][paper]).
75
+ To test **emulation of MD equilibrium distributions**, an in-house dataset of molecular dynamics simulation based on the [CATH classification](https://doi.org/10.1093/nar/gkaa1079) of proteins was generated ([SI 6][paper]).
76
+ **Protein stability predictions** were evaluated using a combination of published experimental folding free energies (https://www.nature.com/articles/s41586-023-06328-6) ([SI 5][paper]).
77
 
78
+ Details on how the different benchmark datasets were generated can be found in the [paper].
79
 
80
  #### Metrics
81
 
82
  Each task was evaluated using specific metrics:
83
+ - Conformational change tasks were evaluated based on the coverage of reference states. A reference state was counted as covered if at least 0.1 percent of model samples were within a predefined threshold distance of the state, using an appropriate distance measure. The coverage was first averaged over all the reference states corresponding to each sequence, and then averaged over sequences. Coverages for local unfolding and cryptic pockets were further classified into folded / unfolded and apo / holo state contributions ([SI 4.3][paper]).
84
+ - MD emulation performance was evaluated by computing time-lagged independent component analysis (TICA) projections of the generated MD data and identifying metastable states by hidden Markov model (HMM) analysis. Model samples were then projected into the same 2D space and assigned to metastable states based on the HMM. Finally, the mean absolute error between the free energies of these states was computed relative to the values obtained from the base MD simulations ([SI 6][paper]).
85
+ - Protein stability prediction was evaluated based on the mean absolute errors and correlation coefficients between experimentally measured folding free energies and model predictions ([SI 5.2][paper]).
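
The coverage rule from the first bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark implementation: the function names, the input layout (one list of sample-to-state distances per reference state), and the use of `<=` for "within the threshold" are all hypothetical choices.

```python
from statistics import mean


def state_covered(distances, threshold, min_fraction=0.001):
    """True if at least `min_fraction` (0.1 percent) of samples lie within `threshold`."""
    if not distances:
        return False
    hits = sum(1 for d in distances if d <= threshold)  # assumption: inclusive threshold
    return hits / len(distances) >= min_fraction


def coverage(per_sequence, threshold):
    """Average coverage over reference states per sequence, then over sequences.

    `per_sequence` maps a sequence id to a list with one entry per reference
    state; each entry is the list of distances from every model sample to
    that state (hypothetical layout).
    """
    per_sequence_coverage = [
        mean(1.0 if state_covered(d, threshold) else 0.0 for d in states)
        for states in per_sequence.values()
    ]
    return mean(per_sequence_coverage)


# Toy data: seqA has two reference states (one covered), seqB has one (covered).
toy = {
    "seqA": [[0.5] + [5.0] * 999, [5.0] * 1000],
    "seqB": [[0.2, 0.3, 9.0]],
}
result = coverage(toy, threshold=1.0)
```

Averaging per sequence first keeps sequences with many reference states from dominating the final number.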
86
 
87
+ In all cases, please refer to the [paper] for details.
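
The last step of the MD emulation metric, comparing state free energies between model samples and MD, might look like the sketch below. It assumes the metastable-state assignments are already available (the TICA projection and HMM analysis are not shown), uses the Boltzmann relation F = -kT ln p for state populations, and assumes kT ≈ 0.596 kcal/mol (about 300 K). The function names and the handling of unpopulated states are hypothetical and may differ from the benchmark code.

```python
import math
from collections import Counter

# Assumption: k_B * T in kcal/mol at roughly 300 K.
KT_KCAL_PER_MOL = 0.596


def state_free_energies(assignments, n_states, kt=KT_KCAL_PER_MOL):
    """Free energy of each state from its population: F = -kT * ln(p)."""
    counts = Counter(assignments)
    total = len(assignments)
    energies = []
    for state in range(n_states):
        if counts[state] == 0:
            # This sketch does not define a free energy for empty states.
            raise ValueError(f"state {state} has no samples")
        energies.append(-kt * math.log(counts[state] / total))
    return energies


def free_energy_mae(model_assignments, md_assignments, n_states, kt=KT_KCAL_PER_MOL):
    """Mean absolute error between model and MD state free energies."""
    f_model = state_free_energies(model_assignments, n_states, kt)
    f_md = state_free_energies(md_assignments, n_states, kt)
    return sum(abs(a - b) for a, b in zip(f_model, f_md)) / n_states


# Toy example with two metastable states.
md_states = [0] * 50 + [1] * 50
model_states = [0] * 75 + [1] * 25
mae = free_energy_mae(model_states, md_states, n_states=2)
```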
88
 
89
  ### Results
90
 
91
+ For tasks investigating **conformational changes**, the BioEmu model achieves an overall coverage of 83 % for domain motion. Coverage for local unfolding events is 70 % for locally folded and 82 % for locally unfolded states, respectively. For cryptic pockets, we observe coverages of 55 % for apo (unbound) and 88 % for holo (bound) states.
92
+ On the **emulation of MD equilibrium distributions**, BioEmu achieves a mean absolute error of 0.9 kcal/mol using the above metric for the in-house dataset.
93
  Variants of BioEmu trained and tested on a dataset of fast folding proteins reported previously (https://doi.org/10.1126/science.1208351) achieved a mean absolute error of 0.74 kcal/mol.
94
+ In the **protein stability prediction** tasks, we obtain a free energy mean absolute error of 0.9 kcal/mol relative to experimental measurements. The associated Spearman correlation coefficient is 0.6.
95
+ These results are reported in the [paper].
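
For readers who want to compute the same kind of summary on their own predictions, here is a minimal standard-library sketch of a mean absolute error and a Spearman rank correlation. The data values are made up for illustration, the helper names are hypothetical, and this is not the benchmark code.

```python
import math
from statistics import mean


def mean_absolute_error(predicted, reference):
    """Average of |prediction - reference| over all pairs."""
    return mean(abs(p - r) for p, r in zip(predicted, reference))


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Made-up folding free energies in kcal/mol, for illustration only.
experimental = [-1.0, 0.5, 2.0, 3.5]
predicted = [-0.5, 0.0, 2.5, 3.0]
mae = mean_absolute_error(predicted, experimental)
rho = spearman(predicted, experimental)
```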
96
 
97
+ All test datasets and code necessary to reproduce these results are released in a separate code package (https://github.com/microsoft/bioemu-benchmarks/tree/main). The results for all released model checkpoints are also included there.
98
 
99
  ## Technical Specifications
100
 
101
  ### Model Architecture and Objective
102
 
103
+ The BioEmu-v1 model uses the **DiG** architecture (https://www.nature.com/articles/s42256-024-00837-3), trained on a variety of datasets to sample systematically diverse structure ensembles. In the pretraining phase, we use denoising score matching to match the distribution of flexible protein structures curated from AFDB. In the fine-tuning phase, we combine a denoising score matching objective for molecular dynamics data with property prediction fine-tuning (PPFT) for matching the experimental folding free energies. For more details on PPFT, please see our [paper].
104
+ The BioEmu-v1.2 model adds extra embeddings for residue types and residue pairs.
105
 
106
  #### Software
107
 
 
111
 
112
  **BibTeX:**
113
  ```
114
+ @article{bioemu2025,
115
+ title={Scalable emulation of protein equilibrium ensembles with generative deep learning},
116
+ author={Lewis, Sarah and Hempel, Tim and Jim{\'e}nez-Luna, Jos{\'e} and Gastegger, Michael and Xie, Yu and Foong, Andrew YK and Satorras, Victor Garc{\'\i}a and Abdin, Osama and Veeling, Bastiaan S and Zaporozhets, Iryna and Chen, Yaoyi and Yang, Soojung and Foster, Adam E. and Schneuing, Arne and Nigam, Jigyasa and Barbero, Federico and Stimper, Vincent and Campbell, Andrew and Yim, Jason and Lienen, Marten and Shi, Yu and Zheng, Shuxin and Schulz, Hannes and Munir, Usman and Sordillo, Roberto and Tomioka, Ryota and Clementi, Cecilia and No{\'e}, Frank},
117
+ journal={Science},
118
+ pages={eadv9817},
119
+ year={2025},
120
+ publisher={American Association for the Advancement of Science},
121
+ doi={10.1126/science.adv9817}
122
  }
123
  ```
124
 
 
126
 
127
  We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected behavior in our technology, please contact us at:
128
  - Frank Noe (franknoe@microsoft.com)
129
+
130
  If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.
131
 
132
  ### Out-of-Scope Use
 
136
  The model is intended for research and experimental purposes. Further testing/development are needed before considering its application in real-world scenarios.
137
 
138
  ## Bias, Risks, and Limitations
139
+ Our model has been trained on a large variety of structurally resolved proteins, so it inherits the biases of this data (see [paper] for details).
140
  The current model has low prediction quality for protein-protein interactions, including multi-chain proteins, and does not feature explicit interactions with other chemical entities like small molecules.
141
  Besides experimental data, the model is trained on synthetic data, namely AlphaFold2 predictions and molecular dynamics simulations.
142
  We expect that the approximations of these models are propagated to BioEmu.
143
 
144
 
145
  ### Recommendations
146
+ We recommend using this model only for the purposes specified here or described in the [paper].
147
+ In particular, we advise against predicting entities that are not considered by the used embeddings or represented in the training data, including but not limited to multi-chain proteins.
148
+
149
+ [paper]: https://www.science.org/doi/10.1126/science.adv9817