Spaces:
Running
Running
File size: 11,179 Bytes
e76b79a | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 | # Inverse folding with ESM-IF1
The ESM-IF1 inverse folding model is built for predicting protein sequences
from their backbone atom coordinates. We provide scripts here 1) to sample sequence
designs for a given structure and 2) to score sequences for a given structure.
Trained with 12M protein structures predicted by AlphaFold2, the ESM-IF1
model consists of invariant geometric input processing layers followed by a
sequence-to-sequence transformer, and achieves 51% native sequence recovery on
structurally held-out backbones with 72% recovery for buried residues.
The model is also trained with span masking to tolerate missing backbone
coordinates and therefore can predict sequences for partially masked structures.
More details in our bioRxiv [pre-print](https://doi.org/10.1101/2022.04.10.487779).

## Recommended environment
It is highly recommended to start a new conda environment from scratch due to
potential CUDA compatability issues between pytorch and the pytorch-geometric
package required for the inverse folding model.
To set up a new conda environment with required packages,
```
conda create -n inverse python=3.9
conda activate inverse
conda install pytorch cudatoolkit=11.3 -c pytorch
conda install pyg -c pyg -c conda-forge
conda install pip
pip install biotite
pip install git+https://github.com/facebookresearch/esm.git
```
## Quickstart
### Sample sequence designs for a given structure
To sample sequences for a given structure in PDB or mmCIF format, use the
`sample_sequences.py` script. The input file can have either `.pdb` or
`.cif` as suffix.
For example, to sample 3 sequence designs for the golgi casein kinase structure
(PDB [5YH2](https://www.rcsb.org/structure/5yh2); [PDB Molecule of the Month
from January 2022](https://pdb101.rcsb.org/motm/265)), we can run the following
command from the `examples/inverse_folding` directory:
```
python sample_sequences.py data/5YH2.pdb \
--chain C --temperature 1 --num-samples 3 \
--outpath output/sampled_sequences.fasta
```
The sampled sequences will be saved in a fasta format to the specified output file.
**By default, the script only loads the backbone of the specified target chain
as model input.** To instead use the entire complex backbone as model input for
conditioning, use the `--multichain-backbone` flag to load all chains. (In the
example below, the encoder loads the backbone of all chains as input to the
encoder, and the decoder samples sequences for chain C.)
```
python sample_sequences.py data/5YH2.pdb \
--chain C --temperature 1 --num-samples 3 \
--outpath output/sampled_sequences_multichain.fasta \
--multichain-backbone
```
The temperature parameter controls the sharpness of the probability
distribution for sequence sampling. Higher sampling temperatures yield more
diverse sequences but likely with lower native sequence recovery.
The default sampling temperature is 1. To optimize for native sequence
recovery, we recommend sampling with low temperature such as 1e-6.
**We recommend trying both the single-chain and multi-chain design modes.** While in
our paper we showed that conditioning on the entire multi-chain backbone often
reduces perplexity and increases sequence recovery, on some proteins the
single-chain performance is better.
Sometimes, one failure mode in sampled sequences is a high number of repeated
amino acids, e.g. `EEEEEEEE`. We recommend checking for that and filtering out
sampled sequences with long repeats.
### Scoring sequences
To score the conditional log-likelihoods for sequences conditioned on a given
structure, use the `score_log_likelihoods.py` script.
For example, to score the sequences in `data/5YH2_mutated_seqs.fasta`
according to the structure in `data/5YH2.pdb`, we can run
the following command from the `examples/inverse_folding` directory:
```
python score_log_likelihoods.py data/5YH2.pdb \
data/5YH2_mutated_seqs.fasta --chain C \
--outpath output/5YH2_mutated_seqs_scores.csv
```
The conditional log-likelihoods are saved in a csv format in the specified output path.
The output values are the average log-likelihoods averaged over all amino acids in a sequence.
**By default, the script only loads the backbone of the specified target chain
as model input.** To instead use the entire complex backbone as model input for
conditioning, use the `--multichain-backbone` flag to load all chains. (In the
example below, the encoder loads the backbone of all chains as input to the
encoder, and the decoder scores sequences for chain C.)
```
python score_log_likelihoods.py data/5YH2.pdb \
data/5YH2_mutated_seqs.fasta --chain C \
--outpath output/5YH2_mutated_seqs_scores.csv \
--multichain-backbone
```
We recommend trying both the single-chain and multi-chain design modes. While in
our paper we showed that conditioning on the entire multi-chain backbone often
reduces perplexity and increases sequence recovery, on some proteins the
single-chain performance is better.
## General usage
### Load model
The `esm_if1_gvp4_t16_142M_UR50` function loads the pretrained model and its
corresponding alphabet. The alphabet represents the amino acids and the special
tokens encoded by the model.
**Update**: It is important to set the model in eval mode to avoid random
dropout from training mode for best performance.
```
import esm.inverse_folding
model, alphabet = esm.pretrained.esm_if1_gvp4_t16_142M_UR50()
model = model.eval()
```
### Input format
The input to the model is a list of backbone atom coordinates for the N, CA, C
atoms in each amino acid. For each structure, the coordinate list `coords` would
be of shape L x 3 x 3, where L is the number of amino acids in the structure.
`coords[i][0]` is the 3D coordinate for the N atom in amino acid `i`,
`coords[i][1]` is the 3D coordinate for the CA atom in amino acid `i`, and
`coords[i][2]` is the 3D coordinate for the C atom in amino acid `i`.
### Load input data from PDB and mmCIF file formats
To load a single chain from PDB and mmCIF file formats and extract the backbone
coordinates of the N, CA, C atoms as model input,
```
import esm.inverse_folding
structure = esm.inverse_folding.util.load_structure(fpath, chain_id)
coords, seq = esm.inverse_folding.util.extract_coords_from_structure(structure)
```
Note this only loads the specified chain.
To load multiple chains for the multichain complex use cases, list all chain ids
when loading the structure, e.g. `chain_ids = ['A', 'B', 'C']`:
```
structure = esm.inverse_folding.util.load_structure(fpath, chain_ids)
coords, native_seqs = esm.inverse_folding.multichain_util.extract_coords_from_complex(structure)
```
### Example Jupyter notebook
See `examples/inverse_folding/notebook.ipynb` for examples of sampling sequences,
calculating conditional log-likelihoods, and extracting encoder output as
structure representation (on a single chain).
This notebook is also available on colab:
[<img src="https://colab.research.google.com/assets/colab-badge.svg">](https://colab.research.google.com/github/facebookresearch/esm/blob/master/examples/inverse_folding/notebook.ipynb)
For multichain complexes, ESM-IF1 can design sequences for a specific chain in the complex,
conditioned on the backbone structure of the entire multichain complex.
See `examples/inverse_folding/notebook_multichain.ipynb` for sequence design and sequence scoring in multichain complexes, or find the notebook on colab:
[<img src="https://colab.research.google.com/assets/colab-badge.svg">](https://colab.research.google.com/github/facebookresearch/esm/blob/master/examples/inverse_folding/notebook_multichain.ipynb)
### Sample sequence designs
To sample sequences for a given set of backbone coordinates for a single chain,
```
sampled_seq = model.sample(coords, temperature=T)
```
where `coords` is an array as described in the above section on input format.
To sample sequences for a given chain in a multichain complex,
```
import esm.inverse_folding
sampled_seq = esm.inverse_folding.multichain_util.sample_sequence_in_complex(
model, coords, target_chain_id, temperature=T
)
```
where `coords` is a dictionary mapping chain ids to backbone coordinate arrays.
The temperature parameter controls the ``sharpness`` of the probability
distribution for sequence sampling. Higher sampling temperatures yield more
diverse sequences but likely with lower native sequence recovery.
The default sampling temperature is `T=1`. To optimize for native sequence
recovery, we recommend sampling with low temperature such as `T=1e-6`.
### Scoring sequences
To score the conditional log-likelihoods for sequences conditioned on a given
set of backbone coordinates for a single chain, use the `score_sequence` function,
```
ll_fullseq, ll_withcoord = esm.inverse_folding.util.score_sequence(model, alphabet, coords, seq)
```
The first returned value ``ll_fullseq`` is the average log-likelihood averaged
over all amino acids in a sequence.
The second return value ``ll_withcoord`` is averaged only over those amino acids
with associated backbone coordinates in the input, i.e., excluding those with
missing backbone coordinates.
For multichain complexes,
```
ll_fullseq, ll_withcoord = esm.inverse_folding.multichain_util.score_sequence_in_complex(
model, alphabet, coords, target_chain_id, target_seq
)
```
where `coords` is a dictionary mapping chain ids to backbone coordinate arrays.
### Partially masking backbone coordinates
To mask a parts of the input backbone coordinates, simply set those coordinate
values to `np.inf`. For example, to mask the backbone coordinates for the first
ten amino acid in the structure,
```
coords[:10, :] = float('inf')
```
### Encoder output as structure representation
To extract the encoder output as structure representation,
```
rep = esm.inverse_folding.util.get_encoder_output(model, alphabet, coords)
```
For a set of input coordinates with L amino acids, the encoder output will have
shape L x 512.
Or, for multichain complex,
```
rep = esm.inverse_folding.multichain_util.get_encoder_output_for_complex(
model, alphabet, coords, target_chain_id
)
```
## Data split
The CATH v4.3 data are available at the following links:
- [Backbone coordinates and sequences](https://dl.fbaipublicfiles.com/fair-esm/data/cath4.3_topologysplit_202206/chain_set.jsonl)
- [Split](https://dl.fbaipublicfiles.com/fair-esm/data/cath4.3_topologysplit_202206/splits.json)
That's it for now, have fun!
## Acknowledgements
The invariant geometric input processing layers are from the [Geometric Vector
Perceptron PyTorch repo](https://github.com/drorlab/gvp-pytorch) by Bowen Jing,
Stephan Eismann, Pratham Soni, Patricia Suriana, Raphael Townshend, and Ron
Dror.
The input data pipeline is adapted from the [Geometric Vector Perceptron PyTorch
repo](https://github.com/drorlab/gvp-pytorch) and the [Generative Models for
Graph-Based Protein Design
repo](https://github.com/jingraham/neurips19-graph-protein-design) by John
Ingraham, Vikas Garg, Regina Barzilay, and Tommi Jaakkola.
The Transformer implementation is adapted from
[fairseq](https://github.com/pytorch/fairseq).
|