{ "cells": [ { "cell_type": "markdown", "id": "338533a8", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "id": "43GQPw7R_ZbY", "metadata": { "id": "43GQPw7R_ZbY" }, "source": [ "# Inverse folding with ESM-IF1\n", "\n", "The ESM-IF1 inverse folding model is built for predicting protein sequences from their backbone atom coordinates. We provide examples here 1) to sample sequence designs for a given structure and 2) to score sequences for a given structure.\n", "\n", "Trained with 12M protein structures predicted by AlphaFold2, the ESM-IF1 model consists of invariant geometric input processing layers followed by a sequence-to-sequence transformer, and achieves 51% native sequence recovery on structurally held-out backbones. The model is also trained with span masking to tolerate missing backbone coordinates and therefore can predict sequences for partially masked structures.\n", "\n", "See [GitHub README](https://github.com/facebookresearch/esm/tree/main/examples/inverse_folding) for the complete user guide, and see our [bioRxiv pre-print](https://doi.org/10.1101/2022.04.10.487779) for more details." ] }, { "cell_type": "markdown", "id": "8c6TDUVupgNn", "metadata": { "id": "8c6TDUVupgNn" }, "source": [ "## Environment setup (colab)\n", "This step might take up to 10 minutes the first time. 
\n", "\n", "If using a local jupyter environment, instead of the following, we recommend configuring a conda environment upon first use in command line:\n", "```\n", "conda create -n inverse python=3.9\n", "conda activate inverse\n", "conda install pytorch cudatoolkit=11.3 -c pytorch\n", "conda install pyg -c pyg -c conda-forge\n", "conda install pip\n", "pip install biotite\n", "pip install git+https://github.com/facebookresearch/esm.git\n", "```\n", "\n", "Afterwards, `conda activate inverse` to activate this environment before starting `jupyter notebook`.\n", "\n", "Below is the setup for colab notebooks:\n", "\n", "We recommend using GPU runtimes on colab (Menu bar -> Runtime -> Change runtime type -> Hardware accelerator -> GPU)" ] }, { "cell_type": "code", "execution_count": 1, "id": "nLOymXwdwUXo", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nLOymXwdwUXo", "outputId": "894a9af0-d8e9-44f3-e869-830a3e3e90ad" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[K |████████████████████████████████| 7.9 MB 16.8 MB/s \n", "\u001b[K |████████████████████████████████| 3.5 MB 27.7 MB/s \n", "\u001b[K |████████████████████████████████| 2.5 MB 35.7 MB/s \n", "\u001b[K |████████████████████████████████| 750 kB 25.8 MB/s \n", "\u001b[K |████████████████████████████████| 407 kB 30.7 MB/s \n", "\u001b[?25h Building wheel for torch-geometric (setup.py) ... \u001b[?25l\u001b[?25hdone\n", " Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n", " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n", " Preparing wheel metadata ... \u001b[?25l\u001b[?25hdone\n", " Building wheel for fair-esm (PEP 517) ... \u001b[?25l\u001b[?25hdone\n", "\u001b[K |████████████████████████████████| 31.8 MB 1.2 MB/s \n", "\u001b[?25h Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n", " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n", " Preparing wheel metadata ... 
\u001b[?25l\u001b[?25hdone\n", " Building wheel for biotite (PEP 517) ... \u001b[?25l\u001b[?25hdone\n" ] } ], "source": [ "# Colab environment setup\n", "\n", "# Install the correct version of Pytorch Geometric.\n", "import torch\n", "\n", "def format_pytorch_version(version):\n", " return version.split('+')[0]\n", "\n", "TORCH_version = torch.__version__\n", "TORCH = format_pytorch_version(TORCH_version)\n", "\n", "def format_cuda_version(version):\n", " return 'cu' + version.replace('.', '')\n", "\n", "CUDA_version = torch.version.cuda\n", "CUDA = format_cuda_version(CUDA_version)\n", "\n", "!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html\n", "!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html\n", "!pip install -q torch-cluster -f https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html\n", "!pip install -q torch-spline-conv -f https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html\n", "!pip install -q torch-geometric\n", "\n", "# Install esm\n", "!pip install -q git+https://github.com/facebookresearch/esm.git\n", "\n", "# Install biotite\n", "!pip install -q biotite" ] }, { "cell_type": "markdown", "id": "EhDI4ZIX0Z4w", "metadata": { "id": "EhDI4ZIX0Z4w" }, "source": [ "### Verify that pytorch-geometric is correctly installed\n", "\n", "If the notebook crashes at the import, there is likely an issue with the version of torch_geometric and torch_sparse being incompatible with the torch version." 
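, "\n",
"\n",
"If the import fails, one quick diagnostic (an added suggestion, not from the original guide) is to print the torch and CUDA versions that were actually installed, since the PyG wheel index URL is keyed on both:\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"# These two strings must match the wheel index used above:\n",
"# https://data.pyg.org/whl/torch-{TORCH}+{CUDA}.html\n",
"TORCH = torch.__version__.split('+')[0]\n",
"CUDA = 'cu' + (torch.version.cuda or '').replace('.', '')  # '' on CPU-only builds\n",
"print('torch:', TORCH)\n",
"print('cuda:', CUDA)\n",
"```"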
] }, { "cell_type": "code", "execution_count": 2, "id": "1-HvBXt1wWPu", "metadata": { "id": "1-HvBXt1wWPu" }, "outputs": [], "source": [ "## Verify that pytorch-geometric is correctly installed\n", "import torch_geometric\n", "import torch_sparse\n", "from torch_geometric.nn import MessagePassing" ] }, { "cell_type": "markdown", "id": "18544eee", "metadata": { "id": "18544eee" }, "source": [ "## Load model\n", "This step takes a few minutes for the model to download.\n", "\n", "**UPDATE**: It is important to put the model in eval mode with `model = model.eval()` to disable random dropout for optimal performance." ] }, { "cell_type": "code", "execution_count": 3, "id": "14d8f393", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "14d8f393", "outputId": "d84afad9-0cd7-4e96-bedd-2bdae90229b0" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/chloehsu/inverse_opensource/esm_public_fork/esm/esm/pretrained.py:172: UserWarning: Regression weights not found, predicting contacts will not produce correct results.\n", " warnings.warn(\n" ] } ], "source": [ "import esm\n", "model, alphabet = esm.pretrained.esm_if1_gvp4_t16_142M_UR50()\n", "model = model.eval()" ] }, { "cell_type": "markdown", "id": "2a6ed9bb", "metadata": { "id": "2a6ed9bb" }, "source": [ "## Load structure from PDB or CIF files\n", "\n", "As an example, let's look at the Golgi casein kinase, the [PDB Molecule of the Month from January 2022](https://pdb101.rcsb.org/motm/265).\n", "\n", "Milk is a complex mixture of proteins, fats, and nutrients that provides everything that a growing infant needs. Most of the protein in cow’s milk is casein, whereas human milk has smaller amounts of casein. \n", "\n", "The Golgi casein kinase (PDB entry 5YH2) adds phosphates to casein and also to many other types of secreted proteins. It is most active as a complex of two similar types of proteins. Fam20C is the catalytic subunit. 
It binds to casein and transfers a phosphate from ATP to the protein.\n", "\n", "In this example, let's focus on chain C (the catalytic subunit Fam20C).\n", "\n", "You may also upload your own CIF or PDB file and specify the chain id." ] }, { "cell_type": "code", "execution_count": 4, "id": "a8hwuySfBDig", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "a8hwuySfBDig", "outputId": "fbe2c984-3e79-4b56-981f-cc11441e9d8e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2022-05-27 12:53:21-- https://files.rcsb.org/download/5YH2.cif\n", "Resolving files.rcsb.org (files.rcsb.org)... 132.249.210.222\n", "Connecting to files.rcsb.org (files.rcsb.org)|132.249.210.222|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: unspecified [application/octet-stream]\n", "Saving to: ‘data/5YH2.cif.1’\n", "\n", "5YH2.cif.1 [ <=> ] 1.39M --.-KB/s in 0.09s \n", "\n", "2022-05-27 12:53:21 (15.5 MB/s) - ‘data/5YH2.cif.1’ saved [1456779]\n", "\n" ] } ], "source": [ "!wget https://files.rcsb.org/download/5YH2.cif -P data/ # save this to the data folder in colab" ] }, { "cell_type": "markdown", "id": "h7EdhncQCWtU", "metadata": { "id": "h7EdhncQCWtU" }, "source": [ "Load chain C from this CIF file:" ] }, { "cell_type": "code", "execution_count": 5, "id": "f4d17649", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "f4d17649", "outputId": "42bf8119-8bdf-4612-f7cb-928b74f1b2a5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 6 chains: ['A' 'C' 'B' 'D' 'A' 'B'] \n", "\n", "Loaded chain C\n", "\n", "Native sequence:\n", 
"SVLQSLFEHPLYRTVLPDLTEEDTLFNLNAEIRLYPKAASESYPNWLRFHIGINRYELYSRHNPVIAALLRDLLSQKISSVGMKSGGTQLKLIMSFQNYGQALFKPMKQTREQETPPDFFYFSDFERHNAEIAAFHLDRILDFRRVPPVAGRLVNMTREIRDVTRDKKLWRTFFVSPANNICFYGECSYYCSTEHALCGKPDQIEGSLAAFLPDLALAKRKTWRNPWRRSYHKRKKAEWEVDPDYCDEVKQTPPYDRGTRLLDIMDMTIFDFLMGNMDRHHYETFEKFGNDTFIIHLDNGRGFGKHSHDEMSILVPLTQCCRVKRSTYLRLQLLAKEEYKLSSLMEESLLQDRLVPVLIKPHLEALDRRLRLVLKVLSDCVEKDGFSAVVENDLD\n" ] } ], "source": [ "fpath = 'data/5YH2.cif' # .pdb format is also acceptable\n", "chain_id = 'C'\n", "structure = esm.inverse_folding.util.load_structure(fpath, chain_id)\n", "coords, native_seq = esm.inverse_folding.util.extract_coords_from_structure(structure)\n", "print('Native sequence:')\n", "print(native_seq)" ] }, { "cell_type": "markdown", "id": "2YmG4jc5CCUJ", "metadata": { "id": "2YmG4jc5CCUJ" }, "source": [ "Visualize chain C in this CIF file:" ] }, { "cell_type": "code", "execution_count": 6, "id": "V4BbLME4DbZQ", "metadata": { "id": "V4BbLME4DbZQ" }, "outputs": [], "source": [ "!pip install -q py3Dmol" ] }, { "cell_type": "code", "execution_count": 7, "id": "PDUHDRMMBbZu", "metadata": { "id": "PDUHDRMMBbZu" }, "outputs": [], "source": [ "try:\n", " import py3Dmol\n", "\n", " def view_pdb(fpath, chain_id):\n", " with open(fpath) as ifile:\n", " system = \"\".join([x for x in ifile])\n", "\n", " view = py3Dmol.view(width=600, height=400)\n", " view.addModelsAsFrames(system)\n", " view.setStyle({'model': -1, 'chain': chain_id}, {\"cartoon\": {'color': 'spectrum'}})\n", " view.zoomTo()\n", " view.show()\n", "\n", "except ImportError:\n", " def view_pdb(fpath, chain_id):\n", " print(\"Install py3Dmol to visualize, or use pymol\")" ] }, { "cell_type": "code", "execution_count": 8, "id": "Q9MJYWMJJxVx", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 417 }, "id": "Q9MJYWMJJxVx", "outputId": "ddd957ea-7512-43bf-f730-4df188bf3393" }, "outputs": [ { "data": { "application/3dmoljs_load.v0": "
(py3Dmol interactive viewer output omitted)", "text/html": [ "(py3Dmol interactive viewer output omitted)
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "view_pdb(fpath, chain_id)" ] }, { "cell_type": "markdown", "id": "oKqm6n9HCk4Q", "metadata": { "id": "oKqm6n9HCk4Q" }, "source": [ "Sample sequence and calculate sequence recovery:" ] }, { "cell_type": "code", "execution_count": 9, "id": "50bb20b3", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "50bb20b3", "outputId": "27c6cb60-64b8-4f97-c0dd-c6e66b7af5e4" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sampled sequence: STLKQWFHSPLTLNDDPYSSAEDVLFSPTRVLSLLPALQGRDLPDFILFWTGIDKFHMYKIDSPVISSLRKRLRTERVREVQQLMGGSKLVLKFTFEDLGAAMFKPVAASTSEETPPHWEFWEVKKTSRAVVASWHLDYLFQLKSTAPSAGRVLNLVRDIRDVTSDSQLQSTFVQTPSRQLCFYGSNTFRSNLRDAICGNRDEIWGSVIAELPHANVALRDVYTSPWKRSKTKFKISLWRTKPDYAKYIRLTPLFKEGDNFLELQKAFIFDYLQGNDDRAKWVAFQKFGVDQELLICDNGAGFGRFDHLDMEILSPLQQGAYFSQNLYDHVIALQSSDYSMSRLMKSVLKADERYPILSEPYLLQLDTRLETVRKILDDCVREIGSNNSLIVEKL\n", "Sequence recovery: 0.3468354430379747\n" ] } ], "source": [ "import numpy as np\n", "\n", "sampled_seq = model.sample(coords, temperature=1)\n", "print('Sampled sequence:', sampled_seq)\n", "\n", "recovery = np.mean([(a==b) for a, b in zip(native_seq, sampled_seq)])\n", "print('Sequence recovery:', recovery)" ] }, { "cell_type": "code", "execution_count": 10, "id": "c82f7a48", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "c82f7a48", "outputId": "cfb96fa3-c7ad-4331-c070-0ef45b467608" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sampled sequence: ERLRAWWASPLTQLPDPGLSEEDLLFDPEELLALLPEEEEEELPAWLRFWTGIRRRRLYERESPDVEELLRRLRTARVRRVGQKSGGRSLVLRFEFEDLGSAGFKPRVAELDEETPPEWGFWEVLQRARAVVAAYRLDRLLDLRQVPPAAGRRLDLVTELRDVTDDEELRSTFFVTPEEELCFYGRCEFRCDREHALCGRPDVVEGALVAELPDERIAPRGVYLNPWAHARERDVEALWEVDPDYCEYVRRLPPFREGRLLLELANAYVFDFLMGNADRHTFSTFERFGLDTFLLLLDNGFGFGRADYLDERILRPLEQCCLLSERLYRRLLALSEEEFSLEELMEEELGRDELWPVLARPFLRQLDRRLRRVLEVLEECEEECGREEVLVREEG\n", "Sequence recovery: 0.4582278481012658\n" 
] } ], "source": [ "# A lower sampling temperature typically results in higher sequence recovery but less diversity\n", "\n", "sampled_seq = model.sample(coords, temperature=1e-6)\n", "print('Sampled sequence:', sampled_seq)\n", "\n", "recovery = np.mean([(a==b) for a, b in zip(native_seq, sampled_seq)])\n", "print('Sequence recovery:', recovery)" ] }, { "cell_type": "markdown", "id": "310b282a", "metadata": { "id": "310b282a" }, "source": [ "## Conditional sequence log-likelihoods for given backbone coordinates\n", "\n", "The log-likelihood scores can be used to predict mutational effects. See also our [script](https://github.com/facebookresearch/esm/tree/main/examples/inverse_folding#scoring-sequences) for batch scoring." ] }, { "cell_type": "code", "execution_count": 11, "id": "c7d41032", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "c7d41032", "outputId": "6919b9bf-9abb-44c0-8c40-446f4dcf63a8", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "average log-likelihood on entire sequence: -1.61 (perplexity 5.01)\n", "average log-likelihood excluding missing coordinates: -1.61 (perplexity 5.01)\n" ] } ], "source": [ "ll_fullseq, ll_withcoord = esm.inverse_folding.util.score_sequence(model, alphabet, coords, native_seq)\n", "\n", "print(f'average log-likelihood on entire sequence: {ll_fullseq:.2f} (perplexity {np.exp(-ll_fullseq):.2f})')\n", "print(f'average log-likelihood excluding missing coordinates: {ll_withcoord:.2f} (perplexity {np.exp(-ll_withcoord):.2f})')" ] }, { "cell_type": "markdown", "id": "9ab34ef2", "metadata": { "id": "9ab34ef2" }, "source": [ "## Masking part of the backbone coordinates\n", "To partially mask backbone coordinates, simply set the masked coordinates to `np.inf`.\n", "\n", "Typically, the sequence perplexity will be higher on masked regions than on unmasked regions."
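, "\n",
"\n",
"As a self-contained, shape-level sketch of the masking idiom (toy zero coordinates; the L x 3 x 3 layout is assumed to match what `extract_coords_from_structure` returns, with one N, CA, C atom triple per residue):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Toy stand-in for `coords`: L residues x 3 backbone atoms (N, CA, C) x 3 (x, y, z)\n",
"L = 50\n",
"coords_toy = np.zeros((L, 3, 3), dtype=np.float32)\n",
"\n",
"# Mask a contiguous span of residues by setting their coordinates to inf\n",
"masked_toy = coords_toy.copy()\n",
"masked_toy[10:20] = np.inf\n",
"\n",
"print(np.isinf(masked_toy).all(axis=(1, 2)).sum())  # 10 residues fully masked\n",
"```"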
] }, { "cell_type": "code", "execution_count": 12, "id": "da7f4c7a", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "da7f4c7a", "outputId": "16613c79-8060-47b5-deb1-10ab5d157dbf", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "average log-likelihood on entire sequence: -1.64 (perplexity 5.16)\n", "average log-likelihood excluding missing coordinates: -1.59 (perplexity 4.92)\n" ] } ], "source": [ "from copy import deepcopy\n", "masked_coords = deepcopy(coords)\n", "masked_coords[:15] = np.inf # mask the first 15 residues\n", "ll_fullseq, ll_withcoord = esm.inverse_folding.util.score_sequence(model, alphabet, masked_coords, native_seq)\n", "\n", "print(f'average log-likelihood on entire sequence: {ll_fullseq:.2f} (perplexity {np.exp(-ll_fullseq):.2f})')\n", "print(f'average log-likelihood excluding missing coordinates: {ll_withcoord:.2f} (perplexity {np.exp(-ll_withcoord):.2f})')" ] }, { "cell_type": "markdown", "id": "17cb6adc", "metadata": { "id": "17cb6adc" }, "source": [ "## Extract encoder output as structure representation\n", "The encoder output may also be used as a representation for the structure.\n", "\n", "For a set of input coordinates with L amino acids, the encoder output will have shape L x 512."
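, "\n",
"\n",
"If a single fixed-size vector per structure is more convenient downstream, one simple option (an added sketch, not part of the original notebook) is to mean-pool the per-residue representations:\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"# Random stand-in for the encoder output `rep`, which has shape L x 512\n",
"L = 395\n",
"rep_toy = torch.randn(L, 512)\n",
"\n",
"# Average over the residue dimension -> one 512-dim structure embedding\n",
"structure_embedding = rep_toy.mean(dim=0)\n",
"print(tuple(structure_embedding.shape))  # (512,)\n",
"```"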
] }, { "cell_type": "code", "execution_count": 13, "id": "0561236d", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0561236d", "outputId": "a07279d1-a668-439c-8d7e-00b221bdf6d2" }, "outputs": [ { "data": { "text/plain": [ "(395, torch.Size([395, 512]))" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rep = esm.inverse_folding.util.get_encoder_output(model, alphabet, coords)\n", "len(coords), rep.shape" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "include_colab_link": true, "name": "Inverse Folding with ESM-IF1.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }