{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "gpuType": "T4" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "source": [ "# PIPES-M: Protease Inhibitor Prediction Using Evolutionary Scale Modeling (ESM-2)\n", "\n", "## Overview\n", "\n", "This Google Colab notebook provides a user-friendly interface for inference with **PIPES-M**, a deep learning-based binary classifier designed to predict protease inhibitor (PI) activity from primary protein sequences.\n", "\n", "PIPES-M enables rapid screening of small secreted protease inhibitors (<250 amino acids) in large-scale genomic, transcriptomic, or proteomic datasets, where experimental validation is resource-intensive.\n", "\n", "The model assigns each input sequence to one of two classes: \n", "- **Positive (Potential PI)**: Predicted to exhibit protease inhibitor activity \n", "- **Negative (Non-PI)**: Predicted to lack protease inhibitor activity \n", "\n",
"Output includes: \n", "- Probability of the positive class (`prob_class_1`): ranges from 0 (low likelihood) to 1 (high likelihood of PI activity) \n", "- Confidence score: probability of the predicted class \n", "\n", "## Model Architecture and Training\n", "\n", "PIPES-M is a fine-tuned sequence classification model built on the **ESM-2** protein language model: \n", "- Base model: `facebook/esm2_t30_150M_UR50D` (150 million parameters, 30 layers) \n", "- Pre-trained on UniRef50 via masked language modeling \n", "\n", "Fine-tuning was performed on a high-quality curated dataset comprising: \n", "- Positive examples: known protease inhibitors (<250 AA) from the MEROPS database \n", "- Negative examples: non-inhibitors selected from UniProt using sequence similarity and Pfam domain analysis \n", "\n",
"Training used sequence-only input, requiring no structural data. The classification head leverages evolutionary and physicochemical features encoded by ESM-2.\n", "\n", "Maximum sequence length is fixed at 250 residues; longer sequences are truncated to their N-terminal 250 residues, appropriate for the typical size range of small secreted inhibitors.\n", "\n", "## Input Requirements\n", "\n", "- Multi-FASTA formatted file containing one or more protein sequences \n", "- Sequences must use standard single-letter amino acid codes \n", "- FASTA headers (lines beginning with `>`) are retained for identification \n", "\n", "## Output Columns\n", "\n", "- `header`: Original FASTA identifier \n", "- `predicted_class`: \"Positive (Potential PI)\" or \"Negative (Non-PI)\" \n", "- `confidence`: Probability of the assigned class \n", "- `prob_class_1`: Raw probability of protease inhibitor activity \n", "- `prob_class_0`: Probability of the negative class \n", "\n",
"## Usage Notes\n", "\n", "- Intended for research and high-throughput screening \n", "- Positive predictions suggest potential PI activity and warrant experimental follow-up \n", "- Optimal performance is achieved on secreted or extracellular proteins, reflecting the composition of the training data \n", "- Predictions rely solely on the provided sequence; no homology search or multiple sequence alignment is performed \n", "\n", "## Model Availability\n", "\n", "The fine-tuned PIPES-M model is publicly hosted on Hugging Face: \n", "https://huggingface.co/MuthuS97/PIPES-M\n", "\n", "## Citation\n", "\n", "When using PIPES-M in research, please reference the model repository and any associated forthcoming publication.\n", "\n", "---\n", "\n", "**Instructions** \n", "1. Enable GPU acceleration: Runtime → Change runtime type → Hardware accelerator → GPU (T4 recommended). \n", "2. Execute all cells in sequence (Runtime → Run all). \n", "3. Upload your multi-FASTA file in the designated section to obtain predictions."
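, "\n", "A minimal multi-FASTA input might look like this (the headers and residues below are illustrative placeholders, not real inhibitor sequences):\n", "\n", "```\n", ">candidate_1 putative small secreted inhibitor\n", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ\n", ">candidate_2 putative small secreted inhibitor\n", "MASMTGGQQMGRGSEF\n", "```"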
], "metadata": { "id": "HXIULYjtVADA" } }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nS8lo9EWRYQ5", "outputId": "4e8008e9-7048-4377-a291-cbc2165293de" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Required packages installed successfully\n" ] } ], "source": [ "# @title 0. Install Required Packages\n", "\n", "!pip install --quiet transformers huggingface_hub\n", "\n", "print(\"Required packages installed successfully\")" ] }, { "cell_type": "code", "source": [ "# @title 1. Initialization and Setup\n", "\n", "mount_drive = True # @param {type:\"boolean\"}\n", "if mount_drive:\n", " from google.colab import drive\n", " drive.mount('/content/drive')\n", " print(\"Google Drive mounted at /content/drive\")\n", "\n", "MAX_LEN = 250 # @param {type:\"integer\"}\n", "BATCH_SIZE = 16 # @param {type:\"integer\"}\n", "\n", "import torch\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "print(f\"Using device: {device}\")\n", "\n", "import pandas as pd\n", "import numpy as np\n", "from IPython.display import display, HTML\n", "from google.colab import files\n", "\n", "print(\"Initialization complete\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1-COdhW1Thl4", "outputId": "f451fa6a-baa1-456d-81d1-a1b1b52d64e4" }, "execution_count": 14, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n", "Google Drive mounted at /content/drive\n", "Using device: cuda\n", "Initialization complete\n" ] } ] }, { "cell_type": "code", "source": [ "# @title 2. Load PIPES-M Model\n", "\n", "from transformers import AutoTokenizer, EsmForSequenceClassification\n", "\n", "MODEL_ID = \"MuthuS97/PIPES-M\"\n", "\n", "print(f\"Loading tokenizer and model from {MODEL_ID}\")\n", "tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)\n", "model = EsmForSequenceClassification.from_pretrained(MODEL_ID)\n", "\n", "model.to(device)\n", "model.eval()\n", "\n", "print(\"Model loaded successfully\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "8FgPxVrQT_z_", "outputId": "12fee169-e8d7-49f7-9812-5d7601aafa03" }, "execution_count": 15, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Loading tokenizer and model from MuthuS97/PIPES-M\n", "Model loaded successfully\n" ] } ] }, { "cell_type": "code", "source": [ "# @title 3. Upload Multi-FASTA File\n", "\n", "uploaded = files.upload()\n", "\n", "if not uploaded:\n", " raise ValueError(\"No file uploaded. Please provide a multi-FASTA file.\")\n", "\n", "fasta_filename = list(uploaded.keys())[0]\n", "print(f\"Uploaded file: {fasta_filename}\")\n", "\n", "def parse_fasta(content):\n", " headers = []\n", " sequences = []\n", " current_seq = []\n", " current_header = None\n", "\n", " for line in content.splitlines():\n", " line = line.strip()\n", " if line.startswith(\">\"):\n", " if current_header is not None:\n", " sequences.append(\"\".join(current_seq).upper().replace(\" \", \"\"))\n", " current_seq = []\n", " current_header = line[1:].strip()\n", " headers.append(current_header)\n", " else:\n", " if line:\n", " current_seq.append(line.upper().replace(\" \", \"\"))\n", "\n", " if current_header is not None:\n", " sequences.append(\"\".join(current_seq).upper().replace(\" \", \"\"))\n", "\n", " if len(headers) != len(sequences):\n", " raise ValueError(\"Parsing error: number of headers and sequences do not match\")\n", "\n", " return pd.DataFrame({\"header\": headers, \"sequence\": sequences})\n", "\n", "with open(fasta_filename, \"r\") as f:\n",
" fasta_content = f.read()\n", "\n", "df = parse_fasta(fasta_content)\n", "print(f\"Loaded {len(df)} sequences\")\n", "\n", "long_seqs = df[df['sequence'].str.len() > MAX_LEN]\n", "if len(long_seqs) > 0:\n", " print(f\"Warning: {len(long_seqs)} sequences exceed {MAX_LEN} residues and will be truncated\")\n", "\n", "display(df.head())" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 223 }, "id": "p_AfPGPNUQSU", "outputId": "65cc14f7-943f-4a3c-bb46-47b52d427a74" }, "execution_count": 16, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "text/html": [ "\n", " \n", " \n", " Upload widget is only available when the cell has been executed in the\n", " current browser session. Please rerun this cell to enable.\n", " \n", " " ] }, "metadata": {} }, { "output_type": "stream", "name": "stdout", "text": [ "Saving rcsb_pdb_6TME.fasta to rcsb_pdb_6TME.fasta\n", "Uploaded file: rcsb_pdb_6TME.fasta\n", "Loaded 2 sequences\n", "Warning: 1 sequences exceed 250 residues and will be truncated\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ " header \\\n", "0 6TME_1|Chains A, B|Pollen-specific leucine-ric... \n", "1 6TME_2|Chains C, D|Protein RALF-like 4|Arabido... \n", "\n", " sequence \n", "0 MELTDEEASFLTRRQLLALSENGDLPDDIEYEVDLDLKFANNRLKR... \n", "1 ARGRRYIGYDALKKNNVPCSRRGRSYYDCKKRRRNNPYRRGCSAIT... " ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"df\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"header\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"6TME_2|Chains C, D|Protein RALF-like 4|Arabidopsis thaliana (3702)\",\n \"6TME_1|Chains A, B|Pollen-specific leucine-rich repeat extensin-like protein 1|Arabidopsis thaliana (3702)\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sequence\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"ARGRRYIGYDALKKNNVPCSRRGRSYYDCKKRRRNNPYRRGCSAITHCYR\",\n \"MELTDEEASFLTRRQLLALSENGDLPDDIEYEVDLDLKFANNRLKRAYIALQAWKKAFYSDPFNTAANWVGPDVCSYKGVFCAPALDDPSVLVVAGIDLNHADIFGYLPPELGLLTDVALFHVNSNRFCGVIPKSLSKLTLMYEFDVSNNRFVGPFPTVALSWPSLKFLDIRYNDFEGKLPPEIFDKDLDAIFLNNNRFESTIPETIGKSTASVVTFAHNKFSGCIPKTIGQMKNLNEIVFIGNNLSGCLPNEIGSLNNVTVFDASSNGFVGSLPSTLSGLANVEQMDFSYNKFTGFVTDNICKLPKLSNFTFSYNFFNGEAQSCVPGSSQEKQFDDTSNCLQNRPNQKSAKECLPVVSRPVDCSKDKCAGG\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "# @title 4. Run Inference\n", "\n", "from torch.utils.data import DataLoader, TensorDataset\n", "\n", "print(\"Tokenizing sequences\")\n", "sequences = df['sequence'].tolist()\n", "encoded = tokenizer(\n", " sequences,\n", " padding=True,\n", " truncation=True,\n", " max_length=MAX_LEN,\n", " return_tensors=\"pt\"\n", ")\n", "\n", "dataset = TensorDataset(encoded['input_ids'], encoded['attention_mask'])\n", "dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False)\n", "\n", "all_probs = []\n", "all_preds = []\n", "\n", "print(\"Running inference\")\n", "with torch.no_grad():\n", " for i, batch in enumerate(dataloader):\n", " input_ids, attention_mask = [b.to(device) for b in batch]\n", " outputs = model(input_ids=input_ids, attention_mask=attention_mask)\n", " logits = outputs.logits\n", " probs = torch.softmax(logits, dim=1).cpu().numpy()\n", " preds = np.argmax(probs, axis=1)\n", " all_probs.extend(probs)\n", " all_preds.extend(preds)\n", "\n", " if (i + 1) % 10 == 0 or (i + 1) == len(dataloader):\n", " processed = min((i + 1) * BATCH_SIZE, len(sequences))\n", " print(f\"Processed {processed} of {len(sequences)} sequences\")\n", "\n", "print(\"Inference completed\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nwHd1DRVUn_e", "outputId": "96ebfb56-ae1c-4254-8476-c0814b924b13" }, "execution_count": 17, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Tokenizing sequences\n", "Running inference\n", "Processed 2 of 2 sequences\n", "Inference completed\n" ] } ] }, { "cell_type": "code", "source": [ "# @title 5. Results and Download\n", "\n", "confidence = [p[pred] for p, pred in zip(all_probs, all_preds)]\n", "df['predicted_class_id'] = all_preds\n", "df['confidence'] = confidence\n", "df['prob_class_0'] = [p[0] for p in all_probs]\n", "df['prob_class_1'] = [p[1] for p in all_probs]\n", "\n", "df['predicted_class'] = df['predicted_class_id'].map({\n", " 0: \"Negative (Non-PI)\",\n", " 1: \"Positive (Potential PI)\"\n", "})\n", "\n", "display(HTML(\"<h3>Prediction Results (first 10 sequences)</h3>\"))\n", "display(df[['header', 'predicted_class', 'confidence', 'prob_class_1']].head(10))\n", "\n", "print(\"\\nClass distribution\")\n", "counts = df['predicted_class'].value_counts()\n", "for label, count in counts.items():\n", " percentage = count / len(df) * 100\n", " print(f\"{label}: {count} sequences ({percentage:.1f}%)\")\n", "\n", "output_csv = \"PIPES-M_predictions.csv\"\n", "df.to_csv(output_csv, index=False)\n", "\n", "if mount_drive:\n", " drive_path = \"/content/drive/MyDrive/PIPES-M_predictions.csv\"\n", " df.to_csv(drive_path, index=False)\n", " print(f\"\\nResults also saved to Google Drive: {drive_path}\")\n", "\n", "print(f\"\\nResults saved as {output_csv}\")\n", "files.download(output_csv)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 278 }, "id": "A3fPg8TaUu2k", "outputId": "bdd02de6-60a6-4236-d09b-e7af9319fc8e" }, "execution_count": 18, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "text/html": [ "<h3>Prediction Results (first 10 sequences)</h3>" ] }, "metadata": {} }, { "output_type": "display_data", "data": { "text/plain": [ " header predicted_class \\\n", "0 6TME_1|Chains A, B|Pollen-specific leucine-ric... Positive (Potential PI) \n", "1 6TME_2|Chains C, D|Protein RALF-like 4|Arabido... Positive (Potential PI) \n", "\n", " confidence prob_class_1 \n", "0 0.947041 0.947041 \n", "1 0.965963 0.965963 " ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"files\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"header\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"6TME_2|Chains C, D|Protein RALF-like 4|Arabidopsis thaliana (3702)\",\n \"6TME_1|Chains A, B|Pollen-specific leucine-rich repeat extensin-like protein 1|Arabidopsis thaliana (3702)\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"predicted_class\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"Positive (Potential PI)\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"confidence\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 2,\n \"samples\": [\n 0.9659631848335266\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"prob_class_1\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 2,\n \"samples\": [\n 0.9659631848335266\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {} }, { "output_type": "stream", "name": "stdout", "text": [ "\n", "Class distribution\n", "Positive (Potential PI): 2 sequences (100.0%)\n", "\n", "Results also saved to Google Drive: /content/drive/MyDrive/PIPES-M_predictions.csv\n", "\n", "Results saved as PIPES-M_predictions.csv\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "" ], "application/javascript": [ "\n", " async function download(id, filename, size) {\n", " if (!google.colab.kernel.accessAllowed) {\n", " return;\n", " }\n", " const div = document.createElement('div');\n", " const label = document.createElement('label');\n", " label.textContent = \`Downloading \"${filename}\": \`;\n", " div.appendChild(label);\n", " const progress = document.createElement('progress');\n", " progress.max = size;\n", " div.appendChild(progress);\n", " document.body.appendChild(div);\n", "\n", " const buffers = [];\n", " let downloaded = 0;\n", "\n", " const channel = await google.colab.kernel.comms.open(id);\n", " // Send a message to notify the kernel that we're ready.\n", " channel.send({})\n", "\n", " for await (const message of channel.messages) {\n", " // Send a message to notify the kernel that we're ready.\n", " channel.send({})\n", " if (message.buffers) {\n", " for (const buffer of message.buffers) {\n", " buffers.push(buffer);\n", " downloaded += buffer.byteLength;\n", " progress.value = downloaded;\n", " }\n", " }\n", " }\n", " const blob = new Blob(buffers, {type: 'application/binary'});\n", " const a = document.createElement('a');\n", " a.href = window.URL.createObjectURL(blob);\n", " a.download = filename;\n", " div.appendChild(a);\n", " a.click();\n", " div.remove();\n", " }\n", " " ] }, "metadata": {} }, { "output_type": "display_data", "data": { "text/plain": [ "" ], "application/javascript": [ "download(\"download_b408fcdf-a1a5-4daf-973f-d965a8b95af4\", \"PIPES-M_predictions.csv\", 807)" ] }, "metadata": {} } ] } ] }