Commit 3b82921 by MuthuS97 · verified · 1 parent: 1e3a48a

Upload PIPES_M.ipynb

Files changed (1)
  1. PIPES_M.ipynb +1211 -0
PIPES_M.ipynb ADDED
@@ -0,0 +1,1211 @@
1
+ {
2
+ "nbformat": 4,
3
+ "nbformat_minor": 0,
4
+ "metadata": {
5
+ "colab": {
6
+ "provenance": [],
7
+ "gpuType": "T4"
8
+ },
9
+ "kernelspec": {
10
+ "name": "python3",
11
+ "display_name": "Python 3"
12
+ },
13
+ "language_info": {
14
+ "name": "python"
15
+ },
16
+ "accelerator": "GPU"
17
+ },
18
+ "cells": [
19
+ {
20
+ "cell_type": "markdown",
21
+ "source": [
22
+ "# PIPES-M: Protease Inhibitor Prediction Using Evolutionary Scale Modeling (ESM-2)\n",
23
+ "\n",
24
+ "## Overview\n",
25
+ "\n",
26
+ "This Google Colab notebook provides a user-friendly interface for inference with **PIPES-M**, a deep learning-based binary classifier designed to predict protease inhibitor (PI) activity from primary protein sequences.\n",
27
+ "\n",
28
+ "PIPES-M enables rapid screening of small secreted protease inhibitors (<250 amino acids) in large-scale genomic, transcriptomic, or proteomic datasets, where experimental validation is resource-intensive.\n",
29
+ "\n",
30
+ "The model assigns each input sequence to one of two classes: \n",
31
+ "- **Positive (Potential PI)**: Predicted to exhibit protease inhibitor activity \n",
32
+ "- **Negative (Non-PI)**: Predicted to lack protease inhibitor activity \n",
33
+ "\n",
34
+ "Output includes: \n",
35
+ "- Probability of the positive class (`prob_class_1`): ranges from 0 (low likelihood) to 1 (high likelihood of PI activity) \n",
36
+ "- Confidence score: probability of the predicted class \n",
37
+ "\n",
38
+ "## Model Architecture and Training\n",
39
+ "\n",
40
+ "PIPES-M is a fine-tuned sequence classification model built on the **ESM-2** protein language model: \n",
41
+ "- Base model: `facebook/esm2_t30_150M_UR50D` (150 million parameters, 30 layers) \n",
42
+ "- Pre-trained on UniRef50 via masked language modeling \n",
43
+ "\n",
44
+ "Fine-tuning was performed on a high-quality curated dataset comprising: \n",
45
+ "- Positive examples: known protease inhibitors (<250 AA) from the MEROPS database \n",
46
+ "- Negative examples: non-inhibitors selected from UniProt using sequence similarity and Pfam domain analysis \n",
47
+ "\n",
48
+ "Training used sequence-only input, requiring no structural data. The classification head leverages evolutionary and physicochemical features encoded by ESM-2. \n",
49
+ "\n",
50
+ "Maximum sequence length is fixed at 250 residues; longer sequences are truncated, keeping the N-terminal portion, which suits the typical size range of small secreted inhibitors.\n",
51
+ "\n",
52
+ "## Input Requirements\n",
53
+ "\n",
54
+ "- Multi-FASTA formatted file containing one or more protein sequences \n",
55
+ "- Sequences must use standard single-letter amino acid codes \n",
56
+ "- FASTA headers (lines beginning with `>`) are retained for identification \n",
57
+ "\n",
58
+ "## Output Columns\n",
59
+ "\n",
60
+ "- `header`: Original FASTA identifier \n",
61
+ "- `predicted_class`: \"Positive (Potential PI)\" or \"Negative (Non-PI)\" \n",
62
+ "- `confidence`: Probability of the assigned class \n",
63
+ "- `prob_class_1`: Raw probability of protease inhibitor activity \n",
64
+ "- `prob_class_0`: Probability of the negative class \n",
65
+ "\n",
66
+ "## Usage Notes\n",
67
+ "\n",
68
+ "- Intended for research and high-throughput screening \n",
69
+ "- Positive predictions suggest potential PI activity and warrant experimental follow-up \n",
70
+ "- Optimal performance is achieved on secreted or extracellular proteins, reflecting the composition of the training data \n",
71
+ "- Predictions rely solely on the provided sequence; no homology search or multiple sequence alignment is performed \n",
72
+ "\n",
73
+ "## Model Availability\n",
74
+ "\n",
75
+ "The fine-tuned PIPES-M model is publicly hosted on Hugging Face: \n",
76
+ "https://huggingface.co/MuthuS97/PIPES-M\n",
77
+ "\n",
78
+ "## Citation\n",
79
+ "\n",
80
+ "When using PIPES-M in research, please reference the model repository and any associated forthcoming publication.\n",
81
+ "\n",
82
+ "---\n",
83
+ "\n",
84
+ "**Instructions** \n",
85
+ "1. Enable GPU acceleration: Runtime → Change runtime type → Hardware accelerator → GPU (T4 recommended). \n",
86
+ "2. Execute all cells in sequence (Runtime → Run all). \n",
87
+ "3. Upload your multi-FASTA file in the designated section to obtain predictions."
88
+ ],
89
+ "metadata": {
90
+ "id": "HXIULYjtVADA"
91
+ }
92
+ },
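The 250-residue limit described in the overview interacts with the tokenizer's special tokens. The sketch below is a minimal, dependency-free illustration of that arithmetic. It assumes, without confirmation from the notebook, that the tokenizer adds one start and one end token and truncates on the right, so that `max_length=250` would leave room for 248 residues. `N_SPECIAL_TOKENS` and `residues_kept` are hypothetical names introduced only for this example.

```python
# Hypothetical illustration; not part of the PIPES-M notebook.
# Assumption: the tokenizer adds two special tokens (start and end) and
# truncates on the right, so the N-terminal residues are the ones kept.
N_SPECIAL_TOKENS = 2


def residues_kept(sequence: str, max_length: int = 250) -> str:
    """Return the part of `sequence` that fits within `max_length` tokens."""
    budget = max(max_length - N_SPECIAL_TOKENS, 0)
    return sequence[:budget]


if __name__ == "__main__":
    long_seq = "M" + "A" * 299  # a 300-residue toy sequence
    kept = residues_kept(long_seq)
    print(len(long_seq), "->", len(kept))
```

If this assumption holds, sequences of 249 or 250 residues would also lose their last residues even though the notebook's length warning only fires above `MAX_LEN`.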
93
+ {
94
+ "cell_type": "code",
95
+ "execution_count": 13,
96
+ "metadata": {
97
+ "colab": {
98
+ "base_uri": "https://localhost:8080/"
99
+ },
100
+ "id": "nS8lo9EWRYQ5",
101
+ "outputId": "4e8008e9-7048-4377-a291-cbc2165293de"
102
+ },
103
+ "outputs": [
104
+ {
105
+ "output_type": "stream",
106
+ "name": "stdout",
107
+ "text": [
108
+ "Required packages installed successfully\n"
109
+ ]
110
+ }
111
+ ],
112
+ "source": [
113
+ "# @title 0. Install Required Packages\n",
114
+ "\n",
115
+ "!pip install --quiet transformers huggingface_hub\n",
116
+ "\n",
117
+ "print(\"Required packages installed successfully\")"
118
+ ]
119
+ },
120
+ {
121
+ "cell_type": "code",
122
+ "source": [
123
+ "# @title 1. Initialization and Setup\n",
124
+ "\n",
125
+ "mount_drive = True # @param {type:\"boolean\"}\n",
126
+ "if mount_drive:\n",
127
+ "    from google.colab import drive\n",
128
+ "    drive.mount('/content/drive')\n",
129
+ "    print(\"Google Drive mounted at /content/drive\")\n",
130
+ "\n",
131
+ "MAX_LEN = 250 # @param {type:\"integer\"}\n",
132
+ "BATCH_SIZE = 16 # @param {type:\"integer\"}\n",
133
+ "\n",
134
+ "import torch\n",
135
+ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
136
+ "print(f\"Using device: {device}\")\n",
137
+ "\n",
138
+ "import pandas as pd\n",
139
+ "import numpy as np\n",
140
+ "from IPython.display import display, HTML\n",
141
+ "from google.colab import files\n",
142
+ "\n",
143
+ "print(\"Initialization complete\")"
144
+ ],
145
+ "metadata": {
146
+ "colab": {
147
+ "base_uri": "https://localhost:8080/"
148
+ },
149
+ "id": "1-COdhW1Thl4",
150
+ "outputId": "f451fa6a-baa1-456d-81d1-a1b1b52d64e4"
151
+ },
152
+ "execution_count": 14,
153
+ "outputs": [
154
+ {
155
+ "output_type": "stream",
156
+ "name": "stdout",
157
+ "text": [
158
+ "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n",
159
+ "Google Drive mounted at /content/drive\n",
160
+ "Using device: cuda\n",
161
+ "Initialization complete\n"
162
+ ]
163
+ }
164
+ ]
165
+ },
166
+ {
167
+ "cell_type": "code",
168
+ "source": [
169
+ "# @title 2. Load PIPES-M Model\n",
170
+ "\n",
171
+ "from transformers import AutoTokenizer, EsmForSequenceClassification\n",
172
+ "\n",
173
+ "MODEL_ID = \"MuthuS97/PIPES-M\"\n",
174
+ "\n",
175
+ "print(f\"Loading tokenizer and model from {MODEL_ID}\")\n",
176
+ "tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)\n",
177
+ "model = EsmForSequenceClassification.from_pretrained(MODEL_ID)\n",
178
+ "\n",
179
+ "model.to(device)\n",
180
+ "model.eval()\n",
181
+ "\n",
182
+ "print(\"Model loaded successfully\")"
183
+ ],
184
+ "metadata": {
185
+ "colab": {
186
+ "base_uri": "https://localhost:8080/"
187
+ },
188
+ "id": "8FgPxVrQT_z_",
189
+ "outputId": "12fee169-e8d7-49f7-9812-5d7601aafa03"
190
+ },
191
+ "execution_count": 15,
192
+ "outputs": [
193
+ {
194
+ "output_type": "stream",
195
+ "name": "stdout",
196
+ "text": [
197
+ "Loading tokenizer and model from MuthuS97/PIPES-M\n",
198
+ "Model loaded successfully\n"
199
+ ]
200
+ }
201
+ ]
202
+ },
203
+ {
204
+ "cell_type": "code",
205
+ "source": [
206
+ "# @title 3. Upload Multi-FASTA File\n",
207
+ "\n",
208
+ "uploaded = files.upload()\n",
209
+ "\n",
210
+ "if not uploaded:\n",
211
+ "    raise ValueError(\"No file uploaded. Please provide a multi-FASTA file.\")\n",
212
+ "\n",
213
+ "fasta_filename = list(uploaded.keys())[0]\n",
214
+ "print(f\"Uploaded file: {fasta_filename}\")\n",
215
+ "\n",
216
+ "def parse_fasta(content):\n",
217
+ "    headers = []\n",
218
+ "    sequences = []\n",
219
+ "    current_seq = []\n",
220
+ "    current_header = None\n",
221
+ "\n",
222
+ "    for line in content.splitlines():\n",
223
+ "        line = line.strip()\n",
224
+ "        if line.startswith(\">\"):\n",
225
+ "            if current_header is not None:\n",
226
+ "                sequences.append(\"\".join(current_seq).upper().replace(\" \", \"\"))\n",
227
+ "            current_seq = []\n",
228
+ "            current_header = line[1:].strip()\n",
229
+ "            headers.append(current_header)\n",
230
+ "        else:\n",
231
+ "            if line:\n",
232
+ "                current_seq.append(line.upper().replace(\" \", \"\"))\n",
233
+ "\n",
234
+ "    if current_header is not None:\n",
235
+ "        sequences.append(\"\".join(current_seq).upper().replace(\" \", \"\"))\n",
236
+ "\n",
237
+ "    if len(headers) != len(sequences):\n",
238
+ "        raise ValueError(\"Parsing error: number of headers and sequences do not match\")\n",
239
+ "\n",
240
+ "    return pd.DataFrame({\"header\": headers, \"sequence\": sequences})\n",
241
+ "\n",
242
+ "with open(fasta_filename, \"r\") as f:\n",
243
+ "    fasta_content = f.read()\n",
244
+ "\n",
245
+ "df = parse_fasta(fasta_content)\n",
246
+ "print(f\"Loaded {len(df)} sequences\")\n",
247
+ "\n",
248
+ "long_seqs = df[df['sequence'].str.len() > MAX_LEN]\n",
249
+ "if len(long_seqs) > 0:\n",
250
+ "    print(f\"Warning: {len(long_seqs)} sequences exceed {MAX_LEN} residues and will be truncated\")\n",
251
+ "\n",
252
+ "display(df.head())"
253
+ ],
254
+ "metadata": {
255
+ "colab": {
256
+ "base_uri": "https://localhost:8080/",
257
+ "height": 223
258
+ },
259
+ "id": "p_AfPGPNUQSU",
260
+ "outputId": "65cc14f7-943f-4a3c-bb46-47b52d427a74"
261
+ },
262
+ "execution_count": 16,
263
+ "outputs": [
264
+ {
459
+ "output_type": "stream",
460
+ "name": "stdout",
461
+ "text": [
462
+ "Saving rcsb_pdb_6TME.fasta to rcsb_pdb_6TME.fasta\n",
463
+ "Uploaded file: rcsb_pdb_6TME.fasta\n",
464
+ "Loaded 2 sequences\n",
465
+ "Warning: 1 sequences exceed 250 residues and will be truncated\n"
466
+ ]
467
+ },
468
+ {
469
+ "output_type": "display_data",
470
+ "data": {
471
+ "text/plain": [
472
+ " header \\\n",
473
+ "0 6TME_1|Chains A, B|Pollen-specific leucine-ric... \n",
474
+ "1 6TME_2|Chains C, D|Protein RALF-like 4|Arabido... \n",
475
+ "\n",
476
+ " sequence \n",
477
+ "0 MELTDEEASFLTRRQLLALSENGDLPDDIEYEVDLDLKFANNRLKR... \n",
478
+ "1 ARGRRYIGYDALKKNNVPCSRRGRSYYDCKKRRRNNPYRRGCSAIT... "
479
+ ],
480
+ "text/html": [
481
+ "\n",
482
+ " <div id=\"df-360f9a8b-05ae-47e8-af43-319d0dbf4606\" class=\"colab-df-container\">\n",
483
+ " <div>\n",
484
+ "<style scoped>\n",
485
+ " .dataframe tbody tr th:only-of-type {\n",
486
+ " vertical-align: middle;\n",
487
+ " }\n",
488
+ "\n",
489
+ " .dataframe tbody tr th {\n",
490
+ " vertical-align: top;\n",
491
+ " }\n",
492
+ "\n",
493
+ " .dataframe thead th {\n",
494
+ " text-align: right;\n",
495
+ " }\n",
496
+ "</style>\n",
497
+ "<table border=\"1\" class=\"dataframe\">\n",
498
+ " <thead>\n",
499
+ " <tr style=\"text-align: right;\">\n",
500
+ " <th></th>\n",
501
+ " <th>header</th>\n",
502
+ " <th>sequence</th>\n",
503
+ " </tr>\n",
504
+ " </thead>\n",
505
+ " <tbody>\n",
506
+ " <tr>\n",
507
+ " <th>0</th>\n",
508
+ " <td>6TME_1|Chains A, B|Pollen-specific leucine-ric...</td>\n",
509
+ " <td>MELTDEEASFLTRRQLLALSENGDLPDDIEYEVDLDLKFANNRLKR...</td>\n",
510
+ " </tr>\n",
511
+ " <tr>\n",
512
+ " <th>1</th>\n",
513
+ " <td>6TME_2|Chains C, D|Protein RALF-like 4|Arabido...</td>\n",
514
+ " <td>ARGRRYIGYDALKKNNVPCSRRGRSYYDCKKRRRNNPYRRGCSAIT...</td>\n",
515
+ " </tr>\n",
516
+ " </tbody>\n",
517
+ "</table>\n",
518
+ "</div>\n",
727
+ " </div>\n"
728
+ ],
729
+ "application/vnd.google.colaboratory.intrinsic+json": {
730
+ "type": "dataframe",
731
+ "summary": "{\n \"name\": \"display(df\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"header\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"6TME_2|Chains C, D|Protein RALF-like 4|Arabidopsis thaliana (3702)\",\n \"6TME_1|Chains A, B|Pollen-specific leucine-rich repeat extensin-like protein 1|Arabidopsis thaliana (3702)\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sequence\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"ARGRRYIGYDALKKNNVPCSRRGRSYYDCKKRRRNNPYRRGCSAITHCYR\",\n \"MELTDEEASFLTRRQLLALSENGDLPDDIEYEVDLDLKFANNRLKRAYIALQAWKKAFYSDPFNTAANWVGPDVCSYKGVFCAPALDDPSVLVVAGIDLNHADIFGYLPPELGLLTDVALFHVNSNRFCGVIPKSLSKLTLMYEFDVSNNRFVGPFPTVALSWPSLKFLDIRYNDFEGKLPPEIFDKDLDAIFLNNNRFESTIPETIGKSTASVVTFAHNKFSGCIPKTIGQMKNLNEIVFIGNNLSGCLPNEIGSLNNVTVFDASSNGFVGSLPSTLSGLANVEQMDFSYNKFTGFVTDNICKLPKLSNFTFSYNFFNGEAQSCVPGSSQEKQFDDTSNCLQNRPNQKSAKECLPVVSRPVDCSKDKCAGG\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
732
+ }
733
+ },
734
+ "metadata": {}
735
+ }
736
+ ]
737
+ },
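The Input Requirements section asks for standard single-letter amino acid codes, but the parsing cell above does not check for them. A minimal sketch of such a check follows, assuming the 20 canonical residues are the accepted alphabet. `STANDARD_AA` and `find_invalid_residues` are hypothetical names and not part of the notebook.

```python
# Hypothetical helper; not part of the PIPES-M notebook.
# Assumption: only the 20 canonical amino acids count as "standard".
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_residues(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based position, character) for every non-standard residue."""
    return [
        (position, residue)
        for position, residue in enumerate(sequence.upper(), start=1)
        if residue not in STANDARD_AA
    ]


if __name__ == "__main__":
    # A stop-codon asterisk is a common leftover in translated sequences.
    print(find_invalid_residues("MKTAYIAK*"))
```

A check like this could be applied to each parsed sequence before tokenization, so that records with ambiguity codes or stray symbols are reported rather than scored silently.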
738
+ {
739
+ "cell_type": "code",
740
+ "source": [
741
+ "# @title 4. Run Inference\n",
742
+ "\n",
743
+ "from torch.utils.data import DataLoader, TensorDataset\n",
744
+ "\n",
745
+ "print(\"Tokenizing sequences\")\n",
746
+ "sequences = df['sequence'].tolist()\n",
747
+ "encoded = tokenizer(\n",
748
+ "    sequences,\n",
749
+ "    padding=True,\n",
750
+ "    truncation=True,\n",
751
+ "    max_length=MAX_LEN,\n",
752
+ "    return_tensors=\"pt\"\n",
753
+ ")\n",
754
+ "\n",
755
+ "dataset = TensorDataset(encoded['input_ids'], encoded['attention_mask'])\n",
756
+ "dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False)\n",
757
+ "\n",
758
+ "all_probs = []\n",
759
+ "all_preds = []\n",
760
+ "\n",
761
+ "print(\"Running inference\")\n",
762
+ "with torch.no_grad():\n",
763
+ "    for i, batch in enumerate(dataloader):\n",
764
+ "        input_ids, attention_mask = [b.to(device) for b in batch]\n",
765
+ "        outputs = model(input_ids=input_ids, attention_mask=attention_mask)\n",
766
+ "        logits = outputs.logits\n",
767
+ "        probs = torch.softmax(logits, dim=1).cpu().numpy()\n",
768
+ "        preds = np.argmax(probs, axis=1)\n",
769
+ "        all_probs.extend(probs)\n",
770
+ "        all_preds.extend(preds)\n",
771
+ "\n",
772
+ "        if (i + 1) % 10 == 0 or (i + 1) == len(dataloader):\n",
773
+ "            processed = min((i + 1) * BATCH_SIZE, len(sequences))\n",
774
+ "            print(f\"Processed {processed} of {len(sequences)} sequences\")\n",
775
+ "\n",
776
+ "print(\"Inference completed\")"
777
+ ],
778
+ "metadata": {
779
+ "colab": {
780
+ "base_uri": "https://localhost:8080/"
781
+ },
782
+ "id": "nwHd1DRVUn_e",
783
+ "outputId": "96ebfb56-ae1c-4254-8476-c0814b924b13"
784
+ },
785
+ "execution_count": 17,
786
+ "outputs": [
787
+ {
788
+ "output_type": "stream",
789
+ "name": "stdout",
790
+ "text": [
791
+ "Tokenizing sequences\n",
792
+ "Running inference\n",
793
+ "Processed 2 of 2 sequences\n",
794
+ "Inference completed\n"
795
+ ]
796
+ }
797
+ ]
798
+ },
799
+ {
800
+ "cell_type": "code",
801
+ "source": [
802
+ "# @title 5. Results and Download\n",
803
+ "\n",
804
+ "confidence = [p[pred] for p, pred in zip(all_probs, all_preds)]\n",
805
+ "df['predicted_class_id'] = all_preds\n",
806
+ "df['confidence'] = confidence\n",
807
+ "df['prob_class_0'] = [p[0] for p in all_probs]\n",
808
+ "df['prob_class_1'] = [p[1] for p in all_probs]\n",
809
+ "\n",
810
+ "df['predicted_class'] = df['predicted_class_id'].map({\n",
811
+ "    0: \"Negative (Non-PI)\",\n",
812
+ "    1: \"Positive (Potential PI)\"\n",
813
+ "})\n",
814
+ "\n",
815
+ "display(HTML(\"<h3>Prediction Results (first 10 sequences)</h3>\"))\n",
816
+ "display(df[['header', 'predicted_class', 'confidence', 'prob_class_1']].head(10))\n",
817
+ "\n",
818
+ "print(\"\\nClass distribution\")\n",
819
+ "counts = df['predicted_class'].value_counts()\n",
820
+ "for label, count in counts.items():\n",
821
+ "    percentage = count / len(df) * 100\n",
822
+ "    print(f\"{label}: {count} sequences ({percentage:.1f}%)\")\n",
823
+ "\n",
824
+ "output_csv = \"PIPES-M_predictions.csv\"\n",
825
+ "df.to_csv(output_csv, index=False)\n",
826
+ "\n",
827
+ "if mount_drive:\n",
828
+ "    drive_path = \"/content/drive/MyDrive/PIPES-M_predictions.csv\"\n",
829
+ "    df.to_csv(drive_path, index=False)\n",
830
+ "    print(f\"\\nResults also saved to Google Drive: {drive_path}\")\n",
831
+ "\n",
832
+ "print(f\"\\nResults saved as {output_csv}\")\n",
833
+ "files.download(output_csv)"
834
+ ],
835
+ "metadata": {
836
+ "colab": {
837
+ "base_uri": "https://localhost:8080/",
838
+ "height": 278
839
+ },
840
+ "id": "A3fPg8TaUu2k",
841
+ "outputId": "bdd02de6-60a6-4236-d09b-e7af9319fc8e"
842
+ },
843
+ "execution_count": 18,
844
+ "outputs": [
845
+ {
846
+ "output_type": "display_data",
847
+ "data": {
848
+ "text/plain": [
849
+ "<IPython.core.display.HTML object>"
850
+ ],
851
+ "text/html": [
852
+ "<h3>Prediction Results (first 10 sequences)</h3>"
853
+ ]
854
+ },
855
+ "metadata": {}
856
+ },
857
+ {
858
+ "output_type": "display_data",
859
+ "data": {
860
+ "text/plain": [
861
+ "                                              header          predicted_class  \\\n",
862
+ "0  6TME_1|Chains A, B|Pollen-specific leucine-ric...  Positive (Potential PI)   \n",
863
+ "1  6TME_2|Chains C, D|Protein RALF-like 4|Arabido...  Positive (Potential PI)   \n",
864
+ "\n",
865
+ "   confidence  prob_class_1  \n",
866
+ "0    0.947041      0.947041  \n",
867
+ "1    0.965963      0.965963  "
868
+ ],
869
+ "text/html": [
870
+ "\n",
871
+ " <div id=\"df-10af0c1e-3834-4264-8a23-78bb419eb305\" class=\"colab-df-container\">\n",
872
+ " <div>\n",
873
+ "<style scoped>\n",
874
+ " .dataframe tbody tr th:only-of-type {\n",
875
+ " vertical-align: middle;\n",
876
+ " }\n",
877
+ "\n",
878
+ " .dataframe tbody tr th {\n",
879
+ " vertical-align: top;\n",
880
+ " }\n",
881
+ "\n",
882
+ " .dataframe thead th {\n",
883
+ " text-align: right;\n",
884
+ " }\n",
885
+ "</style>\n",
886
+ "<table border=\"1\" class=\"dataframe\">\n",
887
+ " <thead>\n",
888
+ " <tr style=\"text-align: right;\">\n",
889
+ " <th></th>\n",
890
+ " <th>header</th>\n",
891
+ " <th>predicted_class</th>\n",
892
+ " <th>confidence</th>\n",
893
+ " <th>prob_class_1</th>\n",
894
+ " </tr>\n",
895
+ " </thead>\n",
896
+ " <tbody>\n",
897
+ " <tr>\n",
898
+ " <th>0</th>\n",
899
+ " <td>6TME_1|Chains A, B|Pollen-specific leucine-ric...</td>\n",
900
+ " <td>Positive (Potential PI)</td>\n",
901
+ " <td>0.947041</td>\n",
902
+ " <td>0.947041</td>\n",
903
+ " </tr>\n",
904
+ " <tr>\n",
905
+ " <th>1</th>\n",
906
+ " <td>6TME_2|Chains C, D|Protein RALF-like 4|Arabido...</td>\n",
907
+ " <td>Positive (Potential PI)</td>\n",
908
+ " <td>0.965963</td>\n",
909
+ " <td>0.965963</td>\n",
910
+ " </tr>\n",
911
+ " </tbody>\n",
912
+ "</table>\n",
913
+ "</div>\n",
1122
+ " </div>\n"
1123
+ ],
1124
+ "application/vnd.google.colaboratory.intrinsic+json": {
1125
+ "type": "dataframe",
1126
+ "summary": "{\n \"name\": \"files\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"header\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"6TME_2|Chains C, D|Protein RALF-like 4|Arabidopsis thaliana (3702)\",\n \"6TME_1|Chains A, B|Pollen-specific leucine-rich repeat extensin-like protein 1|Arabidopsis thaliana (3702)\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"predicted_class\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"Positive (Potential PI)\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"confidence\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 2,\n \"samples\": [\n 0.9659631848335266\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"prob_class_1\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 2,\n \"samples\": [\n 0.9659631848335266\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
1127
+ }
1128
+ },
1129
+ "metadata": {}
1130
+ },
1131
+ {
1132
+ "output_type": "stream",
1133
+ "name": "stdout",
1134
+ "text": [
1135
+ "\n",
1136
+ "Class distribution\n",
1137
+ "Positive (Potential PI): 2 sequences (100.0%)\n",
1138
+ "\n",
1139
+ "Results also saved to Google Drive: /content/drive/MyDrive/PIPES-M_predictions.csv\n",
1140
+ "\n",
1141
+ "Results saved as PIPES-M_predictions.csv\n"
1142
+ ]
1143
+ }
1208
+ ]
1209
+ }
1210
+ ]
1211
+ }