{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Explore AI4LifeScience Tools in OpenBioMed\n", "\n", "OpenBioMed implements a suite of AIs tools for accelerating life science research including:\n", "- molecular property prediction\n", "- molecule editing\n", "- text-based denovo molecule generation\n", "- protein function prediction\n", "- protein folding\n", "- denovo protein generation\n", "- protein mutation explanation & engineering\n", "- protein-molecule docking\n", "- structure-based drug design\n", "\n", "Feel free to [download](https://cloud.tsinghua.edu.cn/d/5d08f4bc502848dc83bd/) our trained models, put them under `checkpoints/server`, and explore their applications with your own data!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/AIRvePFS/dair/luoyz-data/projects/OpenBioMed/OpenBioMed_arch\n" ] } ], "source": [ "# Change working directory\n", "import os\n", "import sys\n", "parent = os.path.dirname(os.path.abspath(''))\n", "print(parent)\n", "sys.path.append(parent)\n", "os.chdir(parent)\n", "\n", "import logging\n", "logging.basicConfig(level=logging.ERROR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In OpenBioMed, we provide a unified interface for deploying ML-based models and performing prediction through `InferencePipeline`. To construct a pipeline, you just need to configure the task, model, path to the trained checkpoint, and which device to deploy the model. \n", "\n", "You can use the pipeline.print_usage() function to identify the inputs and the outputs of the model. To construct appropriate inputs for molecule and protein inputs of the model, please refer to [manipulating_molecules](./manipulating_molecules.ipynb). \n", "\n", "Then, you can pass the inputs to pipeline.run() method to perform prediction. It accepts either single input or multiple inputs. The return value is a tuple, where the first element is a list of the original model outputs, and the second element is a list of metadata for building workflows (which you can simply ignore).\n", "\n", "Here we provide examples on two tasks. You can also modify model inputs [here](../open_biomed/scripts/inference.py) and run `python open_biomed/scripts/inference.py --task [TASK_NAME]` to test any task you are interested in.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Molecular property prediction.\n", "Inputs: {\"molecule\": a small molecule}\n", "Outputs: A float number in [0, 1] indicating the likeness of the molecule to exhibit certain properties.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Inference Steps: 100%|██████████| 1/1 [00:00<00:00, 174.02it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[[0.582], [0.8478]]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "from open_biomed.core.pipeline import InferencePipeline\n", "from open_biomed.data.molecule import Molecule\n", "\n", "# Predict if a molecule can penetrate the blood-brain barrier (https://arxiv.org/abs/1703.00564) with a fine-tuned GraphMVP (https://arxiv.org/abs/2110.07728) model\n", "pipeline = InferencePipeline(\n", " task=\"molecule_property_prediction\",\n", " model=\"graphmvp\",\n", " model_ckpt=\"./checkpoints/demo/graphmvp-BBBP.ckpt\",\n", " additional_config=\"./configs/dataset/bbbp.yaml\",\n", " device=\"cpu\"\n", ")\n", "print(pipeline.print_usage())\n", "\n", "# Construct molecules via SMILES strings\n", "molecule1 = Molecule.from_smiles(\"Nc1[nH]c(C(=O)c2ccccc2)c(-c2ccccn2)c1C(=O)c1c[nH]c2ccc(Br)cc12\")\n", "molecule2 = Molecule.from_smiles(\"CN1CCC[C@H]1COC2=NC3=C(CCN(C3)C4=CC=CC5=C4C(=CC=C5)Cl)C(=N2)N6CCN([C@H](C6)CC#N)C(=O)C(=C)F\")\n", "\n", "# The tool can handle multiple inputs simutaneously\n", "outputs = pipeline.run(\n", " molecule=[molecule1, molecule2]\n", ")[0]\n", "print(outputs)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of EsmForProteinFolding were not initialized from the model checkpoint at /AIRvePFS/dair/users/ailin/.cache/huggingface/hub/esmfold_v1 and are newly initialized: ['esm.contact_head.regression.bias', 'esm.contact_head.regression.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Protein folding prediction.\n", "Inputs: {\"protein\": a protein sequence}\n", "Outputs: A protein object with 3D structure available.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Inference Steps: 100%|██████████| 1/1 [00:05<00:00, 5.26s/it]\n" ] }, { "data": { "text/plain": [ "'./tmp/folded_protein.pdb'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from open_biomed.core.pipeline import InferencePipeline\n", "from open_biomed.data.protein import Protein\n", "\n", "# Predict the 3D structure of the protein based on its amino acid sequence using EsmFold (https://www.science.org/doi/10.1126/science.ade2574)\n", "# REMARK: It is recommended to use a GPU with at least 16GB memory to speed up inference. If you don't have a NVIDIA GPU, change the `device` argument to `cpu`.\n", "pipeline = InferencePipeline(\n", " task=\"protein_folding\",\n", " model=\"esmfold\",\n", " model_ckpt=\"./checkpoints/demo/esmfold.ckpt\",\n", " device=\"cuda:0\" \n", ")\n", "print(pipeline.print_usage())\n", "\n", "# Initialize a protein with an amino acid sequence\n", "protein = Protein.from_fasta(\"MASDAAAEPSSGVTHPPRYVIGYALAPKKQQSFIQPSLVAQAASRGMDLVPVDASQPLAEQGPFHLLIHALYGDDWRAQLVAFAARHPAVPIVDPPHAIDRLHNRISMLQVVSELDHAADQDSTFGIPSQVVVYDAAALADFGLLAALRFPLIAKPLVADGTAKSHKMSLVYHREGLGKLRPPLVLQEFVNHGGVIFKVYVVGGHVTCVKRRSLPDVSPEDDASAQGSVSFSQVSNLPTERTAEEYYGEKSLEDAVVPPAAFINQIAGGLRRALGLQLFNFDMIRDVRAGDRYLVIDINYFPGYAKMPGYETVLTDFFWEMVHKDGVGNQQEEKGANHVVVK\")\n", "outputs = pipeline.run(\n", " protein=protein,\n", ")\n", "# The output is still a Protein object, but its 3D backbone coordinates are available\n", "# You can find the pdb file or use our [visualization tools](./visualization.ipynb) to inspect the structure\n", "outputs[0][0].save_pdb(\"./tmp/folded_protein.pdb\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.7 ('biomed')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "2b5492c31ef84abdc69aadb95e4c210f44c226a5800d1d766b22f7a50017392c" } } }, "nbformat": 4, "nbformat_minor": 2 }