{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explore AI4LifeScience Tools in OpenBioMed\n",
    "\n",
    "OpenBioMed implements a suite of AIs tools for accelerating life science research including:\n",
    "- molecular property prediction\n",
    "- molecule editing\n",
    "- text-based denovo molecule generation\n",
    "- protein function prediction\n",
    "- protein folding\n",
    "- denovo protein generation\n",
    "- protein mutation explanation & engineering\n",
    "- protein-molecule docking\n",
    "- structure-based drug design\n",
    "\n",
    "Feel free to [download](https://cloud.tsinghua.edu.cn/d/5d08f4bc502848dc83bd/) our trained models, put them under `checkpoints/server`, and explore their applications with your own data!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/AIRvePFS/dair/luoyz-data/projects/OpenBioMed/OpenBioMed_arch\n"
     ]
    }
   ],
   "source": [
    "# Change working directory\n",
    "import os\n",
    "import sys\n",
    "parent = os.path.dirname(os.path.abspath(''))\n",
    "print(parent)\n",
    "sys.path.append(parent)\n",
    "os.chdir(parent)\n",
    "\n",
    "import logging\n",
    "logging.basicConfig(level=logging.ERROR)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In OpenBioMed, we provide a unified interface for deploying ML-based models and performing prediction through `InferencePipeline`. To construct a pipeline, you just need to configure the task, model, path to the trained checkpoint, and which device to deploy the model. \n",
    "\n",
    "You can use the pipeline.print_usage() function to identify the inputs and the outputs of the model. To construct appropriate inputs for molecule and protein inputs of the model, please refer to [manipulating_molecules](./manipulating_molecules.ipynb). \n",
    "\n",
    "Then, you can pass the inputs to pipeline.run() method to perform prediction. It accepts either single input or multiple inputs. The return value is a tuple, where the first element is a list of the original model outputs, and the second element is a list of metadata for building workflows (which you can simply ignore).\n",
    "\n",
    "Here we provide examples on two tasks. You can also modify model inputs [here](../open_biomed/scripts/inference.py) and run `python open_biomed/scripts/inference.py --task [TASK_NAME]` to test any task you are interested in.\n"
   ]
  },
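  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The return convention described above can be sketched without loading any model. This is a minimal illustration, not OpenBioMed code: `fake_run` is a hypothetical stand-in that only mimics the `(outputs, metadata)` tuple, the score `0.5` is a placeholder rather than a prediction, and one metadata entry per input is an assumption.\n",
    "\n",
    "```python\n",
    "from typing import Any, List, Sequence, Tuple, Union\n",
    "\n",
    "\n",
    "def fake_run(molecule: Union[str, Sequence[str]]) -> Tuple[List[List[float]], List[Any]]:\n",
    "    # Hypothetical stand-in for pipeline.run(); it mimics only the return shape\n",
    "    # Accept either a single input or a list of inputs\n",
    "    inputs = [molecule] if isinstance(molecule, str) else list(molecule)\n",
    "    outputs = [[0.5] for _ in inputs]  # placeholder scores, one list per input\n",
    "    metadata = [None for _ in inputs]  # workflow metadata (assumed one per input)\n",
    "    return outputs, metadata\n",
    "\n",
    "\n",
    "# Unpack the tuple instead of indexing the result with [0]\n",
    "outputs, _ = fake_run(['CCO', 'c1ccccc1'])\n",
    "scores = [row[0] for row in outputs]\n",
    "print(scores)\n",
    "```\n",
    "\n",
    "Unpacking the tuple this way is equivalent to the `pipeline.run(...)[0]` indexing used in the cells below."
   ]
  },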
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Molecular property prediction.\n",
      "Inputs: {\"molecule\": a small molecule}\n",
      "Outputs: A float number in [0, 1] indicating the likeness of the molecule to exhibit certain properties.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Inference Steps: 100%|██████████| 1/1 [00:00<00:00, 174.02it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.582], [0.8478]]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from open_biomed.core.pipeline import InferencePipeline\n",
    "from open_biomed.data.molecule import Molecule\n",
    "\n",
    "# Predict if a molecule can penetrate the blood-brain barrier (https://arxiv.org/abs/1703.00564) with a fine-tuned GraphMVP (https://arxiv.org/abs/2110.07728) model\n",
    "pipeline = InferencePipeline(\n",
    "    task=\"molecule_property_prediction\",\n",
    "    model=\"graphmvp\",\n",
    "    model_ckpt=\"./checkpoints/demo/graphmvp-BBBP.ckpt\",\n",
    "    additional_config=\"./configs/dataset/bbbp.yaml\",\n",
    "    device=\"cpu\"\n",
    ")\n",
    "print(pipeline.print_usage())\n",
    "\n",
    "# Construct molecules via SMILES strings\n",
    "molecule1 = Molecule.from_smiles(\"Nc1[nH]c(C(=O)c2ccccc2)c(-c2ccccn2)c1C(=O)c1c[nH]c2ccc(Br)cc12\")\n",
    "molecule2 = Molecule.from_smiles(\"CN1CCC[C@H]1COC2=NC3=C(CCN(C3)C4=CC=CC5=C4C(=CC=C5)Cl)C(=N2)N6CCN([C@H](C6)CC#N)C(=O)C(=C)F\")\n",
    "\n",
    "# The tool can handle multiple inputs simutaneously\n",
    "outputs = pipeline.run(\n",
    "    molecule=[molecule1, molecule2]\n",
    ")[0]\n",
    "print(outputs)"
   ]
  },
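  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The scores printed above are nested, one list per molecule. As a rough sketch of post-processing, the snippet below flattens them and applies a decision threshold. It assumes that a higher score means the molecule is more likely to penetrate the blood-brain barrier; the `0.5` cutoff and the label strings are illustrative assumptions, not values defined by OpenBioMed. The scores are copied from the output of the previous cell.\n",
    "\n",
    "```python\n",
    "# Scores copied from the output of the previous cell\n",
    "outputs = [[0.582], [0.8478]]\n",
    "\n",
    "THRESHOLD = 0.5  # assumed cutoff, choose one that suits your application\n",
    "\n",
    "# Flatten the per-molecule lists and attach a coarse label to each score\n",
    "scores = [row[0] for row in outputs]\n",
    "labels = ['likely penetrates BBB' if s >= THRESHOLD else 'likely does not penetrate BBB' for s in scores]\n",
    "\n",
    "for score, label in zip(scores, labels):\n",
    "    print(f'{score:.3f} -> {label}')\n",
    "```\n",
    "\n",
    "A fixed cutoff is only a starting point; in practice it may be worth calibrating it against molecules with known labels."
   ]
  },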
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of EsmForProteinFolding were not initialized from the model checkpoint at /AIRvePFS/dair/users/ailin/.cache/huggingface/hub/esmfold_v1 and are newly initialized: ['esm.contact_head.regression.bias', 'esm.contact_head.regression.weight']\n",
      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Protein folding prediction.\n",
      "Inputs: {\"protein\": a protein sequence}\n",
      "Outputs: A protein object with 3D structure available.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Inference Steps: 100%|██████████| 1/1 [00:05<00:00,  5.26s/it]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'./tmp/folded_protein.pdb'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from open_biomed.core.pipeline import InferencePipeline\n",
    "from open_biomed.data.protein import Protein\n",
    "\n",
    "# Predict the 3D structure of the protein based on its amino acid sequence using EsmFold (https://www.science.org/doi/10.1126/science.ade2574)\n",
    "# REMARK: It is recommended to use a GPU with at least 16GB memory to speed up inference. If you don't have a NVIDIA GPU, change the `device` argument to `cpu`.\n",
    "pipeline = InferencePipeline(\n",
    "    task=\"protein_folding\",\n",
    "    model=\"esmfold\",\n",
    "    model_ckpt=\"./checkpoints/demo/esmfold.ckpt\",\n",
    "    device=\"cuda:0\"            \n",
    ")\n",
    "print(pipeline.print_usage())\n",
    "\n",
    "# Initialize a protein with an amino acid sequence\n",
    "protein = Protein.from_fasta(\"MASDAAAEPSSGVTHPPRYVIGYALAPKKQQSFIQPSLVAQAASRGMDLVPVDASQPLAEQGPFHLLIHALYGDDWRAQLVAFAARHPAVPIVDPPHAIDRLHNRISMLQVVSELDHAADQDSTFGIPSQVVVYDAAALADFGLLAALRFPLIAKPLVADGTAKSHKMSLVYHREGLGKLRPPLVLQEFVNHGGVIFKVYVVGGHVTCVKRRSLPDVSPEDDASAQGSVSFSQVSNLPTERTAEEYYGEKSLEDAVVPPAAFINQIAGGLRRALGLQLFNFDMIRDVRAGDRYLVIDINYFPGYAKMPGYETVLTDFFWEMVHKDGVGNQQEEKGANHVVVK\")\n",
    "outputs = pipeline.run(\n",
    "    protein=protein,\n",
    ")\n",
    "# The output is still a Protein object, but its 3D backbone coordinates are available\n",
    "# You can find the pdb file or use our [visualization tools](./visualization.ipynb) to inspect the structure\n",
    "outputs[0][0].save_pdb(\"./tmp/folded_protein.pdb\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.7 ('biomed')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "2b5492c31ef84abdc69aadb95e4c210f44c226a5800d1d766b22f7a50017392c"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}