{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5077734e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Configuration - Update these paths for your environment\n",
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "# Create and navigate to kegg_data directory\n",
    "data_dir = Path('kegg_data')\n",
    "data_dir.mkdir(exist_ok=True)\n",
    "os.chdir(data_dir)\n",
    "\n",
    "# Configuration parameters\n",
    "CONFIG = {\n",
    "    # Output directories\n",
    "    'network_dir': 'kegg_network',\n",
    "    'variant_network_dir': 'network_variant', \n",
    "    'variant_info_dir': 'variant_info',\n",
    "    \n",
    "    # Reference data paths (update these to point to your reference files)\n",
    "    'cosmic_fusion_data': 'data/Cosmic_Fusion_v101_GRCh38.tsv',  # Update path as needed\n",
    "    'reference_genome': 'data/GRCh38_genomic.fna',  # Update path as needed\n",
    "    \n",
    "    # Processing parameters\n",
    "    'num_threads': 4,  # Adjust based on your system\n",
    "    'batch_size': 1000\n",
    "}\n",
    "\n",
    "# Create required directories\n",
    "for dir_name in [CONFIG['network_dir'], CONFIG['variant_network_dir'], CONFIG['variant_info_dir']]:\n",
    "    Path(dir_name).mkdir(exist_ok=True)\n",
    "\n",
    "print(f\"Working directory: {os.getcwd()}\")\n",
    "print(\"Configuration loaded. Directory structure created.\")\n",
    "print(\"\\n📝 Update CONFIG dictionary above with your actual file paths for reference data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b77a0f2c",
   "metadata": {},
   "source": [
    "# KEGG Data Processing Pipeline - Part 1: Data Retrieval and Network Analysis\n",
    "\n",
    "## Overview\n",
    "\n",
    "This notebook is the first part of a comprehensive KEGG (Kyoto Encyclopedia of Genes and Genomes) data processing pipeline for genetic variant analysis. It focuses on downloading and processing KEGG network data, disease associations, and variant information.\n",
    "\n",
    "## What This Notebook Does\n",
    "\n",
    "1. **KEGG Data Retrieval**: Downloads disease lists, network data, and pathway information from KEGG REST API\n",
    "2. **Network Analysis**: Processes KEGG network files to identify reference vs disease networks\n",
    "3. **Variant Extraction**: Identifies and extracts genetic variants from network data\n",
    "4. **Data Filtering**: Cleans and filters variant information for downstream analysis\n",
    "5. **Reference Data**: Processes genomic reference sequences and chromosome data\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "- Python 3.7+ with required packages (see requirements below)\n",
    "- `kegg_pull` package for KEGG data retrieval\n",
    "- `seqkit` for sequence processing\n",
    "- Internet connection for KEGG API access\n",
    "- Sufficient storage space (several GB for full dataset)\n",
    "\n",
    "## Required Packages\n",
    "\n",
    "```bash\n",
    "pip install kegg-pull biopython pandas\n",
    "```\n",
    "\n",
    "## Directory Structure\n",
    "\n",
    "This notebook expects and creates the following structure:\n",
    "```\n",
    "kegg_data/\n",
    "├── kegg_diseases.txt\n",
    "├── network_pathway.tsv\n",
    "├── network_disease.tsv\n",
    "├── kegg_network/\n",
    "├── network_variant/\n",
    "├── variant_info/\n",
    "└── output files...\n",
    "```\n",
    "\n",
    "## Important Notes\n",
    "\n",
    "- **Processing Time**: Full dataset processing can take several hours\n",
    "- **Storage Requirements**: ~5-10GB of storage needed for complete dataset\n",
    "- **API Limits**: KEGG REST API has rate limits; process may need pausing\n",
    "- **Network Access**: Requires stable internet connection for data downloads\n",
    "\n",
    "## Next Steps\n",
    "\n",
    "After completing this notebook:\n",
    "1. Run `KEGG_Data_2.ipynb` for variant information parsing\n",
    "2. Run `KEGG_Data_3.ipynb` for final dataset creation with sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configuration\n",
    "\n",
    "Set up paths and parameters for the data processing pipeline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4297e63d-0309-45c4-920b-7a5cc1f42771",
   "metadata": {},
   "outputs": [],
   "source": [
    "cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d48693e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "curl -s \"https://rest.kegg.jp/list/disease\" > kegg_diseases.txt"
   ]
  },
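  {
   "cell_type": "markdown",
   "id": "fetch-retry-sketch",
   "metadata": {},
   "source": [
    "The download above can also be done from Python with a pause-and-retry loop, which helps when the KEGG REST API throttles requests. The following is a minimal sketch and not part of the original pipeline: `http_get` and `fetch_with_retry` are hypothetical helper names, and the `n_tries=3` and `sleep_time=5.0` defaults simply mirror the defaults that `kegg_pull` documents for `--n-tries` and `--sleep-time`.\n",
    "\n",
    "```python\n",
    "import time\n",
    "import urllib.request\n",
    "\n",
    "\n",
    "def http_get(url, timeout=60):\n",
    "    '''Fetch a URL and return the response body as text.'''\n",
    "    with urllib.request.urlopen(url, timeout=timeout) as response:\n",
    "        return response.read().decode('utf-8')\n",
    "\n",
    "\n",
    "def fetch_with_retry(url, fetch=http_get, n_tries=3, sleep_time=5.0):\n",
    "    '''Call fetch(url), sleeping and retrying when a network error occurs.'''\n",
    "    last_error = None\n",
    "    for attempt in range(1, n_tries + 1):\n",
    "        try:\n",
    "            return fetch(url)\n",
    "        except OSError as error:  # URLError, HTTPError and timeouts are OSError subclasses\n",
    "            last_error = error\n",
    "            if attempt < n_tries:\n",
    "                time.sleep(sleep_time)  # pause before the next attempt\n",
    "    raise RuntimeError(f'Giving up on {url} after {n_tries} attempts') from last_error\n",
    "```\n",
    "\n",
    "For example, `fetch_with_retry('https://rest.kegg.jp/list/disease')` would return the same text that the `curl` command above writes to `kegg_diseases.txt`. The `fetch` argument is injectable so the retry logic can be exercised without network access."
   ]
  },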
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6e489c3f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "KEGG_data.ipynb\t\tclassify.py\t\tmodel.py\n",
      "LICENSE\t\t\tdataset.py\t\tmodel_decoder.py\n",
      "README.md\t\tdna_classifier.py\tplayground.ipynb\n",
      "baseline.py\t\tfinetune.py\t\trequirements.txt\n",
      "baseline_model.py\tkegg_diseases.txt\n"
     ]
    }
   ],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "0cfda653",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    1593\n"
     ]
    }
   ],
   "source": [
    "curl -s \"https://rest.kegg.jp/list/network\" | wc -l"
   ]
  },
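  {
   "cell_type": "markdown",
   "id": "parse-kegg-list-sketch",
   "metadata": {},
   "source": [
    "The `list` operation returns plain text with one `ID<TAB>name` line per entry, which is why `wc -l` above gives the number of network entries. As a rough sketch (the helper name `parse_kegg_list` is hypothetical and not part of the original pipeline), such output could be loaded into a dictionary as follows; the sample lines are the first two network entries listed later in this notebook.\n",
    "\n",
    "```python\n",
    "def parse_kegg_list(text):\n",
    "    '''Map entry IDs to names from KEGG list output (one ID<TAB>name per line).'''\n",
    "    entries = {}\n",
    "    for line in text.splitlines():\n",
    "        if not line.strip():\n",
    "            continue  # skip blank lines\n",
    "        entry_id, _, name = line.partition('\\t')\n",
    "        entries[entry_id] = name\n",
    "    return entries\n",
    "\n",
    "\n",
    "sample = (\n",
    "    'N00001\\tEGF-EGFR-RAS-ERK signaling pathway\\n'\n",
    "    'N00002\\tBCR-ABL fusion kinase to RAS-ERK signaling pathway\\n'\n",
    ")\n",
    "networks = parse_kegg_list(sample)\n",
    "print(len(networks), networks['N00001'])\n",
    "```\n",
    "\n",
    "A dictionary keeps lookups by network ID simple; a pandas DataFrame would work equally well if the table is needed downstream."
   ]
  },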
  {
   "cell_type": "markdown",
   "id": "4b2c1ed0-90a5-4005-bdb0-41be43070a8b",
   "metadata": {},
   "source": [
    "Use kegg_pull for retrieving KEGG data https://github.com/MoseleyBioinformaticsLab/kegg_pull"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "998a046f-604c-4378-8563-3df7de0f85c3",
   "metadata": {},
   "source": [
    "```python3 -m pip install kegg-pull```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "65894de2-27f1-46c1-9eab-b54c7630fe86",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.1.0\n"
     ]
    }
   ],
   "source": [
    "kegg_pull -v"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "63e1de4e-ee4a-4cba-aabe-9a801735e643",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Usage:\n",
      "    kegg_pull -h | --help           Show this help message.\n",
      "    kegg_pull -v | --version        Displays the package version.\n",
      "    kegg_pull --full-help           Show the help message of all sub commands.\n",
      "    kegg_pull pull ...              Pull, separate, and store an arbitrary number of KEGG entries to the local file system.\n",
      "    kegg_pull entry-ids ...         Obtain a list of KEGG entry IDs.\n",
      "    kegg_pull map ...               Obtain a mapping of entry IDs (KEGG or outside databases) to the IDs of related entries.\n",
      "    kegg_pull pathway-organizer ... Creates a flattened version of a pathways Brite hierarchy.\n",
      "    kegg_pull rest ...              Executes one of the KEGG REST API operations.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\n",
      "Usage:\n",
      "    kegg_pull pull -h | --help\n",
      "    kegg_pull pull database <database> [--force-single-entry] [--multi-process] [--n-workers=<n-workers>] [--output=<output>] [--print] [--sep=<print-separator>] [--entry-field=<entry-field>] [--n-tries=<n-tries>] [--time-out=<time-out>] [--sleep-time=<sleep-time>] [--ut=<unsuccessful-threshold>]\n",
      "    kegg_pull pull entry-ids <entry-ids> [--force-single-entry] [--multi-process] [--n-workers=<n-workers>] [--output=<output>] [--print] [--sep=<print-separator>] [--entry-field=<entry-field>] [--n-tries=<n-tries>] [--time-out=<time-out>] [--sleep-time=<sleep-time>] [--ut=<unsuccessful-threshold>]\n",
      "\n",
      "Options:\n",
      "    -h --help                       Show this help message.\n",
      "    database                        Pulls all the entries in a KEGG database.\n",
      "    <database>                      The KEGG database from which to pull entries.\n",
      "    --force-single-entry            Forces pulling only one entry at a time for every request to the KEGG web API. This flag is automatically set if <database> is \"brite\".\n",
      "    --multi-process                 If set, the entries are pulled across multiple processes to increase speed. Otherwise, the entries are pulled sequentially in a single process.\n",
      "    --n-workers=<n-workers>         The number of sub-processes to create when pulling. Defaults to the number of cores available. Ignored if --multi-process is not set.\n",
      "    --output=<output>               The directory where the pulled KEGG entries will be stored. Defaults to the current working directory. If ends in \".zip\", entries are saved to a ZIP archive instead of a directory. Ignored if --print is set.\n",
      "    --print                         If set, prints the entries to the screen rather than saving them to the file system. Separates entries by the --sep option if set.\n",
      "    --sep=<print-separator>         The string that separates the entries which are printed to the screen when the --print option is set. Ignored if the --print option is not set. Defaults to printing the entry id, followed by the entry, followed by a newline.\n",
      "    --entry-field=<entry-field>     Optional field to extract from the entries pulled rather than the standard flat file format (or \"htext\" in the case of brite entries).\n",
      "    --n-tries=<n-tries>             The number of times to attempt a KEGG request before marking it as timed out or failed. Defaults to 3.\n",
      "    --time-out=<time-out>           The number of seconds to wait for a KEGG request before marking it as timed out. Defaults to 60.\n",
      "    --sleep-time=<sleep-time>       The amount of time to wait after a KEGG request times out (or potentially blacklists with a 403 error code) before attempting it again. Defaults to 5.0.\n",
      "    --ut=<unsuccessful-threshold>   If set, the ratio of unsuccessful entry IDs (failed or timed out) to total entry IDs at which kegg_pull quits. Valid values are between 0.0 and 1.0 non-inclusive.\n",
      "    entry-ids                       Pulls entries specified by a comma separated list. Or from standard input: one entry ID per line; Press CTRL+D to finalize input or pipe (e.g. cat file.txt | kegg_pull pull entry-ids - ...).\n",
      "    <entry-ids>                     Comma separated list of entry IDs to pull (e.g. id1,id2,id3 etc.). Or if equal to \"-\", entry IDs are read from standard input. Will likely need to set --force-single-entry if any of the entries are from the brite database.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\n",
      "Usage:\n",
      "    kegg_pull entry-ids -h | --help\n",
      "    kegg_pull entry-ids database <database> [--output=<output>]\n",
      "    kegg_pull entry-ids keywords <database> <keywords> [--output=<output>]\n",
      "    kegg_pull entry-ids molec-attr <database> (--formula=<formula>|--em=<exact-mass>...|--mw=<molecular-weight>...) [--output=<output>]\n",
      "\n",
      "Options:\n",
      "    -h --help               Show this help message.\n",
      "    database                Pulls all the entry IDs within a given database.\n",
      "    <database>              The KEGG database from which to pull a list of entry IDs.\n",
      "    --output=<output>       Path to the file (either in a directory or ZIP archive) to store the output (1 entry ID per line). Prints to the console if not specified. If a ZIP archive, the file path must be in the form of /path/to/zip-archive.zip:/path/to/file (e.g. ./archive.zip:file.txt).\n",
      "    keywords                Searches for entries within a database based on provided keywords.\n",
      "    <keywords>              Comma separated list of keywords to search entries with (e.g. kw1,kw2,kw3 etc.). Or if equal to \"-\", keywords are read from standard input, one keyword per line; Press CTRL+D to finalize input or pipe (e.g. cat file.txt | kegg_pull rest find brite - ...).\n",
      "    molec-attr              Searches a database of molecule-type KEGG entries by molecular attributes.\n",
      "    --formula=<formula>     Sequence of atoms in a chemical formula format to search for (e.g. \"O5C7\" searches for molecule entries containing 5 oxygen atoms and/or 7 carbon atoms).\n",
      "    --em=<exact-mass>       Either a single number (e.g. \"--em=155.5\") or two numbers (e.g. \"--em=155.5 --em=244.4\"). If a single number, searches for molecule entries with an exact mass equal to that value rounded by the last decimal point. If two numbers, searches for molecule entries with an exact mass within the two values (a range).\n",
      "    --mw=<molecular-weight> Same as \"--em=<exact-mass>\" but searches based on the molecular weight.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\n",
      "Usage:\n",
      "    kegg_pull map -h | --help\n",
      "    kegg_pull map conv <kegg-database> <outside-database> [--reverse] [--output=<output>]\n",
      "    kegg_pull map link <source-database> <target-database> [--deduplicate] [--add-glycans] [--add-drugs] [--output=<output>]\n",
      "    kegg_pull map (link|conv) entry-ids <entry-ids> <target-database> [--reverse] [--output=<output>]\n",
      "    kegg_pull map link <source-database> <intermediate-database> <target-database> [--deduplicate] [--add-glycans] [--add-drugs] [--output=<output>]\n",
      "\n",
      "Options:\n",
      "    -h --help               Show this help message.\n",
      "    conv                    Converts the output of the KEGG \"conv\" operation into a JSON mapping.\n",
      "    <kegg-database>         The name of the KEGG database with entry IDs mapped to the outside database.\n",
      "    <outside-database>      The name of the outside database with entry IDs mapped from the KEGG database.\n",
      "    --reverse               Reverses the mapping with the target becoming the source and the source becoming the target.\n",
      "    --output=<output>       The location (either a directory or ZIP archive) of the JSON file to store the mapping. If not set, prints a JSON representation of the mapping to the console. If a ZIP archive, the file path must be in the form of /path/to/zip-archive.zip:/path/to/file (e.g. ./archive.zip:mapping.json).\n",
      "    link                    Converts the output of the KEGG \"link\" operation into a JSON mapping.\n",
      "    <source-database>       The name of the database with entry IDs mapped to the target database.\n",
      "    <target-database>       The name of the database with entry IDs mapped from the source database.\n",
      "    --deduplicate           Some mappings including pathway entry IDs result in half beginning with the normal \"path:map\" prefix but the other half with a different prefix. If set, removes the IDs corresponding to identical entries but with a different prefix. Raises an exception if neither the source nor the target database are \"pathway\".\n",
      "    --add-glycans           Whether to add the corresponding compound IDs of equivalent glycan entries. Logs a warning if neither the source nor the target database are \"compound\".\n",
      "    --add-drugs             Whether to add the corresponding compound IDs of equivalent drug entries. Logs a warning if neither the source nor the target database are \"compound\".\n",
      "    entry-ids               Create a mapping to a target database from a list of specific entry IDs.\n",
      "    <entry-ids>             Comma separated list of entry IDs (e.g. Id1,Id2,Id3 etc.). Or if equal to \"-\", entry IDs are read from standard input, one entry ID per line; Press CTRL+D to finalize input or pipe (e.g. cat file.txt | kegg_pull map entry-ids drug - ...).\n",
      "    <intermediate-database> The name of an intermediate KEGG database with which to find cross-references to cross-references e.g. \"kegg_pull map link ko reaction compound\" creates a mapping from ko-to-compound via ko-to-reaction cross-references connected to reaction-to-compound cross-references.\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\n",
      "Usage:\n",
      "    kegg_pull pathway-organizer [--tln=<top-level-nodes>] [--fn=<filter-nodes>] [--output=<output>]\n",
      "\n",
      "Options:\n",
      "    -h --help               Show this help message.\n",
      "    --tln=<top-level-nodes> Node names in the highest level of the hierarchy to select from. If not set, all top level nodes are traversed to create the mapping of node key to node info. Either a comma separated list (e.g. node1,node2,node3 etc.) or if equal to \"-\", read from standard input one node per line; Press CTRL+D to finalize input or pipe (e.g. cat nodes.txt | kegg_pull pathway-organizer --tln=- ...). If both \"--tln\" and \"--fn\" are set as \"-\", one of the lines must be the delimiter \"---\" without quotes in order to distinguish the input, with the top level nodes first and filter nodes second.\n",
      "    --fn=<filter-nodes>     Names (not keys) of nodes to exclude from the mapping of node key to node info. Neither these nodes nor any of their children will be included. If not set, no nodes will be excluded. Either a comma separated list (e.g. node1,node2,node3 etc.) or if equal to \"-\", read from standard input one node per line; Press CTRL+D to finalize input or pipe (e.g. cat nodes.txt | kegg_pull pathway-organizer --fn=- ...). If both \"--tln\" and \"--fn\" are set as \"-\", one of the lines must be the delimiter \"---\" without quotes in order to distinguish the input, with the top level nodes first and filter nodes second.\n",
      "    --output=<output>       The file to store the flattened Brite hierarchy as a JSON structure with node keys mapping to node info, either a JSON file or ZIP archive. Prints to the console if not set. If saving to a ZIP archive, the file path must be in the form of /path/to/zip-archive.zip:/path/to/file (e.g. ./archive.zip:mapping.json).\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\n",
      "Usage:\n",
      "    kegg_pull rest -h | --help\n",
      "    kegg_pull rest info <database> [--test] [--output=<output>]\n",
      "    kegg_pull rest list <database> [--test] [--output=<output>]\n",
      "    kegg_pull rest get <entry-ids> [--entry-field=<entry-field>] [--test] [--output=<output>]\n",
      "    kegg_pull rest find <database> <keywords> [--test] [--output=<output>]\n",
      "    kegg_pull rest find <database> (--formula=<formula>|--em=<exact-mass>...|--mw=<molecular-weight>...) [--test] [--output=<output>]\n",
      "    kegg_pull rest conv <kegg-database> <outside-database> [--test] [--output=<output>]\n",
      "    kegg_pull rest conv entry-ids <entry-ids> <target-database> [--test] [--output=<output>]\n",
      "    kegg_pull rest link <target-database> <source-database> [--test] [--output=<output>]\n",
      "    kegg_pull rest link entry-ids <entry-ids> <target-database> [--test] [--output=<output>]\n",
      "    kegg_pull rest ddi <drug-entry-ids> [--test] [--output=<output>]\n",
      "\n",
      "Options:\n",
      "    -h --help                   Show this help message.\n",
      "    info                        Executes the \"info\" KEGG API operation, pulling information about a KEGG database.\n",
      "    <database>                  The name of the database to pull information about or entry IDs from.\n",
      "    --test                      If set, test the request to ensure it works rather than sending it. Print True if the request would succeed and False if the request would fail. Ignores --output if this options is set along with --test.\n",
      "    --output=<output>           Path to the file (either in a directory or ZIP archive) to store the response body from the KEGG web API operation. Prints to the console if not specified. If a ZIP archive, the file path must be in the form of /path/to/zip-archive.zip:/path/to/file (e.g. ./archive.zip:file.txt).\n",
      "    list                        Executes the \"list\" KEGG API operation, pulling the entry IDs of the provided database.\n",
      "    get                         Executes the \"get\" KEGG API operation, pulling the entries of the provided entry IDs.\n",
      "    <entry-ids>                 Comma separated list of entry IDs (e.g. id1,id2,id3 etc.). Or if equal to \"-\", entry IDs are read from standard input, one entry ID per line; Press CTRL+D to finalize input or pipe (e.g. cat file.txt | kegg_pull rest get - ...).\n",
      "    --entry-field=<entry-field> Optional field to extract from an entry instead of the default entry info (i.e. flat file or htext in the case of brite entries).\n",
      "    find                        Executes the \"find\" KEGG API operation, finding entry IDs based on provided queries.\n",
      "    <keywords>                  Comma separated list of keywords to search entries with (e.g. kw1,kw2,kw3 etc.). Or if equal to \"-\", keywords are read from standard input, one keyword per line; Press CTRL+D to finalize input or pipe (e.g. cat file.txt | kegg_pull rest find brite - ...).\n",
      "    --formula=<formula>         Sequence of atoms in a chemical formula format to search for (e.g. \"O5C7\" searches for molecule entries containing 5 oxygen atoms and/or 7 carbon atoms).\n",
      "    --em=<exact-mass>           Either a single number (e.g. --em=155.5) or two numbers (e.g. --em=155.5 --em=244.4). If a single number, searches for molecule entries with an exact mass equal to that value rounded by the last decimal point. If two numbers, searches for molecule entries with an exact mass within the two values (a range).\n",
      "    --mw=<molecular-weight>     Same as --em but searches based on the molecular weight.\n",
      "    conv                        Executes the \"conv\" KEGG API operation, converting entry IDs from an outside database to those of a KEGG database and vice versa.\n",
      "    <kegg-database>             The name of the KEGG database from which to view equivalent outside database entry IDs.\n",
      "    <outside-database>          The name of the non-KEGG database from which to view equivalent KEGG database entry IDs.\n",
      "    entry-ids                   Perform the \"conv\" or \"link\" operation of the form that maps specific provided entry IDs to a target database.\n",
      "    link                        Executes the \"link\" KEGG API operation, showing the IDs of entries that are connected/related to entries of other databases.\n",
      "    <target-database>           The name of the database that the entry IDs of the source database or provided entry IDs are mapped to.\n",
      "    <source-database>           The name of the database from which cross-references are found in the target database.\n",
      "    ddi                         Executes the \"ddi\" KEGG API operation, searching for drug to drug interactions. Providing one entry ID reports all known interactions, while providing multiple checks if any drug pair in a given set of drugs is CI or P. If providing multiple, all entries must belong to the same database.\n",
      "    <drug-entry-ids>            Comma separated list of drug entry IDs from the following databases: drug, ndc, or yj (e.g. id1,id2,id3 etc.). Or if equal to \"-\", entry IDs are read from standard input, one entry ID per line; Press CTRL+D to finalize input or pipe (e.g. cat file.txt | kegg_pull rest ddi - ...).\n",
      "\n"
     ]
    }
   ],
   "source": [
    "kegg_pull --full-help"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "12e35258-92f8-4ece-9d18-177263d1e97c",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "N00001\tEGF-EGFR-RAS-ERK signaling pathway\n",
      "N00002\tBCR-ABL fusion kinase to RAS-ERK signaling pathway\n",
      "N00003\tMutation-activated KIT to RAS-ERK signaling pathway\n",
      "N00004\tDuplication or mutation-activated FLT3 to RAS-ERK signaling pathway\n",
      "N00005\tMutation-activated MET to RAS-ERK signaling pathway\n",
      "N00006\tAmplified EGFR to RAS-ERK signaling pathway\n",
      "N00007\tEML4-ALK fusion kinase to RAS-ERK signaling pathway\n",
      "N00008\tRET fusion kinase to RAS-ERK signaling pathway\n",
      "N00009\tTRK fusion kinase to RAS-ERK signaling pathway\n",
      "N00010\tMutation-inactivated PTCH1 to Hedgehog signaling pathway\n",
      "N00011\tMutation-activated FGFR3 to RAS-ERK signaling pathway\n",
      "N00012\tMutation-activated KRAS/NRAS to ERK signaling pathway\n",
      "N00013\tMutation-activated BRAF to ERK signaling pathway\n",
      "N00014\tMutation-activated EGFR to RAS-ERK signaling pathway\n",
      "N00015\tPDGF-PDGFR-RAS-ERK signaling pathway\n",
      "N00016\tPDGF-overexpression to RAS-ERK signaling pathway\n",
      "N00017\tMutation-activated SMO to Hedgehog signaling pathway\n",
      "N00018\tAmplified PDGFR to RAS-ERK signaling pathway\n",
      "N00019\tFGF-FGFR-RAS-ERK signaling pathway\n",
      "N00020\tAmplified FGFR to RAS-ERK signaling pathway\n",
      "N00021\tEGF-ERBB2-RAS-ERK signaling pathway\n",
      "N00022\tERBB2-overexpression to RAS-ERK signaling pathway\n",
      "N00023\tEGF-EGFR-PLCG-ERK signaling pathway\n",
      "N00024\tMutation-activated EGFR to PLCG-ERK signaling pathway\n",
      "N00025\tEML4-ALK fusion kinase to PLCG-ERK signaling pathway\n",
      "N00026\tEGF-EGFR-PLCG-CAMK signaling pathway\n",
      "N00027\tAmplified EGFR to PLCG-CAMK signaling pathway\n",
      "N00028\tPDGF-PDGFR-PLCG-CAMK signaling pathway\n",
      "N00029\tAmplified PDGFR to PLCG-CAMK signaling pathway\n",
      "N00030\tEGF-EGFR-RAS-PI3K signaling pathway\n",
      "N00031\tDuplication or mutation-activated FLT3 to RAS-PI3K signaling pathway\n",
      "N00032\tMutation-activated KRAS/NRAS to PI3K signaling pathway\n",
      "N00033\tEGF-EGFR-PI3K signaling pathway\n",
      "N00034\tERBB2-overexpression to PI3K signaling pathway\n",
      "N00035\tAmplified EGFR to PI3K signaling pathway\n",
      "N00036\tMutation-activated EGFR to PI3K signaling pathway\n",
      "N00037\tFGF-FGFR-PI3K signaling pathway\n",
      "N00038\tAmplified FGFR to PI3K signaling pathway\n",
      "N00039\tPDGF-PDGFR-PI3K signaling pathway\n",
      "N00040\tAmplified PDGFR to PI3K signaling pathway\n",
      "N00041\tEGFR-overexpression to RAS-ERK signaling pathway\n",
      "N00042\tEGFR-overexpression to PI3K signaling pathway\n",
      "N00043\tHGF-MET-PI3K signaling pathway\n",
      "N00044\tMutation-activated MET to PI3K signaling pathway\n",
      "N00045\tKITLG-KIT-PI3K signaling pathway\n",
      "N00046\tMutation-activated KIT to PI3K signaling pathway\n",
      "N00047\tEML4-ALK fusion kinase to PI3K signaling pathway\n",
      "N00048\tBCR-ABL fusion kinase to PI3K signaling pathway\n",
      "N00049\tMutation-activated PI3K to PI3K signaling pathway\n",
      "N00050\tAmplified PI3K to PI3K signaling pathway\n",
      "N00051\tDeleted PTEN to PI3K signaling pathway\n",
      "N00052\tMutation-inactivated PTEN to PI3K signaling pathway\n",
      "N00053\tCytokine-Jak-STAT signaling pathway\n",
      "N00054\tDuplication or mutation-activated FLT3 to Jak-STAT signaling pathway\n",
      "N00055\tBCR-ABL fusion kinase to Jak-STAT signaling pathway\n",
      "N00056\tWnt signaling pathway\n",
      "N00057\tMutation-inactivated APC to Wnt signaling pathway\n",
      "N00058\tMutation-activated CTNNB1 to Wnt signaling pathway\n",
      "N00059\tFZD7-overexpression to Wnt signaling pathway\n",
      "N00060\tLRP6-overexpression to Wnt signaling pathway\n",
      "N00061\tCDH1-reduced expression to beta-catenin signaling pathway\n",
      "N00062\tHedgehog signaling pathway\n",
      "N00063\tTGF-beta signaling pathway\n",
      "N00064\tMutation-inactivated TGFBR2 to TGF-beta signaling pathway\n",
      "N00065\tMutation-inactivated SMAD2 to TGF-beta signaling pathway\n",
      "N00066\tMDM2-p21-Cell cycle G1/S\n",
      "N00067\tDeleted p14(ARF) to p21-cell cycle G1/S\n",
      "N00068\tAmplified MDM2 to p21-cell cycle G1/S\n",
      "N00069\tp16-Cell cycle G1/S\n",
      "N00070\tMutation-inactivated p16(INK4a) to p16-cell cycle G1/S\n",
      "N00071\tDeleted p16(INK4a) to p16-cell cycle G1/S\n",
      "N00072\tAmplified CDK4 to cell cycle G1/S\n",
      "N00073\tMutation-activated CDK4 to cell cycle G1/S\n",
      "N00074\tLoss of RB1 to cell cycle G1/S\n",
      "N00075\tMutation-inactivated RB1 to cell cycle G1/S\n",
      "N00076\tMutation-inactivated p14(ARF) to p21-cell cycle G1/S\n",
      "N00077\tHRAS-overexpression to ERK signaling pathway\n",
      "N00078\tMutation-activated HRAS to ERK signaling pathway\n",
      "N00079\tHIF-1 signaling pathway\n",
      "N00080\tLoss of VHL to HIF-1 signaling pathway\n",
      "N00081\tMutation-inactivated VHL to HIF-1 signaling pathway\n",
      "N00082\tLoss of NKX3-1 to PI3K signaling pathway\n",
      "N00083\tAndrogen receptor signaling pathway\n",
      "N00084\tAmplified AR to androgen receptor signaling pathway\n",
      "N00085\tMutation-activated AR to androgen receptor signaling pathway\n",
      "N00086\tNotch signaling pathway\n",
      "N00087\tNOTCH-overexpression to Notch signaling pathway\n",
      "N00088\tAmplified MYC to p15-cell cycle G1/S\n",
      "N00089\tAmplified MYC to cell cycle G1/S\n",
      "N00090\tp15-Cell cycle G1/S\n",
      "N00091\tp27-Cell cycle G1/S\n",
      "N00092\tAmplified MYC to p27-cell cycle G1/S\n",
      "N00093\tLoss of CDKN1B to p27-cell cycle G1/S\n",
      "N00094\tEGF-Jak-STAT signaling pathway\n",
      "N00095\tERBB2-overexpression to EGF-Jak-STAT signaling pathway\n",
      "N00096\tEGF-EGFR-RAS-RASSF1 signaling pathway\n",
      "N00097\tLoss of RASSF1 to RAS-RASSF1 signaling pathway\n",
      "N00098\tIntrinsic apoptotic pathway\n",
      "N00099\tMutation-inactivated BAX to apoptotic pathway\n",
      "N00100\tBCL2-overexpression to intrinsic apoptotic pathway\n",
      "N00101\tDCC-apoptotic pathway\n",
      "N00102\tLoss of DCC to DCC-apoptotic pathway\n",
      "N00103\tEGF-EGFR-RAS-RalGDS signaling pathway\n",
      "N00104\tMutation-activated KRAS to RalGDS signaling pathway\n",
      "N00105\tEML4-ALK fusion kinase to Jak-STAT signaling pathway\n",
      "N00106\tAML1-EVI1 fusion to TGF-beta signaling pathway\n",
      "N00107\tEVI-1 overexpression to TGF-beta signaling pathway\n",
      "N00108\tAML1-ETO fusion to transcriptional activtion\n",
      "N00109\tPML-RARA fusion to transcriptional activtion\n",
      "N00110\tPLZF-RARA fusion to transcriptional activtion\n",
      "N00111\tAML1-ETO fusion to CEBPA-mediated transcription\n",
      "N00112\tAML1-ETO fusion to PU.1-mediated transcription\n",
      "N00113\tPML-RARA fusion to transcriptional repression\n",
      "N00114\tPLZF-RARA fusion to transcriptional repression\n",
      "N00115\tMutation-inactivated TP53 to transcription\n",
      "N00116\tMutation-inactivated RUNX1 to transcription\n",
      "N00117\tE2A-PBX1 fusion to transcriptional activation\n",
      "N00118\tTEL-AML1 fusion to transcriptional repression\n",
      "N00119\tMLL-AF4 fusion to transcriptional activation\n",
      "N00120\tMLL-ENL fusion to transcriptional activation\n",
      "N00121\tLMO2-rearrangement to transcriptional activation\n",
      "N00122\tLMO2-rearrangement to transcriptional repression\n",
      "N00123\tAmplified REL to transcription\n",
      "N00124\tIGH-MAF fusion to transcriptional activation\n",
      "N00125\tIGH-MMSET fusion to transcriptional activation\n",
      "N00126\tPAX8-PPARG fusion to PPARG-mediated transcription\n",
      "N00127\tPRCC-TFE3 fusion to transcriptional activation\n",
      "N00128\tTMPRSS2-ERG fusion to transcriptional activation\n",
      "N00129\tTMPRSS2-ERG fusion to transcriptional repression\n",
      "N00130\tTMPRSS2-ETV5 fusion to transcriptional activation\n",
      "N00131\tAmplified MYCN to transcriptional activation\n",
      "N00132\tAmplified MYCN to transcriptional repression\n",
      "N00133\tEWSR1-FLI1 fusion to transcriptional activation\n",
      "N00134\tEWSR1-FLI1 fusion to transcriptional repression\n",
      "N00135\tEWSR1-ERG fusion to transcriptional activation\n",
      "N00136\tEWSR1-ATF1 fusion to transcriptional activation\n",
      "N00137\tEWSR1-WT1 fusion to transcriptional activation\n",
      "N00138\tEWSR1-NR4A3\n",
      "N00139\tFUS-DDIT3 fusion to CEBPB-mediated transcription\n",
      "N00140\tFUS-DDIT3 fusion to NFKB-mediated transcription\n",
      "N00141\tPAX3-FOXO1 fusion to transcriptional activation\n",
      "N00142\tSYT-SSX fusion to transcriptional repression\n",
      "N00143\tASPL-TFE3 fusion to transcriptional activation\n",
      "N00144\tTLX1 rearrangement to transcriptional repression\n",
      "N00145\tExtrinsic apoptotic pathway\n",
      "N00146\tCrosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00147\tEGF-EGFR-PLCG-calcineurin signaling pathway\n",
      "N00148\tTLR3-IRF7 signaling pathway\n",
      "N00149\tTLR3-IRF3 signaling pathway\n",
      "N00150\tType I IFN signaling pathway\n",
      "N00151\tTNF-NFKB signaling pathway\n",
      "N00152\tCXCR-GNB/G-ERK signaling pathway\n",
      "N00153\tCCR/CXCR-GNB/G-PI3K-RAC signaling pathway\n",
      "N00154\tCXCR-GNB/G-PI3K-AKT signaling pathway\n",
      "N00155\tAutophagy-vesicle nucleation/elongation/maturation, mTORC1-PI3KC3-C1\n",
      "N00156\tAutophagy-vesicle nucleation/elongation/maturation, LC3-II formation\n",
      "N00157\tKSHV vGPCR to GNB/G-ERK signaling pathway\n",
      "N00158\tKSHV vGPCR to GNB/G-PI3K-AKT signaling pathway\n",
      "N00159\tKSHV K1 to PI3K signaling pathway\n",
      "N00160\tKSHV K1 to RAS-ERK signaling pathway\n",
      "N00161\tKSHV vIRF1/2 to TLR3-IRF3 signaling pathway\n",
      "N00162\tKSHV vIRF3 to TLR3-IRF7 signaling pathway\n",
      "N00163\tKSHV KIE1/2 to TLR3-IRF7 signaling pathway\n",
      "N00164\tKSHV vBCL2 to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00165\tKSHV vIAP to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00166\tKSHV vFLIP to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00167\tKSHV vIRF1/3 to p21-cell cycle G1/S\n",
      "N00168\tKSHV vCyclin to cell cycle G1/S\n",
      "N00169\tKSHV LANA to p21-cell cycle G1/S\n",
      "N00170\tKSHV LANA to cell cycle G1/S\n",
      "N00171\tKSHV vFLIP to NFKB signaling pathway\n",
      "N00172\tKSHV K15 to PLCG-calcineurin signaling pathway\n",
      "N00173\tKSHV K15 to TNF-NFKB signaling pathway\n",
      "N00174\tKSHV vFLIP to TNF-NFKB signaling pathway\n",
      "N00175\tKSHV LANA to Wnt signaling pathway\n",
      "N00176\tKSHV vFLIP to autophagy-vesicle elongation\n",
      "N00177\tKSHV vBCL2 to autophagy-vesicle nucleation\n",
      "N00178\tKSHV vGPCR to GNB/G-PI3K-JNK signaling pathway\n",
      "N00179\tKSHV K1 to PI3K-NFKB signaling pathway\n",
      "N00180\tKSHV K1 to PLCG-calcineurin signaling pathway\n",
      "N00181\tKSHV vIL-6 to Jak-STAT signaling pathway\n",
      "N00182\tIGF-IGFR-PI3K-NFKB signaling pathway\n",
      "N00184\tKSHV MIR1/2 to antigen processing and presentation by MHC class I molecules\n",
      "N00185\tKSHV MIR2 to cell surface molecule-endocytosis\n",
      "N00186\tIL1-IL1R-p38 signaling pathway\n",
      "N00187\tKSHV Kaposin B to p38 signaling pathway\n",
      "N00188\tIL1-IL1R-JNK signaling pathway\n",
      "N00189\tKSHV K15 to JNK signaling pathway\n",
      "N00212\tKSHV vCCL2 to CCR signaling pathway\n",
      "N00213\tKSHV Kaposin to alternative pathway of complement cascade\n",
      "N00215\tKITLG-KIT-RAS-ERK signaling pathway\n",
      "N00216\tHGF-MET-RAS-ERK signaling pathway\n",
      "N00217\tFLT3LG-FLT3-RAS-ERK signaling pathway\n",
      "N00218\tFLT3LG-FLT3-RAS-PI3K signaling pathway\n",
      "N00219\tFLT3LG-FLT3-STAT5 signaling pathway\n",
      "N00220\tPTEN-PIP3-AKT signaling pathway\n",
      "N00221\tHTLV-1 Tax to spindle assembly checkpoint signaling\n",
      "N00222\tHTLV-1 Tax to spindle assembly checkpoint signaling\n",
      "N00223\tEBV EBNA1 to p53-mediated transcription\n",
      "N00224\tEBV EBNALP RBP-Jk-mediated transcription\n",
      "N00225\tEBV EBNA2 to RBP-Jk-mediated transcription\n",
      "N00226\tEBV EBNA3A/3B/3C to RBP-Jk-mediated transcription\n",
      "N00227\tTGFA-EGFR-PLCG-PKC signaling pathway\n",
      "N00228\tTGFA-overexpression to PLCG-PKC signaling pathway\n",
      "N00229\tTGFA-EGFR-RAS-ERK signaling pathway\n",
      "N00230\tTGFA-overexpression to RAS-ERK signaling pathway\n",
      "N00231\tTGFA-EGFR-PI3K signaling pathway\n",
      "N00232\tTGFA-overexpression to PI3K signaling pathway\n",
      "N00233\tIGF-IGF1R-RAS-ERK signaling pathway\n",
      "N00234\tIGF2-IGF1R-PI3K signaling pathway\n",
      "N00235\tIGF2-overexpression to RAS-ERK signaling pathway\n",
      "N00236\tIGF2-overexpression to PI3K signaling pathway\n",
      "N00237\tIGF1R-overexpression to RAS-ERK signaling pathway\n",
      "N00238\tIGF1R-overexpression to PI3K signaling pathway\n",
      "N00239\tTelomerase activity\n",
      "N00240\tTERT-overexpression to telomerase activity\n",
      "N00241\tTGFBR2-reduced expression to TGF-beta signaling pathway\n",
      "N00242\tMutation-inactivated AXIN to Wnt signaling pathway\n",
      "N00243\tKEAP1-NRF2 signaling pathway\n",
      "N00244\tMutation-inactivated KEAP1 to KEAP1-NRF2 signaling pathway\n",
      "N00245\tMutation-activated NRF2 to KEAP1-NRF2 signaling pathway\n",
      "N00246\tHGF-overexpression to RAS-ERK signaling pathway\n",
      "N00247\tHGF-overexpression to PI3K signaling pathway\n",
      "N00248\tMET-overexpression to RAS-ERK signaling pathway\n",
      "N00249\tMET-overexpression to PI3K signaling pathway\n",
      "N00250\tCDX2-overexpression to transcriptional activation\n",
      "N00251\tCDX2-overexpression to transcriptional repression\n",
      "N00252\tAmplified ERBB2 to RAS-ERK signaling pathway\n",
      "N00253\tAmplified ERBB2 to PI3K signaling pathway\n",
      "N00254\tCDKN1B-reduced expression to p27-cell cycle G1/S\n",
      "N00255\tAmplified CCNE to cell cycle G1/S\n",
      "N00256\tTGFBR1-reduced expression to TGF-beta signaling pathway\n",
      "N00257\tLoss of CDH1 to beta-catenin signaling pathway\n",
      "N00258\tMutation-inactivated CDH1 to beta-catenin signaling pathway\n",
      "N00259\tAmplified MET to RAS-ERK signaling pathway\n",
      "N00260\tAmplified MET to PI3K signaling pathway\n",
      "N00261\tKSHV vIRF2 to IFN signaling pathway\n",
      "N00262\tEBV EBNA3C to intrinsic apoptotic pathway\n",
      "N00263\tEBV EBNA3C to p53-mediated transcription\n",
      "N00264\tEBV EBNA3C to p27-Cell cycle G1/S\n",
      "N00265\tEBV LMP1 to NFKB signaling pathway\n",
      "N00266\tEBV LMP2A to PI3K signaling pathway\n",
      "N00267\tHBV HBx to PI3K signaling pathway\n",
      "N00268\tHBV HBx to RIG-I-like receptor signaling pathway\n",
      "N00269\tHCV core to TNF-NFKB signaling pathway\n",
      "N00270\tHCV Core to IFN signaling pathway\n",
      "N00271\tHCV NS3/4A to RIG-I-like receptor signaling pathway\n",
      "N00272\tHCV NS5A to PI3K signaling pathway\n",
      "N00273\tHCV NS5A to oligoadenylate synthetase (OAS)/RNase L pathway\n",
      "N00274\tHCV NS5A to RAS-ERK signaling pathway\n",
      "N00275\tAmplified CCND1 to cell cycle G1/S\n",
      "N00276\tEGF-overexpression to RAS-ERK signaling pathway\n",
      "N00277\tEREG-EGFR-RAS-ERK signaling pathway\n",
      "N00278\tEREG-overexpression to RAS-ERK signaling pathway\n",
      "N00279\tAREG-EGFR-RAS-ERK signaling pathway\n",
      "N00280\tAREG-overexpression to RAS-ERK signaling pathway\n",
      "N00281\tEGF-overexpression to PI3K signaling pathway\n",
      "N00282\tEREG-EGFR-PI3K signaling pathway\n",
      "N00283\tEREG-overexpression to PI3K signaling pathway\n",
      "N00284\tAREG-EGFR-PI3K signaling pathway\n",
      "N00285\tAREG-overexpression to PI3K signaling pathway\n",
      "N00286\tNuclear-initiated estrogen signaling pathway\n",
      "N00287\tESR1-positive to nuclear-initiated estrogen signaling pathway\n",
      "N00288\tPTH-PTH1R-PKA signaling pathway\n",
      "N00290\tMutation-inactivated MEN1 to transcription\n",
      "N00291\tCaSR-PTH signaling pathway\n",
      "N00293\tGCM2-mediated transcription\n",
      "N00297\tACTH-cortisol signaling pathway\n",
      "N00298\tCYP11B1-CYP11B2 fusion to ACTH-cortisol signaling pathway\n",
      "N00301\tAngiotensin-aldosterone signaling pathway\n",
      "N00302\tMutation-activated CACNA1D/H to angiotensin-aldosterone signaling pathway\n",
      "N00303\tMutation-activated KCNJ5 to angiotensin-aldosterone signaling pathway\n",
      "N00304\tMutation-inactivated ATP1A1 to angiotensin-aldosterone signaling pathway\n",
      "N00305\tMutation-inactivated ATP2B3 to angiotensin-aldosterone signaling pathway\n",
      "N00306\tSF-1-mediated transcription\n",
      "N00309\tCortisone reduction\n",
      "N00311\tNADPH generation\n",
      "N00313\tTransport of cortisol\n",
      "N00315\tMutation-inactivated AIP to AhR-mediated transcription\n",
      "N00316\tMutation-inactivated CDKN1B to p27-cell cycle G1/S\n",
      "N00317\tAhR signaling pathway\n",
      "N00318\tEGFR-ERK-ACTH signaling pathway\n",
      "N00319\tMutation-activated USP8 to EGFR-ERK-ACTH signaling pathway\n",
      "N00320\tMutation-activated PRKACA to ACTH-cortisol signaling pathway\n",
      "N00321\tMutation-activated GNAS to ACTH-cortisol signaling pathway\n",
      "N00322\tMutation-inactivated PRKAR1A to ACTH-cortisol signaling pathway\n",
      "N00323\tMutation-inactivated PDE11A/PDE8B to ACTH-cortisol signaling pathway\n",
      "N00324\tCRHR-PKA-ACTH signaling pathway\n",
      "N00325\tMutation-inactivated RASD1 to CRHR-PKA-ACTH signaling pathway\n",
      "N00326\tMutation-activated GNAS to CRHR-PKA-ACTH signaling pathway\n",
      "N00327\tMutation-inactivated PRKAR1A to CRHR-PKA-ACTH signaling pathway\n",
      "N00332\tVesicular uptake of lipoproteins\n",
      "N00336\tPCSK9-mediated LDLR degradation\n",
      "N00338\tSteroid hormone biosynthesis, progesterone to cortisol/cortisone\n",
      "N00339\tSteroid hormone biosynthesis, progesterone to aldosterone\n",
      "N00340\tThe Scribble/Dlg/Lgl polarity module\n",
      "N00341\tHPV E6 to the Scribble/Dlg/Lgl polarity module\n",
      "N00342\tMAGI-PTEN signaling pathway\n",
      "N00343\tHPV E6 to MAGI-PTEN signaling pathway\n",
      "N00344\tCRB3-Pals1-PATJ complex\n",
      "N00345\tHPV E6 to CRB3-Pals1-PATJ complex\n",
      "N00346\tHPV E6 to TLR-IRF3 signaling pathway\n",
      "N00347\tp300-p21-Cell cycle G1/S\n",
      "N00348\tHPV E6 to p300-p21-Cell cycle G1/S\n",
      "N00349\tHPV E6 to p300-p21-Cell cycle G1/S\n",
      "N00350\tHPV E6 to extrinsic apoptotic pathway\n",
      "N00351\tHPV E6 to extrinsic apoptotic pathway\n",
      "N00352\tHPV E6 to extrinsic apoptotic pathway\n",
      "N00353\tHPV E6 to PTEN-PIP3-AKT signaling pathway\n",
      "N00354\tHPV E6 to PTEN-PIP3-AKT signaling pathway\n",
      "N00355\tPP2A-AKT signaling pathway\n",
      "N00356\tHPV E7 to PP2A-AKT signaling patyway\n",
      "N00357\tHPV E6 to MTOR signaling pathway\n",
      "N00358\tHPV E6 to p21-cell cycle G1/S\n",
      "N00359\tHPV E7 to p27-cell cycle G1/S\n",
      "N00360\tHPV E7 to p27-cell cycle G1/S\n",
      "N00361\tHPV E7 to cell cycle G1/S\n",
      "N00362\tHPV E5 to p21-cell cycle G1/S\n",
      "N00363\tAntigen processing and presentation by MHC class I molecules\n",
      "N00364\tHPV E5 to antigen processing and presentation by MHC class I molecules\n",
      "N00365\tHPV E7 to cell cycle G1/S\n",
      "N00366\tHPV E5 to EGFR-PI3K signaling pathway\n",
      "N00367\tHPV E5 to EGFR-RAS-ERK signaling pathway\n",
      "N00368\tHPV E5 to PDGFR-PI3K signaling pathway\n",
      "N00369\tHPV E5 to PDGFR-RAS-ERK signaling pathway\n",
      "N00370\tPyruvate generation\n",
      "N00371\tHPV E7 to pyruvate generation\n",
      "N00372\tHPV E7 to p300-p21-Cell cycle G1/S\n",
      "N00373\tHPV E6 to NFX1-mediated transcription\n",
      "N00374\tTNF-IRF1 signaling pathway\n",
      "N00375\tHPV E7 to TNF-IRF1 signaling pathway\n",
      "N00376\tHPV E7 to TBP1-mediated transcription\n",
      "N00377\tHPV E6 to IFN signaling pathway\n",
      "N00378\tHPV E6 to IFN signaling pathway\n",
      "N00379\tHPV E7 to IFN signaling pathway\n",
      "N00380\tHPV E6 to Notch signaling pathway\n",
      "N00381\tHPV E6 to Notch signaling pathway\n",
      "N00382\tHPV E6 to Notch signaling pathway\n",
      "N00383\tHPV E6 to intrinsic apoptotic pathway\n",
      "N00384\tHPV E6 to intrinsic apoptotic pathway\n",
      "N00385\tHCMV gB to PDGFR-PI3K signaling pathway\n",
      "N00386\tHCMV gB to PDGFR-RAS-ERK signaling pathway\n",
      "N00387\tHCMV IE1-72/IE2-86 to PI3K signaling pathway\n",
      "N00388\tHCMV UL38 to MTOR signaling pathway\n",
      "N00389\tHCMV IE1-72 to transcription\n",
      "N00390\tEGF-EGFR-PI3K-NFKB signaling pathway\n",
      "N00391\tHCMV gB to EGFR-PI3K-NFKB signaling pathway\n",
      "N00392\tHCMV gB to EGFR-RAS-ERK signaling pathway\n",
      "N00393\tITGA/B-RhoGAP-RhoA signaling pathway\n",
      "N00394\tHCMV gH to ITGA/B-RhoA signaling pathway\n",
      "N00395\tcGAS-STING signaling pathway\n",
      "N00396\tHCMV UL82 to cGAS-STING signaling pathway\n",
      "N00397\tHCMV UL26 to NFKB signaling pathway\n",
      "N00398\tHCMV IE2-86 to TNF-NFKB signaling pathway\n",
      "N00399\tCCR2-GNB/G-PI3K-NFKB signaling pathway\n",
      "N00400\tHCMV US28 to GNB/G-PI3K-NFKB signaling pathway\n",
      "N00401\tCXCR4-GNAQ-PLCB/G-calcineurin signaling pathway\n",
      "N00402\tHCMV US28 to GNAQ-PLCB/G-calcineurin signaling pathway\n",
      "N00403\tCX3CR1-GNAI-AC-PKA signaling pathway\n",
      "N00404\tHCMV US28 to GNAI-AC-PKA signaling pathway\n",
      "N00405\tCXCR4-GNA12/13-Rho signaling pathway\n",
      "N00406\tHCMV US28 to GNA12/13-Rho signaling pathway\n",
      "N00407\tHCMV UL33 to GNAQ-PLCB/G-calcineurin signaling pathway\n",
      "N00408\tLPAR-GNB/G-Rho signaling pathway\n",
      "N00409\tHCMV UL33 to GNB/G-Rho signaling pathway\n",
      "N00410\tDRD1-GNAS-AC-PKA signaling pathway\n",
      "N00411\tHCMV UL33 to GNAS-AC-PKA signaling pathway\n",
      "N00412\tHCMV UL33 to GNAI-AC-PKA signaling pathway\n",
      "N00413\tCXCR4-GNB/G-PLCB-PKC signaling pathway\n",
      "N00414\tHCMV US27 to CXCR4-GNB/G-PLCB-PKC signaling pathway\n",
      "N00415\tIL10 family to Jak-STAT signaling pathway\n",
      "N00416\tHCMV vIL10 to IL10-JAK-STAT signaling pathway\n",
      "N00417\tHCMV US6 to antigen processing and presentation by MHC class I molecules\n",
      "N00418\tHCMV US2/11 to antigen processing and presentation by MHC class I molecules\n",
      "N00419\tHCMV US3/10 to antigen processing and presentation by MHC class I molecules\n",
      "N00420\tHCMV IE2-86 to p21-cell cycle G1/S\n",
      "N00421\tHCMV IE2-86 to p21-cell cycle G1/S\n",
      "N00422\tHCMV IE2-86 to cell cycle G1/S\n",
      "N00423\tHCMV IE1-72 to cell cycle G1/S\n",
      "N00424\tHCMV pp71 to cell cycle G1/S\n",
      "N00425\tHCMV UL36 to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00426\tHCMV UL37x1 to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00427\tHCMV vCXCL to CXCR-GNB/G-PI3K-AKT signaling pathway\n",
      "N00428\tCCR5-GNB/G-PLCB/G-PKC signaling pathway\n",
      "N00429\tHCMV UL22A to CCR5-GNB/G-PLCB/G-PKC signaling pathway\n",
      "N00430\tCXCR4-GNAI-PI3K-BAD signaling pathway\n",
      "N00431\tHIV gp120 to CXCR4-GNAI-PI3K-BAD signaling pathway\n",
      "N00432\tHIV gp120 to CXCR4-GNAQ-PLCB/G-calcineurin\n",
      "N00433\tCXCR4-GNB/G-RAC signaling pathway\n",
      "N00434\tHIV gp120 to CXCR4-GNB/G-RAC signaling pathway\n",
      "N00435\tTLR1/2/4-NFKB signaling pathway\n",
      "N00436\tHIV Tat to TLR2/4-NFKB signaling pathway\n",
      "N00437\tHIV Vpu to TLR2/4-NFKB signaling pathway\n",
      "N00438\tTLR2/4-MAPK signaling pathway\n",
      "N00439\tHIV Nef to TLR2/4-MAPK signaling pathway\n",
      "N00440\tHIV Vpu/Vif/Vpr to cGAS-STING signaling pathway\n",
      "N00441\tHIV gp120 to TNF-NFKB signaling pathway\n",
      "N00442\tHIV Nef to TNF-NFKB signaling pathway\n",
      "N00443\tHIV Vpr/Nef/Tat to TNF-NFKB signaling pathway\n",
      "N00444\tTNF-p38 signaling pathway\n",
      "N00445\tHIV Tat/Nef to TNF-p38 signaling pathway\n",
      "N00446\tTNF-JNK signaling pathway\n",
      "N00447\tHIV Vpr/Tat to TNF-JNK signaling pathway\n",
      "N00448\tHIV Tat/Nef to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00449\tHIV Tat/Nef to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00450\tHIV Tat to intrinsic apoptotic pathway\n",
      "N00451\tHIV Tat to intrinsic apoptotic pathway\n",
      "N00452\tHIV Nef to intrinsic apoptotic pathway\n",
      "N00453\tHIV Vpr to intrinsic apoptotic pathway\n",
      "N00454\tHIV Vpr to intrinsic apoptotic pathway\n",
      "N00455\tCDC25-Cell cycle G2/M\n",
      "N00456\tHIV Vpr to CDC25-cell cycle G2M\n",
      "N00457\tHIV Vpr to cell cycle G2M\n",
      "N00458\tHIV Vpr to CDC25-cell cycle G2M\n",
      "N00459\tWEE1-Cell cycle G2/M\n",
      "N00460\tHIV Vpr to WEE1-cell cycle G2M\n",
      "N00461\tHIV Nef to antigen processing and presentation by MHC class I molecules\n",
      "N00462\tKSHV vCCL1/2/3 to CCR signaling pathway\n",
      "N00465\tDeleted DMD to dystrophin-associated protein complex\n",
      "N00466\tEBV BPLF1 to TLR2/4-NFKB signaling pathway\n",
      "N00467\tEBV BPLF1 to TLR2/4-NFKB signaling pathway\n",
      "N00468\tEBV BPLF1 to TLR2/4-NFKB signaling pathway\n",
      "N00469\tRIG-I-IRF7/3 signaling pathway\n",
      "N00470\tEBV BGLF4 to RIG-I-like receptor signaling pathway\n",
      "N00471\tEBV LMP2A/2B to IFN signaling pathway\n",
      "N00472\tEBV LMP1 to IFN signaling pathway\n",
      "N00473\tEBV BGLF4 to IFN signaling pathway\n",
      "N00474\tEBV BHRF1 to intrinsic apoptotic pathway\n",
      "N00475\tEBV BHRF1 to intrinsic apoptotic pathway\n",
      "N00476\tEBV BHRF1 to intrinsic apoptotic pathway\n",
      "N00477\tEBV BHRF1 to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00478\tEBV BARF1 to intrinsic apoptotic pathway\n",
      "N00479\tEBV BNLF2a to antigen processing and presentation by MHC class I molecules\n",
      "N00480\tEBV BILF1 to antigen processing and presentation by MHC class I molecules\n",
      "N00481\tEBV BZLF1 to p53-mediated transcription\n",
      "N00482\tEBV EBNA3C to p27-Cell cycle G1/S\n",
      "N00483\tEBV EBNA3C to cell cycle G1/S\n",
      "N00484\tEBV EBNA3C to cell cycle G1/S\n",
      "N00485\tEBV LMP1 to PI3K signaling pathway\n",
      "N00486\tEBV LMP1 to Jak-STAT signaling pathway\n",
      "N00487\tBCR-PLCG-Calcineurin signaling pathway\n",
      "N00488\tEBV LMP2A to BCR signaling pathway\n",
      "N00489\tHTLV-1 p30II to c-myc-mediated transcription\n",
      "N00490\tHTLV-1 p12 to calcineurin signaling pathway\n",
      "N00491\tHTLV-1 p12 to Jak-STAT signaling pathway\n",
      "N00492\tHTLV-1 p12 to antigen processing and presentation by MHC class I molecules\n",
      "N00493\tSpindle assembly checkpoint signaling\n",
      "N00494\tHTLV-1 Tax to p16-cell cycle G1/S\n",
      "N00495\tHTLV-1 Tax to p15-cell cycle G1/S\n",
      "N00497\tHTLV-1 Tax to p21-cell cycle G1/S\n",
      "N00498\tHTLV-1 Tax to p21-cell cycle G1/S\n",
      "N00499\tATR-p21-Cell cycle G2/M\n",
      "N00500\tHTLV-1 Tax to p21-cell cycle G2/M\n",
      "N00501\tHTLV-1 Tax to EGFR-PI3K-NFKB signaling pathway\n",
      "N00502\tHTLV-1 Tax to PTEN-PIP3-AKT signaling pathway\n",
      "N00503\tHTLV-1 Tax to TNF-JNK signaling pathway\n",
      "N00504\tHTLV-1 Tax to NFKB signaling pathway\n",
      "N00505\tCD40-NFKB signaling pathway\n",
      "N00506\tHTLV-1 Tax to CD40-NFKB signaling pathway\n",
      "N00507\tHTLV-1 Tax to TGF-beta signaling pathway\n",
      "N00508\tHTLV-1 Tax to NFY-mediated transcription\n",
      "N00509\tHTLV-1 Tax to SRF-mediated transcription\n",
      "N00510\tHTLV-1 Tax to CREB-mediated transcription\n",
      "N00511\tHTLV-1 Tax to E47-mediated transcription\n",
      "N00512\tHTLV-1 Tax to c-myc-mediated transcription\n",
      "N00513\tMutation-activated EGFR to RAS-ERK signaling pathway\n",
      "N00514\tMutation-activated EGFR to PI3K signaling pathway\n",
      "N00515\tOligoadenylate synthetase (OAS)/RNase L pathway\n",
      "N00516\tHCV NS3/4A to TLR3-IRF3 signaling pathway\n",
      "N00517\tHCV NS3/4A to TLR3-IRF3 signaling pathway\n",
      "N00518\tHCV Core to ERK signaling pathway\n",
      "N00519\tHCV Core to ERK signaling pathway\n",
      "N00520\tHCV NS5A to p21-cell cycle G1/S\n",
      "N00521\tHCV Core to p21-cell cycle G1/S\n",
      "N00522\tHCV NS3 to p21-cell cycle G1/S\n",
      "N00523\tHCV Core to p21-cell cycle G1/S\n",
      "N00524\tHCV NS5A to extrinsic apoptotic pathway\n",
      "N00525\tHCV NS5A to TNF-NFKB signaling pathway\n",
      "N00526\tHCV NS3 to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00527\tHCV Core to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00528\tHCV core to extrinsic apoptotic pathway\n",
      "N00529\tHCV core to RXRA/PPARA-mediated transcription\n",
      "N00530\tHCV core to RXRA/LXRA-mediated transcription\n",
      "N00531\tHBV HBx to TGF-beta signaling pathway\n",
      "N00532\tHBV HBx to Egr-mediated transcription\n",
      "N00533\tHBV HBx to Crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00534\tHBV HBx to Crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00535\tHBV HBx to p53-mediated transcription\n",
      "N00536\tMDM2-p21-Cell cycle G1/S\n",
      "N00537\tHBV HBx to cell cycle G1/S\n",
      "N00538\tCa2+-PYK2-RAS-ERK signaling pathway\n",
      "N00539\tHBV HBx to Ca2+-PYK2-RAS-ERK signaling pathway\n",
      "N00540\tHBV HBx to RAS-ERK signaling pathway\n",
      "N00541\tHBV HBx to RAS-ERK signaling pathway\n",
      "N00542\tEGF-EGFR-RAS-JNK signaling pathway\n",
      "N00543\tHBV HBx to JNK signaling pathway\n",
      "N00544\tHBV HBx to CREB-mediated transcription\n",
      "N00545\tHBV HBx to ERK signaling pathway\n",
      "N00546\tCXCL12-CXCR4-PKC-ERK signaling pathaway\n",
      "N00547\tHBV LHBs to PKC-ERK signaling pathway\n",
      "N00548\tHBV HBx to Jak-STAT signaling pathway\n",
      "N00549\tHBV HBeAg to TLR2/4-NFKB signaling pathway\n",
      "N00550\tHBV HBeAg to TLR2/4-NFKB signaling pathway\n",
      "N00551\tHBV HBs to TLR2/4-MAPK signaling pathway\n",
      "N00552\tHBV pol to TLR3-IRF3 signaling pathway\n",
      "N00553\tTLR4-IRF3/7 signaling pathway\n",
      "N00554\tHBV HBe to TLR4-IRF3/7 signaling pathway\n",
      "N00555\tHBV HBe to TLR4-IRF3/7 signaling pathway\n",
      "N00556\tHBV HBe to TLR2/4-NFKB signaling pathway\n",
      "N00557\tHBV HBe to TLR2/4-NFKB signaling pathway\n",
      "N00558\tHBV pol to IFN signaling pathway\n",
      "N00559\tLIGHT-HVEM-NFKB signaling pathway\n",
      "N00560\tHSV gD to HVEM-NFKB signaling pathway\n",
      "N00561\tHSV ICP0 to TLR2/4-NFKB signaling pathway\n",
      "N00562\tHSV US3 to TLR2/4-NFKB signaling pathway\n",
      "N00563\tTLR3-NFKB signaling pathway\n",
      "N00564\tHSV US3 to TLR3-NFKB signaling pathway\n",
      "N00565\tHSV US11 to RIG-I-like receptor signaling pathway\n",
      "N00566\tHSV UL36USP to RIG-I-like receptor signaling pathway\n",
      "N00567\tHSV ICP34.5 to TBK1 signaling pathway\n",
      "N00568\tHSV US3 to IRF3 signaling pathway\n",
      "N00569\tHSV UL41 to cGAS-STING signaling pathway\n",
      "N00570\tHSV ICP0 to cGAS-STING signaling pathway\n",
      "N00571\tPKR-eIF2alpha signaling pathway\n",
      "N00572\tHSV ICP34.5 to PKR-eIF2alpha signaling pathway\n",
      "N00573\tHSV US11 to PKR-eIF2alpha signaling pathway\n",
      "N00574\tHSV US11 to oligoadenylate synthetase (OAS)/RNase L pathway\n",
      "N00575\tHSV ICP27 to IFN signaling pathway\n",
      "N00576\tHSV UL41/UL13 to IFN signaling pathway\n",
      "N00577\tHSV UL41 to IFN signaling pathway\n",
      "N00578\tHSV UL41 to IFN signaling pathway\n",
      "N00579\tHSV ICP6 to extrinsic apoptotic pathway\n",
      "N00580\tHSV ICP0 to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00581\tHSV ICP47 to antigen processing and presentation by MHC class I molecules\n",
      "N00582\tIGF-IGF1R-PI3K signaling pathway\n",
      "N00583\tHSV VP11/12 to PI3K signaling pathway\n",
      "N00584\tHSV US3 to MTOR signaling pathway\n",
      "N00585\tHSV US3 to intrinsic apoptotic pathway\n",
      "N00586\tNuclear export of mRNA\n",
      "N00587\tHSV ICP27 to Nuclear export of mRNA\n",
      "N00588\tHSV VP16 to Oct-1-mediated transcription\n",
      "N00589\tHSV gC to alternative pathway of complement cascade\n",
      "N00590\tAntigen processing and presentation by MHC class II molecules\n",
      "N00591\tHSV gB to antigen processing and presentation by MHC class II molecules\n",
      "N00592\tHSV ICP0 to p53-mediated transcription\n",
      "N00593\tUrea cycle\n",
      "N00599\tObligate allosteric activation of CPS1 by NAG\n",
      "N00600\tNAGS deficiency in urea cycle\n",
      "N00601\tHeme biosynthesis\n",
      "N00610\tDermatan sulfate degradation\n",
      "N00615\tHeparan sulfate degradation\n",
      "N00623\tKeratan sulfate degradation\n",
      "N00627\tMannose type O-glycan biosynthesis, POMT to POMK\n",
      "N00640\tHydrolysis of lactosylceramide\n",
      "N00642\tSaposin stimulation of GBA and GALC\n",
      "N00643\tLoss of saposin stimulation\n",
      "N00644\tHydrolysis of galabiosylceramide\n",
      "N00647\tHydrolysis of galactosylceramide sulfate\n",
      "N00649\tHydrolysis of sphingomyelin\n",
      "N00653\tN-Glycan precursor biosynthesis, ALG7 to ALG11\n",
      "N00667\tN-Glycan precursor biosynthesis, Glc-6P to Man-P-Dol\n",
      "N00673\tN-Glycan precursor biosynthesis, Glc-6P to UDP-Glu\n",
      "N00675\tN-Glycan precursor biosynthesis, farnesy-PP to P-Dol\n",
      "N00679\tGlucosylceramide synthesis in GBA deficiency\n",
      "N00680\tN-Glycan precursor biosynthesis, ALG3 to ALG9\n",
      "N00681\tN-Glycan precursor biosynthesis, ALG6 to OST\n",
      "N00682\tN-Glycan precursor biosynthesis, P-Dol to Glc-P-Dol\n",
      "N00683\tCD80/CD86-CD28-PI3K signaling pathway\n",
      "N00684\tMV F/H to CD28-PI3K signaling pathway\n",
      "N00685\tMV V to RIG-I-IRF7/3 signaling pathway\n",
      "N00686\tMV N to RIG-I-IRF7/3 signaling pathway\n",
      "N00687\tMV V/C to RIG-I-IRF7/3 signaling pathway\n",
      "N00688\tRIG-I-NFKB signaling pathway\n",
      "N00689\tMV V/P/C to RIG-I-NFKB signaling pathway\n",
      "N00690\tTLR7/9-IRF7 signaling pathway\n",
      "N00691\tMV V to TLR7/9-IRF7 signaling pathway\n",
      "N00692\tMV P to TLR2/4-NFKB signaling pathway\n",
      "N00693\tMV V/P to IFN signaling pathway\n",
      "N00694\tMV V/P/C to IFN signaling pathway\n",
      "N00695\tMV V to p73-mediated transcription\n",
      "N00696\tMV C to PKR-eIF2alpha signaling pathway\n",
      "N00697\tHV P to p53-mediated transcription\n",
      "N00698\tMannose type O-glycan biosynthesis, Rib-ol-5P to CDP-Rib-ol\n",
      "N00699\tMannose type O-glycan biosynthesis, FKTN to LARGE\n",
      "N00700\tTyrosine biosynthesis\n",
      "N00702\tTetrahydrobiopterin biosynthesis, GTP to BH4\n",
      "N00705\tTetrahydrobiopterin biosynthesis, BH4OH to BH4\n",
      "N00708\tTyrosine degradation\n",
      "N00713\tGlycogen biosynthesis\n",
      "N00718\tGlycogen degradation\n",
      "N00720\tGlycogen degradation (amylase)\n",
      "N00724\tIAV NS1 to oligoadenylate synthetase (OAS)/RNase L pathway\n",
      "N00725\tIAV NS1 to PKR-eIF2alpha signaling pathway\n",
      "N00726\tIAV NP to PKR-eIF2alpha signaling pathway\n",
      "N00727\tIAV NS1 to RIG-I-like receptor signaling pathway\n",
      "N00728\tIAV NS1 to RIG-I-like receptor signaling pathway\n",
      "N00729\tIAV NS1 to RIG-I-like receptor signaling pathway\n",
      "N00730\tIAV NS1 to RIG-I-like receptor signaling pathway\n",
      "N00731\tGlycolysis\n",
      "N00732\tIAV PB1-F2/PB2 to RIG-I-like receptor signaling pathway\n",
      "N00734\tIAV PB1-F2/PB2 to RIG-I-like receptor signaling pathway\n",
      "N00736\tIAV NS1 to PI3K signaling pathway\n",
      "N00738\tIAV NS1 to IFN signaling pathway\n",
      "N00741\tIAV M2 to cell cycle G1/S\n",
      "N00742\tNLRP3 inflammasome signaling pathway\n",
      "N00743\tIAV NS1 to NLRP3 inflammasome signaling pathway\n",
      "N00744\tIAV HA to ERK signaling pathway\n",
      "N00745\tIAV PB1-F2 to intrinsic apoptotic pathway\n",
      "N00746\tIAV NS1 to nuclear export of mRNA\n",
      "N00748\tGPI-anchor biosynthesis\n",
      "N00759\tSteroid hormone biosynthesis, cholesterol to pregnenolone/progesterone\n",
      "N00765\tbeta-Oxidation, acyl-CoA synthesis\n",
      "N00776\tbeta-Oxidation, peroxisome, VLCFA\n",
      "N00779\tbeta-Oxidation, peroxisome, bile acid\n",
      "N00782\tTSH-TG signaling pathway\n",
      "N00786\tTransport of iodide\n",
      "N00789\tMutation-inactivated TPO to iodide organification/coupling reactions\n",
      "N00791\tDeiodination of MIT and DIT\n",
      "N00793\tTSH-DUOX2-TG signaling pathway\n",
      "N00795\tDUOX2-generated H2O2 production\n",
      "N00798\tThyroid hormone signaling pathway\n",
      "N00803\tIodide organification/coupling reactions\n",
      "N00804\tbeta-Oxidation\n",
      "N00805\tBile acid biosynthesis\n",
      "N00812\tTransport of carnitine\n",
      "N00814\tTransport of L-palmitoylcarnitine\n",
      "N00816\tTransport of glucose 6-phosphate\n",
      "N00818\tTransport of glucose\n",
      "N00820\tN-Glycan biosynthesis\n",
      "N00824\tTransport of GDP-fucose\n",
      "N00826\tTransport of UDP-galactose\n",
      "N00828\tTransport of CMP-N-acetylneuraminate\n",
      "N00830\tTransport of Man5GlcNAc2-PP-dolichol\n",
      "N00832\tBranched-chain amino acids degradation 1\n",
      "N00842\tPropanoyl-CoA metabolism\n",
      "N00847\tGalactose degradation\n",
      "N00851\tLeucine degradation\n",
      "N00852\tValine degradation\n",
      "N00856\tIsoleucine degradation\n",
      "N00859\tYersinia YopP/J to TLR2/4-NFKB signaling pathway\n",
      "N00862\tYersinia YopP/J to TLR2/4-MAPK signaling pathway\n",
      "N00863\tYersinia YopM to NLRP3 Inflammasome signaling pathway\n",
      "N00864\tYersinia YopK to NLRP3 Inflammasome signaling pathway\n",
      "N00865\tYersinia YopK to NLRC4 Inflammasome signaling pathway\n",
      "N00866\tYersinia YopM to Pyrin Inflammasome signaling pathway\n",
      "N00867\tNLRC4 inflammasome signaling pathway\n",
      "N00868\tPyrin inflammasome signaling pathway\n",
      "N00869\tKISS1-KISS1R-PLCB-PKC signaling pathway\n",
      "N00873\tGnRH-GnRHR-PLCB-PKC signaling pathway\n",
      "N00879\tPROK-PRKR-Gi-ERK signaling pathway\n",
      "N00882\tTAC3-TACR3-PLC-PKC signaling pathway\n",
      "N00885\tLHCGR-GNAS-PKA signaling pathway\n",
      "N00888\tHypoxanthine oxidation\n",
      "N00890\tMolybdenum cofactor biosynthesis\n",
      "N00899\t5-Oxoproline metabolism\n",
      "N00904\tGlutathione reduction\n",
      "N00905\tNADP+ reduction\n",
      "N00907\tGH-Jak-STAT signaling pathway\n",
      "N00910\tGHRHR-PKA-GH signaling pathway\n",
      "N00915\tAVP-V2R-PKA signaling pathway\n",
      "N00918\tTRH-TRHR-PLCB-PKC signaling pathway\n",
      "N00920\tPRL-JAK-STAT signaling pathway\n",
      "N00922\tFSHR-GNAS-PKA signaling pathway\n",
      "N00924\tGlucocorticoid receptor signaling pathway\n",
      "N00926\tEscherichia Tir to TLR2/4-MAPK signaling pathway\n",
      "N00927\tEscherichia/Shigella NleE/OspZ to TNF-NFKB signaling pathway\n",
      "N00928\tEscherichia NleB to TNF-NFKB signaling pathway\n",
      "N00929\tEscherichia NleC to TNF-NFKB signaling pathway\n",
      "N00930\tEscherichia NleD to TNF-JNK signaling pathway\n",
      "N00931\tEscherichia NleD to TNF-p38 signaling pathway\n",
      "N00932\tEscherichia NleH1 to TNF-NFKB signaling pathway\n",
      "N00933\tEscherichia NleA to NLRP3 inflammasome signaling pathway\n",
      "N00934\tNon-canonical inflammasome signaling pathway\n",
      "N00935\tEscherichia NleF to non-canonical inflammasome signaling pathway\n",
      "N00936\tEscherichia NleB1 to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00937\tEscherichia NleF to extrinsic apoptotic pathway\n",
      "N00938\tEscherichia NleH to intrinsic apoptotic pathway\n",
      "N00939\tEscherichia EspF to intrinsic apoptotic pathway\n",
      "N00940\tNOD-NFKB signaling pathway\n",
      "N00941\tShigella IpaH9.8 to NOD-NFKB signaling pathway\n",
      "N00942\tShigella OspG to TNF-NFKB signaling pathway\n",
      "N00943\tShigella IpaH4.5 to TNF-NFKB signaling pathway\n",
      "N00944\tShigella OspI to TNF-NFKB signaling pathway\n",
      "N00945\tShigella IpaH1.4/2.5 to TNF-NFKB signaling pathway\n",
      "N00946\tShigella IpaJ to cGAS-STING signaling pathway\n",
      "N00947\tShigella Ipa4.5 to cGAS-STING signaling pathway\n",
      "N00948\tShigella IpaH7.8 to NLRP3 Inflammasome signaling pathway\n",
      "N00949\tShigella IpaB to NLRC4 Inflammasome signaling pathway\n",
      "N00950\tShigella FimA to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N00951\tITGA/B-RHOG-RAC signaling pathway\n",
      "N00952\tShigella IpgB1 to ITGA/B-RHOG-RAC signaling pathway\n",
      "N00953\tmGluR1-TRPC3 signaling pathway\n",
      "N00954\tMutation-activated GRM1 to mGluR1-TRPC3 signaling pathway\n",
      "N00955\tMutation-inactivated PRKCG to mGluR1-TRPC3 signaling pathway\n",
      "N00956\tMutation-activated PRKCG to mGluR1-TRPC3 signaling pathway\n",
      "N00957\tMutation-caused abberant ATXN2/3 to mGluR5-Ca2+ -apoptotic pathway\n",
      "N00958\tMutation-activated ITPR1 to mGluR1-TRPC3 signaling pathway\n",
      "N00959\tITPR1-reduced expression to mGluR1-TRPC3 signaling pathway\n",
      "N00960\tMutation-caused aberrant SPTBN2 to mGluR1-TRPC3 signaling pathway\n",
      "N00961\tMutation-activated TRPC3 to mGluR1-TRPC3 signaling pathway\n",
      "N00962\tMutation-inactivated ATXN3 to autophagy-vesicle nucleation\n",
      "N00963\tRELN-VLDLR-PI3K signaling pathway\n",
      "N00964\tDAB1-overexpression to RELN-VLDLR-PI3K signaling pathway\n",
      "N00965\tRORA-mediated transcription\n",
      "N00966\tMutation-caused aberrant ATXN1 to RORA-mediated transcription\n",
      "N00967\tVGCC-Ca2+ -apoptotic pathway\n",
      "N00968\tMutation-activated CACNA1A to VGCC-Ca2+ -apoptotic pathway\n",
      "N00969\tMutation-inactivated CACNA1A to VGCC-Ca2- -apoptotic pathway\n",
      "N00970\tTransport of calcium\n",
      "N00971\tMutation-caused aberrant PDYN to transport of calcium\n",
      "N00972\tTransport of potassium\n",
      "N00973\tMutation-inactivated KCNC3 to transport of potassium\n",
      "N00974\tTransport of potassium\n",
      "N00975\tMutation-inactivated KCND3 to transport of potassium\n",
      "N00976\tRetrograde axonal transport\n",
      "N00977\tMutation-caused aberrant Htt to retrograde axonal transport\n",
      "N00978\tAnterograde axonal transport\n",
      "N00979\tMutation-caused aberrant Htt to anterograde axonal transport\n",
      "N00980\tMutation-caused aberrant Htt to REST-mediated transcriptional repression\n",
      "N00981\tMutation-caused aberrant Htt to CREB-mediated transcription\n",
      "N00982\tMutation-caused aberrant Htt to p53-mediated transcription\n",
      "N00983\tMutation-caused aberrant Htt to extrinsic apoptotic pathway\n",
      "N00984\tmGluR5-Ca2+ -apoptotic pathway\n",
      "N00985\tMutation-caused aberrant Htt to mGluR5-Ca2+ -apoptotic pathway\n",
      "N00986\tMutation-caused aberrant Htt to VGCC-Ca2+ -apoptotic pathway\n",
      "N00987\tMutation-caused aberrant Htt to transport of calcium\n",
      "N00988\tElectron transfer in Complex II\n",
      "N00989\tMutation-caused aberrant Htt to electron transfer in Complex II\n",
      "N00990\tElectron transfer in Complex III\n",
      "N00991\tMutation-caused aberrant Htt to electron transfer in Complex III\n",
      "N00992\tMutation-caused aberrant Htt to TNF-JNK signaling pathway\n",
      "N00993\tMutation-caused aberrant Htt to autophagy-vesicle nucleation\n",
      "N00994\tAGE-RAGE signaling pathway\n",
      "N00995\tElectron transfer in Complex I\n",
      "N00996\tMutation-caused aberrant Abeta to AGE-RAGE signaling pathway\n",
      "N00997\tMutation-caused aberrant Abeta to electron transfer in Complex I\n",
      "N00998\tElectron transfer in Complex IV\n",
      "N00999\tMutation-caused aberrant Abeta to electron transfer in Complex IV\n",
      "N01000\tmAChR-Ca2+ -apoptotic pathway\n",
      "N01001\tMutation-caused aberrant Abeta to mAchR-Ca2+ -apoptotic pathway\n",
      "N01002\tMutation-caused aberrant Abeta to mGluR5-Ca2+ -apoptotic pathway\n",
      "N01003\tMutation-caused aberrant Abeta to transport of calcium\n",
      "N01004\tMutation-caused aberrant Abeta to VGCC-Ca2+ -apoptotic pathway\n",
      "N01005\tMutation-caused aberrant Abeta to crosstalk between extrinsic and intrinsic apoptotic pathways\n",
      "N01006\tMutation-caused aberrant Abeta to VGCC-Ca2+ -apoptotic pathway\n",
      "N01007\tMutation-caused aberrant PSEN to mGluR5-Ca2+ -apoptotic pathway\n",
      "N01008\tMutation-caused aberrant PSEN1 to mGluR5-Ca2+ -apoptotic pathway\n",
      "N01009\tPERK-ATF4 signaling pathway\n",
      "N01010\tMutation-caused aberrant PSEN1 to PERK-ATF4 signaling pathway\n",
      "N01011\tIRE1a-XBP1 signaling pathway\n",
      "N01012\tMutation-caused aberrant PSEN1 to IRE1a-XBP1 signaling pathway\n",
      "N01013\tIRE1a-JNK signaling pathway\n",
      "N01014\tMutation-caused aberrant Abeta to IRE1a-JNK signaling pathway\n",
      "N01015\tATF6-mediated transcription\n",
      "N01016\tMutation-caused aberrant PSEN1 to ATF6-mediated transcription\n",
      "N01017\tMutation-caused aberrant PSEN1 to anterograde axonal transport\n",
      "N01018\tMutation-caused aberrant Abeta to anterograde axonal transport\n",
      "N01019\tParkin-mediated ubiquitination\n",
      "N01020\tMutation-inactivated PRKN to Parkin-mediated ubiquitination\n",
      "N01021\tParkin-mediated ubiquitination\n",
      "N01022\tMutation-inactivated PRKN to Parkin-mediated ubiquitination\n",
      "N01023\tParkin-mediated ubiquitination\n",
      "N01024\tMutation-inactivated PRKN to Parkin-mediated ubiquitination\n",
      "N01025\tParkin-mediated ubiquitination\n",
      "N01026\tMutation-inactivated PRKN to Parkin-mediated ubiquitination\n",
      "N01027\tUCHL1-mediated hydrolysis\n",
      "N01028\tMutation-inactivated UCHL1 to UCHL1-mediated hydrolysis\n",
      "N01029\t26S proteasome-mediated protein degradation\n",
      "N01030\tMutation-caused aberrant SNCA to 26S proteasome-mediated protein degradation\n",
      "N01031\tMutation-caused aberrant SNCA to VGCC-Ca2+ -apoptotic pathway\n",
      "N01032\tMutation-inactivated PRKN to mGluR1 signaling pathway\n",
      "N01033\tMutation-caused aberrant SNCA to ATF6-mediated transcription\n",
      "N01034\tMutation-caused aberrant SNCA to IRE1a-XBP1 signaling pathway\n",
      "N01035\tMutation-caused aberrant SNCA to PERK-ATF4 signaling pathway\n",
      "N01037\tMutation-caused aberrant SNCA to L-DOPA generation\n",
      "N01039\tMutation-inactivated PRKN to DOPAL generation\n",
      "N01040\tTransport of dopamine to synaptic vesicle\n",
      "N01041\tMutation-caused aberrant SNCA to transport of dopamine\n",
      "N01042\tMutation-caused aberrant SNCA to electron transfer in Complex I\n",
      "N01043\tMutation-inactivated PINK1 to electron transfer in Complex I\n",
      "N01044\tMPP+ to electron transfer in Complex I\n",
      "N01045\tRotenone to electron transfer in Complex I\n",
      "N01046\tManeb to electron transfer in Complex III\n",
      "N01047\tMutation-activated LRRK2 to intrinsic apoptotic pathway\n",
      "N01048\tMutation-inactivated PINK1 to intrinsic apoptotic pathway\n",
      "N01049\tMutation-inactivated PRKN to intrinsic apoptotic pathway\n",
      "N01050\tMutation-inactivated PINK1 to intrinsic apoptotic pathway\n",
      "N01051\tMutation-inactivated DJ1 to intrinsic apoptotic pathway\n",
      "N01052\tPINK1-Parkin-mediated MFN2 degradation\n",
      "N01053\tMutation-inactivated PINK1 to PINK1-Parkin-mediated MFN2 degradation\n",
      "N01054\tMutation-inactivated PRKN to PINK1-Parkin-mediated MFN2 degradation\n",
      "N01055\tMutation-caused aberrant SNCA to anterograde axonal transport\n",
      "N01056\tFAS-JNK signaling pathway\n",
      "N01057\tMutation-inactivated DJ1 to FAS-JNK signaling patwhay\n",
      "N01058\tMutation-inactivated DJ1 to to p53-mediated transcription\n",
      "N01059\tMutation-inactivated DJ1 to KEAP1-NRF2 signaling pathway\n",
      "N01060\tMutation-caused aberrant Abeta to 26S proteasome-mediated protein degradation\n",
      "N01061\tMutation-caused aberrant Htt to 26S proteasome-mediated protein degradation\n",
      "N01062\tMutation-activated MET to RAS-ERK signaling pathway\n",
      "N01063\tMutation-activated MET to PI3K signaling pathway\n",
      "N01064\tMutation-activated RET to RAS-ERK signaling pathway\n",
      "N01065\tMutation-activated RET to PI3K signaling pathway\n",
      "N01066\tARNO-ARF-ACTB_G signaling pathway\n",
      "N01067\tShigella IpgD to ARNO-ARF-ACTB_G signaling pathway\n",
      "N01068\tITGA/B-FAK-RAC signaling pathway\n",
      "N01069\tShigella IpgB1 to ITGA/B-FAK-RAC signaling pathway\n",
      "N01070\tITGA/B-FAK-CDC42 signaling pathway\n",
      "N01071\tShigella IpgB1 to ITGA/B-FAK-CDC42 signaling pathway\n",
      "N01072\tITGA/B-RhoGEF-RhoA signaling pathway\n",
      "N01073\tShigella IpgB2 to ITGA/B-RhoGEF-RhoA signaling pathway\n",
      "N01074\tShigella IpaA to ITGA/B-RhoGEF-RhoA signaling pathway\n",
      "N01075\tShigella IcsB to ITGA/B-RhoGEF-RhoA signaling pathway\n",
      "N01076\tShigella IcsB to ITGA/B-FAK-CDC42 signaling pathway\n",
      "N01077\tShigella IcsB to ITGA/B-FAK-RAC signaling pathway\n",
      "N01078\tEGF-EGFR-Actin signaling pathway\n",
      "N01079\tShigella IpaC to Actin signaling pathway\n",
      "N01080\tITGA/B-TALIN/VINCULIN signaling pathway\n",
      "N01081\tShigella IpaB/C/D to ITGA/B-TALIN/VINCULIN signaling pathway\n",
      "N01082\tShigella IpaA to ITGA/B-TALIN/VINCULIN signaling pathway\n",
      "N01083\tShigella OspE to ITGA/B-TALIN/VINCULIN signaling pathway\n",
      "N01084\tEscherichia EspG to ARNO-ARF-ACTB/G signaling pathway\n",
      "N01085\tEscherichia EspG to ARNO-ARF-ACTB/G signaling pathway\n",
      "N01086\tEscherichia EspT to RAC signaling pathway\n",
      "N01087\tEscherichia EspW to RAC signaling pathway\n",
      "N01088\tEscherichia EspH to LPA-GNA12/13-RhoA signaling pathway\n",
      "N01089\tEscherichia EspM to LPA-GNA12/13-Rho signaling pathway\n",
      "N01090\tIGG-FCGR-RAC signaling pathway\n",
      "N01091\tEscherichia EspJ to IGG-FCGR-RAC signaling pathway\n",
      "N01092\tEscherichia Eae/Tir to Actin signaling pathway\n",
      "N01093\tEscherichia EspJ/Tir to Actin signaling pathway\n",
      "N01094\tEscherichia Eae/Tir/TccP to Actin signaling pathway\n",
      "N01095\tEscherichia Map to LPA-GNA12/13-RhoA signaling pathway\n",
      "N01096\tEscherichia Map to CDC42 signaling pathway\n",
      "N01097\tLPA-GNA12/13-RhoA signaling pathway\n",
      "N01098\tYersinia YopT to ITGA/B-RhoGEF-RhoA signaling pathway\n",
      "N01099\tYersinia YopE to RhoA signaling pathway\n",
      "N01100\tYersinia YopE to ITGA/B-RHOG-RAC signaling pathway\n",
      "N01101\tYersinia YopT to ITGA/B-RHOG-RAC signaling pathway\n",
      "N01102\tYersinia YopE to ITGA/B-RHOG-RAC signaling pathway\n",
      "N01103\tYersinia YpkA to IGG-FCGR-RAC signaling pathway\n",
      "N01104\tLPA-GNAQ/11-RhoA signaling pathway\n",
      "N01105\tYersinia YpkA to LPA-GNAQ-RhoA signaling pathway\n",
      "N01106\tTCR-PLCG-ITPR signaling pathway\n",
      "N01107\tYersinia YopH to TCR-NFAT signaling pathway\n",
      "N01108\tYersinia YopH to TCR-NFAT signaling pathway\n",
      "N01109\tYersinia YopH to ITGA/B-FAK-RAC signaling pathway\n",
      "N01110\tYersinia YopH to ITGA/B-FAK-RAC signaling pathway\n",
      "N01111\tYersinia YopH to ITGA/B-FAK-RAC signaling pathway\n",
      "N01112\tSalmonella SopE/E2 to NOD-NFKB signaling pathway\n",
      "N01113\tSalmonella SseK1 to TNF-NFKB signaling pathway\n",
      "N01114\tSalmonella SseK3 to TNF-NFKB signaling pathway\n",
      "N01116\tSalmonella SseL to TNF-NFKB signaling pathway\n",
      "N01117\tSalmonella GogB to TNF-NFKB signaling pathway\n",
      "N01118\tSalmonella SpvD to TNF-NFKB signaling pathway\n",
      "N01119\tRAC/CDC42-PAK-ERK signaling pathway\n",
      "N01120\tSalmonella SptP to RAC/CDC42-PAK-ERK signaling pathway\n",
      "N01121\tSalmonella SpvC to ERK signaling pathway\n",
      "N01122\tSalmonella PipA/GogA/GtgA to TNF-NFKB signaling pathway\n",
      "N01123\tSalmonella AvrA to TNF-NFKB signaling pathway\n",
      "N01124\tSalmonella AvrA to beta-catenin signaling pathway\n",
      "N01125\tSalmonella AvrA to TNF-JNK signaling pathway\n",
      "N01126\tSalmonella SipB to Inflammasome signaling pathway\n",
      "N01127\tSalmonella SopE to Inflammasome signaling pathway\n",
      "N01128\tSalmonella SopE to RAC signaling pathway\n",
      "N01129\tSalmonella SopB to ARNO-ARF-ACTB/G signaling pathway\n",
      "N01130\tSalmonella SopB to RhoA signaling pathway\n",
      "N01131\tSalmonella SopE/E2 to RhoA signaling pathway\n",
      "N01132\tSalmonella SopE/E2 to RhoG signaling pathway\n",
      "N01133\tSalmonella SopB to RhoG signaling pathway\n",
      "N01134\tSalmonella SopB to CDC42 signaling pathway\n",
      "N01135\tMutation-caused aberrant SOD1 to intrinsic apoptotic pathway\n",
      "N01136\tMutation-caused aberrant TDP43 to electron transfer in Complex I\n",
      "N01137\tPINK-Parkin-mediated autophagosome formation\n",
      "N01138\tMutation-inactivated OPTN to PINK-Parkin-mediated autophagosome formation\n",
      "N01139\tMutation-inactivated p62 to PINK-Parkin-mediated autophagosome formation\n",
      "N01140\tTBK1-mediated autophagosome formation\n",
      "N01141\tMutation-inactivated TBK1 to TBK1-mediated autophagosome formation\n",
      "N01142\tC9orf72-mediated autophagy initiation\n",
      "N01143\tMutation-inactivated C9orf72 to C9orf72-mediated autophagy initiation\n",
      "N01144\tMutation-caused aberrant SOD1 to 26S proteasome-mediated protein degradation\n",
      "N01145\tMutation-inactivated VCP to 26S proteasome-mediated protein degradation\n",
      "N01146\tMutation-inactivated UBQLN2 to 26S proteasome-mediated protein degradation\n",
      "N01147\tMutation-caused aberrant SOD1 to ATF6-mediated transcription\n",
      "N01148\tMutation-caused aberrant SOD1 to IRE1a-XBP1 signaling pathway\n",
      "N01149\tMutation-caused aberrant SOD1 to PERK-ATF4 signaling pathway\n",
      "N01150\tMutation-inactivated VAPB to ATF6-mediated transcription\n",
      "N01151\tMutation-inactivated SIGMAR1 to Ca2+ -apoptotic pathway\n",
      "N01152\tNuclear export of mRNA\n",
      "N01153\tMutation-caused aberrant GLE1 to nuclear export of mRNA\n",
      "N01154\tTDP-43-regulated splicing\n",
      "N01155\tMutation-caused aberrant TDP43 to TDP-43-regulated splicing\n",
      "N01156\tFUS-regulated splicing\n",
      "N01157\tMutation-caused aberrant FUS to FUS-regulated splicing\n",
      "N01158\tMutation-caused aberrant DCTN1 to retrograde axonal transport\n",
      "N01159\tMutation-caused aberrant TUBA4A to retrograde axonal transport\n",
      "N01160\tMutation-caused aberrant SOD1 to retrograde axonal transport\n",
      "N01161\tActin polymerization\n",
      "N01162\tMutation-caused aberrant PFN1 to actin polymerization\n",
      "N01163\tNRG-ERBB4-PI3K signaling pathway\n",
      "N01164\tMutation-inactivated ERBB4 to NRG-ERBB4-PI3K signaling pathway\n",
      "N01165\tPDL/PD1-SHP-PI3K signaling pathway\n",
      "N01197\tScrapie conformation PrPSc to 26S proteasome-mediated protein degradation\n",
      "N01198\tScrapie conformation PrPSc to PERK-ATF4 signaling pathway\n",
      "N01199\tScrapie conformation PrPSc to mGluR5-Ca2+ -apoptotic pathway\n",
      "N01200\tScrapie conformation PrPSc to transport of calcium\n",
      "N01201\tScrapie conformation PrPSc to VGCC-Ca2+ -apoptotic pathway\n",
      "N01202\tOligomeric conformation PrPc to anterograde axonal transport\n",
      "N01203\tScrapie conformation PrPSc to Notch singling pathway\n",
      "N01204\tPRNP-PI3K-NOX2 signaling pathway\n",
      "N01205\tScrapie conformation PrPSc to PRNP-PI3K-NOX2 signaling pathway\n",
      "N01282\tRegulation of CAV1.1\n",
      "N01283\tShigella OspF to TLR2/4-MAPK signaling pathway\n",
      "N01284\tShigella IcsP to Autophagy-vesicle elongation\n",
      "N01285\tMicrotubule-RHOA signaling pathway\n",
      "N01286\tEscherichia EspG to Microtubule-RHOA signaling pathway\n",
      "N01287\tTight junction-Actin signaling pathway\n",
      "N01288\tEscherichia EspF to Tight junction-Actin signaling pathway\n",
      "N01289\tCOPII vesicle formation\n",
      "N01290\tEscherichia NleA to COPII vesicle formation\n",
      "N01291\tTRAPPI-RAB1 signaling pathway\n",
      "N01292\tEscherichia EspG to RAB1 signaling pathway\n",
      "N01293\tCOPI vesicle formation\n",
      "N01294\tEscherichia NleF to COPI vesicle formation\n",
      "N01295\tRab7-regulated microtubule minus-end directed transport\n",
      "N01296\tSalmonella SopD2 to Rab7-regulated microtubule minus-end directed transport\n",
      "N01297\tArl8-regulated microtubule plus-end directed transport\n",
      "N01298\tSalmonella SifA to microtubule plus-end directed transport\n",
      "N01299\tSalmonella PipB2 to microtubule plus-end directed transport\n",
      "N01300\tTethering of late endosomes and lysosomes\n",
      "N01301\tSalmonella SifA to Tethering of late endosomes and lysosomes\n",
      "N01302\tEarly endosomal fusion\n",
      "N01303\tSalmonella SopB to Early endosomal fusion\n",
      "N01304\tANXA2-S100A10-regulated actin cytoskeleton\n",
      "N01305\tSalmonella SopB to ANXA2-S100A10-regulated actin cytoskeleton\n",
      "N01306\tAngII-AT1R-NOX2 signaling pathway\n",
      "N01307\tSARS-CoV-2 S to AngII-AT1R-NOX2 signaling pathway\n",
      "N01308\tMDA5-IRF7/3 signaling pathway\n",
      "N01309\tSARS-CoV-2 nsp3 to MDA5-IRF7/3 signaling pathway\n",
      "N01310\tSARS-CoV-2 nsp13 to RIG-I-IRF7/3 signaling pathway\n",
      "N01312\tSARS-CoV-2 S to lectin pathway of complement cascade\n",
      "N01314\tSARS-CoV-2 S to classical pathway of complement cascade\n",
      "N01315\tLectin pathway of coagulation cascade, prothrombin to thrombin\n",
      "N01316\tSARS-CoV-2 S/N to lectin pathway of coagulation cascade\n",
      "N01317\tTranslation initiation\n",
      "N01318\tSARS-CoV-2 nsp1 to translation initiation\n",
      "N01319\tSARS-CoV-2 nsp6 and ORF6 to RIG-I-IRF7/3 signaling pathway\n",
      "N01320\tSARS-CoV-2 nsp3 to RIG-I-IRF7/3 signaling pathway\n",
      "N01321\tSARS-CoV-2 nsp1/6/13, ORF3a/6/7b and M to IFN signaling pathway\n",
      "N01322\tSARS-CoV-2 nsp6/13 and ORF7a/7b to IFN signaling pathway\n",
      "N01336\tCHRNA7-E2F signaling pathway\n",
      "N01337\tNNK/NNN to CHRNA7-E2F signaling pathway\n",
      "N01338\tACH-CHRN-PI3K signaling pathway\n",
      "N01339\tNNK/NNN to PI3K signaling pathway\n",
      "N01340\tACH-CHRN-JAK-STAT signaling pathway\n",
      "N01341\tNNK/NNN to Jak-STAT signaling pathway\n",
      "N01342\tNicotine to Jak-STAT signaling pathway\n",
      "N01343\tACH-CHRN-RAS-ERK signaling pathway\n",
      "N01344\tNNK/NNN to RAS-ERK signaling pathway\n",
      "N01345\tEP/NE-ADRB-cAMP signaling pathway\n",
      "N01346\tNicotine/NNK to cAMP signaling pathway\n",
      "N01347\tEP/NE-ADRB-PI3K signaling pathway\n",
      "N01348\tNicotine/NNK to PI3K signaling pathway\n",
      "N01349\tACH-CHRN-PI3K signaling pathway\n",
      "N01350\tNNK/NNN to PI3K signaling pathway\n",
      "N01351\tE2-ER-RAS-ERK signaling pathway\n",
      "N01352\tBPA to RAS-ERK signaling pathway\n",
      "N01353\tE2 to RAS-ERK signaling pathway\n",
      "N01354\tBPA to RAS-ERK signaling pathway\n",
      "N01355\tArsenic to PI3K signaling pathway\n",
      "N01356\tMembrane-initiated progesterone signaling pathway\n",
      "N01357\tP4/MPA to membrane-initiated progesterone signaling pathway\n",
      "N01358\tP4-PR-PI3K signaling pathway\n",
      "N01359\tP4/MPA to PR-PI3K signaling pathway\n",
      "N01360\tP4-PR-RAS-ERK signaling pathway\n",
      "N01361\tP4/MPA to PR-RAS-ERK signaling pathway\n",
      "N01362\tNuclear-initiated progesterone signaling pathway\n",
      "N01363\tP4/MPA to nuclear-initiated progesterone signaling pathway\n",
      "N01364\tE2 to nuclear-initiated estrogen signaling pathway\n",
      "N01365\tTCDD to Ahr signaling pathway\n",
      "N01366\tBaP to Ahr signaling pathway\n",
      "N01367\tPCB to Ahr signaling pathway\n",
      "N01368\tHCB to Ahr signaling pathway\n",
      "N01369\t4-ABP to DNA adducts\n",
      "N01370\tPhIP to DNA adducts\n",
      "N01371\tPhIP to DNA adducts\n",
      "N01372\tIQ to DNA adducts\n",
      "N01373\tMeIQx to DNA adducts\n",
      "N01374\tBaP to DNA adducts\n",
      "N01375\tDMBA to DNA adducts\n",
      "N01376\tMelphalan to DNA adducts/cross-links\n",
      "N01377\tThiotepa to DNA adducts/cross-links\n",
      "N01378\tAFB1 to DNA adducts\n",
      "N01379\tNNK to DNA adducts\n",
      "N01380\tNNK to DNA adducts\n",
      "N01381\tNNK to DNA adducts\n",
      "N01382\tNNK to DNA adducts\n",
      "N01383\tNDMA to DNA adducts\n",
      "N01384\tEO to DNA adducts\n",
      "N01385\tVC to DNA adducts\n",
      "N01386\tDCE to DNA adducts\n",
      "N01387\tSM to DNA adducts/cross-links\n",
      "N01388\tSOD/Cat-mediated ROS neutralization\n",
      "N01389\tLead to SOD/Cat-mediated ROS neutralization\n",
      "N01390\tp,p'-DDT to SOD/Cat-mediated ROS neutralization\n",
      "N01391\tLead to SOD/Cat-mediated ROS neutralization\n",
      "N01392\tArsenic to electron transfer in complex II\n",
      "N01393\tArsenic to electron transfer in complex II\n",
      "N01394\tArsenic to electron transfer in complex IV\n",
      "N01395\tCadmium to electron transfer in complex III\n",
      "N01396\t4-Aminobiphenyl to CYP-mediated metabolism\n",
      "N01397\t4-Aminobiphenyl to CYP-mediated metabolism\n",
      "N01398\tPentachlorophenol to CYP-mediated metabolism\n",
      "N01399\tBenzene to CYP-mediated metabolism\n",
      "N01400\tBenzene to CYP-mediated metabolism\n",
      "N01401\tBenzo[a]pyrenre to CYP-mediated metabolism\n",
      "N01402\tManganese to electron transfer in Complex II\n",
      "N01403\tZn to anterograde axonal transport\n",
      "N01404\t17beta-estradiol to CYP-mediated metabolism\n",
      "N01405\t17beta-estradiol to CYP-mediated metabolism\n",
      "N01406\tEthanol to CYP-mediated metabolism\n",
      "N01407\tMetals to JNK signaling pathway\n",
      "N01408\tMetals to RAS-ERK signaling pathway\n",
      "N01409\tMetals to PI3K signaling pathway\n",
      "N01410\tMetals to NFKB signaling pathway\n",
      "N01411\tMetals to NFKB signaling pathway\n",
      "N01412\tMetals to HTF-1 signaling pathway\n",
      "N01413\tMetals to KEAP1-NRF2 signalig pathway\n",
      "N01414\tIron to anterograde axonal transport\n",
      "N01415\tNEP-mediated Abeta degradation\n",
      "N01416\tMercury to NEP-mediated Abeta degradation\n",
      "N01417\tParaquat to FAS-JNK signaling pathway\n",
      "N01418\tPurine salvage pathway, adenine to AMP\n",
      "N01419\tAPRT deficiency in purine salvage pathway\n",
      "N01420\tAPRT deficiency in adenine metabolism\n",
      "N01421\tPurine salvage pathway, hypoxanthine/guanine to IMP/GMP\n",
      "N01422\tHPRT1 deficiency in purine salvage pathway\n",
      "N01423\tHPRT1 deficiency in hypoxanthine metabolism\n",
      "N01424\tHPRT1 deficiency in guanine metabolism\n",
      "N01425\tGlobal genome NER\n",
      "N01426\tBMP9/10 signaling pathway\n",
      "N01427\tWNT5A-ROR signaling pathway\n",
      "N01428\tBMP signaling pathway, BMP antagonist\n",
      "N01429\tCytosolic Ca2+ removal, PMCA\n",
      "N01430\tTranscription-coupled NER\n",
      "N01431\tCore NER reaction\n",
      "N01432\tMismatch repair\n",
      "N01433\tBase excision and strand cleavage by monofunctional glycosylase\n",
      "N01434\tBase excision and strand cleavage by bifunctional glycosylase\n",
      "N01435\tBase excision and strand cleavage by NEIL glycosylase\n",
      "N01436\tLong patch BER\n",
      "N01437\tShort patch BER\n",
      "N01438\tMitochondrial BER\n",
      "N01439\tDouble-strand break signaling\n",
      "N01440\tWnt signaling modulation, LGR/RSPO\n",
      "N01441\tWnt signaling modulation, SOST/LRP4\n",
      "N01442\tWnt signaling modulation, Wnt inhibitor\n",
      "N01443\tWnt signaling modulation, Wnt acylation\n",
      "N01444\tNXN mutation to WNT5A-ROR signaling pathway\n",
      "N01445\tNon-homologous end-joining\n",
      "N01446\tDNA end resection and RPA loading\n",
      "N01447\tDouble Holliday junction dissolution\n",
      "N01448\tDouble Holliday junction resolution\n",
      "N01449\tSynthesis-dependent strand annealing\n",
      "N01450\tBreak induced replication\n",
      "N01451\tATR signaling\n",
      "N01452\tHomologous recombination\n",
      "N01453\tBMP signaling pathway\n",
      "N01454\tAMH signaling pathway\n",
      "N01455\tBMP15 signaling pathway\n",
      "N01456\tActivin signaling pathway\n",
      "N01457\tMyostatin signaling pathway\n",
      "N01458\tBMP-HAMP signaling pathway\n",
      "N01459\tNodal signaling pathway\n",
      "N01460\tPlasmin mediated activation of latent TGF-beta\n",
      "N01461\tBMP-HAMP signaling pathway, auxiliary factor\n",
      "N01462\tBMP9/10 signaling pathway, BMP9/10 coreceptor\n",
      "N01464\tFanconi anemia pathway\n",
      "N01465\tLesion bypass by TLS and DSB formation\n",
      "N01466\tHomologous recombination in ICLR\n",
      "N01467\tV(D)J recombination\n",
      "N01468\tDNA replication licensing\n",
      "N01469\tCdt1 downregulation\n",
      "N01470\tPre-IC formation\n",
      "N01471\tOrigin unwinding and elongation\n",
      "N01472\tOkazaki fragment maturation\n",
      "N01473\tDNA replication termination\n",
      "N01474\tTRAIP-dependent replisome disassembly\n",
      "N01475\tTelomerase RNA maturation\n",
      "N01476\tAssembly and trafficking of telomerase\n",
      "N01477\tTelomere elongation\n",
      "N01478\tNotch proteolytic activation\n",
      "N01479\tNotch ligand ubiquitylation\n",
      "N01480\tNotch-HES7 signaling\n",
      "N01481\tNotch-MESP2 signaling\n",
      "N01482\tCohesin loading\n",
      "N01483\tCohesin acetylation\n",
      "N01484\tEstablishment of cohesion\n",
      "N01485\tCohesin dissociation in prophase\n",
      "N01486\tCohesin dissociation in anaphase\n",
      "N01487\tClassical pathway of complement cascade, C4/C2 to C3 convertase formation\n",
      "N01489\tClassical/Lectin pathway of complement cascade, C5 convertase formation\n",
      "N01490\tCommon pathway of complement cascade, MAC formation\n",
      "N01491\tLectin pathway of complement cascade, C4/C2 to C3 convertase formation\n",
      "N01493\tAlternative pathway of complement cascade, C3 convertase formation\n",
      "N01494\tAlternative pathway of complement cascade, C3/5 convertase formation\n",
      "N01495\tClassical/Lectin pathway of complement cascade, C4b breakdown\n",
      "N01496\tAlternative pathway of complement cascade, C3b breakdown\n",
      "N01497\tCondensin loading\n",
      "N01498\tInhibition of condensin II association\n",
      "N01499\tModifying of condensin II subunits\n",
      "N01500\tModifying of condensin I subunits\n",
      "N01501\tInactivation of condensin I\n",
      "N01502\tLectin pathway of coagulation cascade, fibrinogen to fibrin\n",
      "N01503\tExtrinsic pathway of coagulation cascade, F7 activation\n",
      "N01504\tRegulation of complement cascade, CFHR\n",
      "N01505\tRegulation of complement cascade, MAC inhibition\n",
      "N01506\tIntrinsic pathway of coagulation cascade, F12 activation\n",
      "N01507\tIntrinsic pathway of coagulation cascade, F11 activation\n",
      "N01508\tIntrinsic pathway of coagulation cascade, F9 activation\n",
      "N01509\tIntrinsic pathway of coagulation cascade, F8 activation\n",
      "N01510\tCommon pathway of coagulation cascade, F10 activation\n",
      "N01511\tCommon pathway of coagulation cascade, F5 activation\n",
      "N01512\tCommon pathway of coagulation cascade, prothrombin activation\n",
      "N01513\tCommon pathway of coagulation cascade, fibrinogen to fibrin\n",
      "N01514\tCommon pathway of coagulation cascade, F13 activation\n",
      "N01515\tRegulation of coagulation cascade, protein C system\n",
      "N01516\tKallikrein-kinin system, prekallikrein activation\n",
      "N01517\tKallikrein-kinin system, HMWK to bradykinin\n",
      "N01518\tFibrinolytic system\n",
      "N01519\tRegulation of coagulation cascade, AT3\n",
      "N01520\tRegulation of fibrinolytic system, C1INH\n",
      "N01521\tRegulation of coagulation cascade, HCF2\n",
      "N01522\tRegulation of fibrinolytic system, AAP\n",
      "N01523\tRegulation of fibrinolytic system, AAT\n",
      "N01524\tRegulation of fibrinolytic system, PAI\n",
      "N01525\tOrganization of the inner kinetochore\n",
      "N01526\tOrganization of the outer kinetochore\n",
      "N01527\tKSHV Kaposin to classical/Lectin pathway of complement cascade, C4b breakdown\n",
      "N01528\tKSHV Kaposin to alternative pathway of complement cascade, C3b breakdown\n",
      "N01529\tRecruitment and formation of the MCC\n",
      "N01530\tDopamine metabolism\n",
      "N01531\tCENPE interaction with NDC80 complex\n",
      "N01532\tKinetochore targeting of MAD1-MAD2\n",
      "N01533\tDisassembly of MCC\n",
      "N01534\tDynein recruitment to the kinetochore\n",
      "N01535\tKinetochore microtubule attachment\n",
      "N01536\tDephosphorylation of kinetochore\n",
      "N01537\tHedgehog signaling pathway, HH ligand secretion\n",
      "N01538\tHedgehog signaling pathway, PTCH coreceptor\n",
      "N01539\tRAD51 -dsDNA destabilization\n",
      "N01540\tEstrogen biosynthesis\n",
      "N01541\tTestosterone biosynthesis\n",
      "N01542\tPKA holoenzyme\n",
      "N01543\tTLR7/8/9-IRF5 signaling pathway\n",
      "N01544\tMicrotubule nucleation\n",
      "N01545\tRegulation of TNF-NFKB signaling pathway, LUBAC-mediated linear ubiquitination\n",
      "N01546\tRegulation of TNF-NFKB signaling pathway, OTULIN/TNFAIP3-mediated deubiquitination\n",
      "N01547\tKinetochore fiber organization\n",
      "N01548\tKinetochore-fiber stabilization\n",
      "N01549\tBranching microtubule nucleation\n",
      "N01550\tAdrenaline metabolism\n",
      "N01551\tSerotonin metabolism\n",
      "N01552\tEumelanin biosynthesis\n",
      "N01553\tPromotion of microtubule growth\n",
      "N01554\tIL2 family to Jak-STAT signaling pathway\n",
      "N01555\tHormone-like-cytokine to Jak-STAT signaling pathway\n",
      "N01556\tIL6 family to Jak-STAT signaling pathway\n",
      "N01557\tIL12/23 to Jak-STAT signaling pathway\n",
      "N01558\tType I interferon to Jak-STAT signaling pathway\n",
      "N01559\tType II interferon to Jak-STAT signaling pathway\n",
      "N01560\tRegulation of type I interferon to Jak-STAT signaling pathway, USP18\n",
      "N01561\tMicrotubule depolymerization\n",
      "N01562\tMicrotubule depolymerization at the minus ends\n",
      "N01563\tInhibition of Kif2A\n",
      "N01564\tPost-translational modifications of RIG-I and MDA5\n",
      "N01565\tAdenosine-to-inosine RNA editing by ADAR\n",
      "N01566\tTLR5-NFKB signaling pathway\n",
      "N01567\tNLRP1 inflammasome signaling pathway\n",
      "N01568\tRegulation of NLRP3 inflammasome signaling pathway, NLRP3 inhibition\n",
      "N01569\tNALP12 inflammasome signaling pathway\n",
      "N01570\tRegulation of Pyrin inflammasome signaling pathway, PSTPIP1\n",
      "N01571\tDNA degradation by extracellular/endolysosomal DNAse\n",
      "N01572\tRNASEH2-mediated RNA degradation in RNA-DNA hybrids\n",
      "N01573\tSAMHD1-mediated dNTP degradation\n",
      "N01574\tGlycosaminoglycan biosynthesis, linkage tetrasaccharide\n",
      "N01575\tTSC1/2-mTORC1 signaling pathway\n",
      "N01576\tSTRAD/STK11- TSC signaling pathway\n",
      "N01577\tGene silencing by methylation of H3K27 and ubiquitination of H2AK119\n",
      "N01578\tGATOR1-mTORC1 signaling pathway\n",
      "N01579\tCD80/CD86-CTLA4-PP2A signaling pathway\n",
      "N01580\tChondroitin sulfate biosynthesis\n",
      "N01581\tDermatan sulfate biosynthesis\n",
      "N01582\tHeparan sulfate biosynthesis\n",
      "N01583\tRegulation of extrinsic apoptotic pathway, XIAP\n",
      "N01584\tFLCN-mTORC1 signaling pathway\n",
      "N01585\tDeubiquitination of H2AK119\n",
      "N01586\tActivation of PRC2.2 by ubiquitination of H2AK119\n",
      "N01587\tFe-TF transport\n",
      "N01588\tFe3+ Ferritin transport\n",
      "N01589\tGlutathione biosynthesis\n",
      "N01590\tArachidonate/Adrenic acid metabolism\n",
      "N01591\tFe2+ Ferroportin transport\n",
      "N01592\tGF-RTK-RAS-ERK signaling pathway\n",
      "N01593\tRegulation of GF-RTK-RAS-ERK signaling, PTP\n",
      "N01594\tMLK-JNK signaling pathway\n",
      "N01595\tRegulation of GF-RTK-RAS-ERK signaling pathway, adaptor proteins\n",
      "N01596\tRegulation of GF-RTK-RAS-ERK signaling, RAS ubiquitination by CUL3 complex\n",
      "N01597\tRegulation of GF-RTK-RAS-ERK signaling, SPRED and NF1\n",
      "N01598\tRegulation of GF-RTK-RAS-ERK signaling, MRAS-SHOC2-PP1 holophosphatase\n",
      "N01599\tRegulation of GF-RTK-RAS-ERK signaling, ubiquitination of RTK by CBL\n",
      "N01600\tRegulation of GF-RTK-RAS-ERK signaling, RasGAP\n",
      "N01601\tERK-RSK signaling\n",
      "N01602\tERK-MYC signaling pathway\n",
      "N01603\tPyruvate oxidation\n",
      "N01604\tCitrate cycle, first carbon oxidation\n",
      "N01605\tGluconeogenesis\n",
      "N01606\tGlycolysis\n",
      "N01607\tMethionine degradation\n",
      "N01608\tSerine biosynthesis\n",
      "N01609\tCitrate cycle, second carbon oxidation 1\n",
      "N01610\tDihydrolipoamide dehydrogenase\n",
      "N01611\tGlycine cleavage system\n",
      "N01612\tCreatine pathway\n",
      "N01613\tGlycine cleavage system, Gly to MTHF\n",
      "N01614\tActivation of PRC2.2 by ubiquitination of H2AK119 in germline genes\n",
      "N01615\tTransport of creatine\n",
      "N01616\tDihydrolipoamide dehydrogenase\n",
      "N01617\tCitrate cycle, second carbon oxidation 2\n",
      "N01618\tProline biosynthesis, Orn to Pro\n",
      "N01619\tBranched-chain amino acids degradation 2\n",
      "N01620\tBlocking ubiquitination of H2AK119 by CK2\n",
      "N01621\tTNF-RIPK1/3 signaling pathway\n",
      "N01622\tProline degradation\n",
      "N01623\tSpermine biosynthesis\n",
      "N01624\tCholesterol biosynthesis\n",
      "N01625\tCYLD regulation of RIPK1/3\n",
      "N01626\tCholecalciferol biosynthesis\n",
      "N01627\tAdenosine phosphorylation\n",
      "N01628\tCysteine biosynthesis\n",
      "N01629\tRemethylation, THF to 5-MTHF\n",
      "N01630\tRemethylation, Hcy to Met\n",
      "N01631\tTNFSF10-RIPK1/3 signaling pathway\n",
      "N01632\tFASLG-RIPK1/3 signaling pathway\n",
      "N01633\tTLR3-RIPK3 signaling pathway\n",
      "N01634\tTLR4-RIPK3 signaling pathway\n",
      "N01635\tMevalonate pathway\n",
      "N01636\tLoading of the SMC5-SMC6 complex\n",
      "N01637\tCa2+ entry, Voltage-gated Ca2+ channel\n",
      "N01638\tSkeletal-type VGCC-RYR signaling\n",
      "N01639\tCardiac-type VGCC-RYR signaling\n",
      "N01640\tGPCR-PLCB-ITPR signaling pathway\n",
      "N01641\tRTK-PLCG-ITPR signaling pathway\n",
      "N01642\tCa2+ entry, Ligand-gated Ca2+ channel\n",
      "N01643\tCa2+ entry, Store-operated Ca2+ channel\n",
      "N01644\tLysosomal Ca2+ release\n",
      "N01645\tCytosolic Ca2+ removal, SERCA\n",
      "N01646\tRegulation of SERCA\n",
      "N01647\tCa2+/CAM-CN signaling pathway\n",
      "N01648\tCa2+/CAM-CAMK signaling pathway\n",
      "N01649\tCa2+/CAM-VGCC/RYR signaling pathway\n",
      "N01650\tSQSTM1 regulation of RIPK1/3\n",
      "N01651\tBlood group H (O) antigen type 1 biosynthesis\n",
      "N01652\tBlood group A antigen type 1 biosynthesis\n",
      "N01653\tBlood group B antigen type 1 biosynthesis\n",
      "N01654\tForssman blood group antigen biosynthesis\n",
      "N01655\tCa2+-PLCD-ITPR signaling pathway\n",
      "N01656\tGF-RTK-PI3K signaling pathway\n",
      "N01657\tGPCR-PI3K signaling pathway\n",
      "N01658\tGF-RTK-RAS-PI3K signaling pathway\n",
      "N01659\tLewis b antigen biosynthesis\n",
      "N01660\tLewis a antigen biosynthesis\n",
      "N01661\tSialyl lewis a antigen biosynthesis\n",
      "N01662\tIFN-RIPK1/3 signaling pathway\n",
      "N01663\tCASP8 regulation of RIPK1/3\n",
      "N01664\tBlood group A/B Lewis b antigen biosynthesis\n",
      "N01666\tBlood group H (O) antigen type 2 biosynthesis\n",
      "N01667\tBlood group A antigen type 2 biosynthesis\n",
      "N01668\tBlood group B antigen type 2 biosynthesis\n",
      "N01669\tBlood group A/B Lewis y antigen biosynthesis\n",
      "N01670\tBlood group antigen type 3 biosynthesis\n",
      "N01672\tLewis x antigen biosynthesis\n",
      "N01673\tLewis y antigen biosynthesis\n",
      "N01674\tSialyl lewis x antigen biosynthesis\n",
      "N01675\tSID blood group Sd(a) antigen biosynthesis\n",
      "N01676\tP1 antigen biosynthesis\n",
      "N01677\tPX2 antigen biosynthesis\n",
      "N01678\tIi blood group antigen biosynthesis\n",
      "N01679\tPk and P antigens biosynthesis\n",
      "N01680\tNOR antigen biosynthesis\n",
      "N01682\tBlood group A antigen type 4 (Globo-A) biosynthesis\n",
      "N01683\tOh (Bombay), deficiency of ABH antigens\n",
      "N01684\tLipoic acid biosynthesis\n",
      "N01685\tLysine degradation 1\n",
      "N01686\tLysine degradation 2\n",
      "N01687\tLysine degradation 3\n",
      "N01688\tADRB3-UCP1 signaling pathway\n",
      "N01689\tFUT2 nonsecretor\n",
      "N01690\tBlood group H antigen type 4 (Globo-H) biosynthesis\n",
      "N01691\tmitochondrial complex - UCP1 in Thermogenesis\n",
      "N01695\tBCR-BCAP/CD19-PI3K signaling pathway\n",
      "N01696\tICOSLG/ICOS-PI3K signaling pathway\n",
      "N01697\tP/PX2 negative, Pk positive\n",
      "N01698\tP1/Pk/P/NOR all negative (P null)\n",
      "N01699\tP1 negative\n",
      "N01700\tLewis negative, Le (a-b-)\n",
      "N01701\tTranscriptional activation by acetylation of H3K27\n",
      "N01702\tSd(a) negative\n",
      "N01703\tBlood group B antigen type 4 (Globo-B) biosynthesis\n",
      "N01704\tI negative (adult i)\n",
      "N01708\tINS-AKT signaling pathway\n",
      "N01709\tHydrolysis of globoside\n",
      "N01710\tHydrolysis of ganglioside\n",
      "N01711\tHydrolysis of GA1\n",
      "N01712\tHydrolysis of psychosine\n",
      "N01713\tGM2A activation of HEXA and HEXB\n",
      "N01714\tLoss of GM2A activation\n",
      "N01715\tAutophagy-vesicle nucleation/elongation/maturation, PI3P synthesis by PI3KC3-C1\n",
      "N01716\tAutophagy-vesicle nucleation/elongation/maturation, sequestosome-1-like receptor\n",
      "N01717\tRegulation of autophagy-vesicle nucleation/elongation/maturation, ATXN3\n",
      "N01718\tAutophagy-vesicle nucleation/elongation/maturation, PACER-RUBCN-PI3KC3-C2\n",
      "N01719\tAutophagy-vesicle nucleation/elongation/maturation, E3 ubiquitin-ligase Malin\n",
      "N01720\tAutophagosome and lysosome fusion, trans-SNARE\n",
      "N01721\tAutophagosome and lysosome fusion, tethering factor\n",
      "N01722\tAutophagosome and lysosome fusion, tethering factor, GRASP55\n",
      "N01723\tNAD biosynthesis\n",
      "N01724\tNAD+ phosphorylation\n",
      "N01725\tTetrahydrofolate biosynthesis\n",
      "N01726\tFolate cycle\n",
      "N01727\tHistidine degradation\n",
      "N01729\tHistamine biosynthesis\n",
      "N01741\tCa2+/TRPC3 signaling pathway\n",
      "N01743\tRenin-angiotensin signaling pathway\n",
      "N01746\tCCR/CXCR-GNB/G-PI3K signaling pathway\n",
      "N01747\tFind-me signal (nucleotide)\n",
      "N01748\tFind-me signal (LPC)\n",
      "N01749\tFind-me signal (CX3CL1)\n",
      "N01750\tFind-me signal (S1P)\n",
      "N01751\tMacrophage EPO signaling\n",
      "N01752\tTranslocation of phosphatidylserine to the inner leaflet\n",
      "N01753\tExposure of phosphatidylserine to the outer leaflet\n",
      "N01754\tActivation of XKR8\n",
      "N01756\tPINK-Parkin-independent ubiquitin-mediated mitophagy\n",
      "N01757\tPINK-Parkin-independent ubiquitin-mediated mitophagy, ubiquitin E3 ligase\n",
      "N01758\tDesmosome - Vimentin filaments\n",
      "N01759\tINK1-Parkin-mediated MFN2 degradation, VCP-OPA1\n",
      "N01760\tEndosomal Rab cycles\n",
      "N01761\tActivation of CRK-DOCK-Rac1 pathway\n",
      "N01762\tMERTK-mediated recognition and engulfment\n",
      "N01763\tMEGF10-mediated recognition and engulfment\n",
      "N01764\tCalreticulin-LRP1 mediated recognition and engulfment\n",
      "N01765\tCXCR4-GNAQ-PLCB/G signaling pathway\n",
      "N01766\tCX3CR1-GNAI-AC signaling pathway\n",
      "N01767\tCXCR4-GNAI-Src signaling pathway\n",
      "N01768\tCXCR4-GNA12/13 signaling pathway\n",
      "N01769\tCCR5-GNB/G-PLCB/G signaling pathway\n",
      "N01770\tCCR2-GNB/G-PI3K signaling pathway\n",
      "N01771\tCXCR4-GNB/G signaling pathway\n",
      "N01772\tInduction of the PTGS2\n",
      "N01773\tPTGS2-PGE2-TGFB1 pathway\n",
      "N01774\tERK-DUSP4 negative feedback pathway\n",
      "N01775\tInactivation of CaMKII by inducing SERCA2\n",
      "N01776\tCaMK2-p38-MK2-ALOX5 pathway\n",
      "N01777\tEfferocytosis-induced NAD production\n",
      "N01778\tProduction of IL10 via the Sirtuin1 signaling cascade\n",
      "N01779\tContinual efferocytosis enhanced by the AC-derived arginine and ornithine\n",
      "N01780\tHydrolyzing AC-derived cholesterol esters in the lysosome\n",
      "N01781\tActivation of LXRs by oxysterols\n",
      "N01782\tGHRL-GHSR signaling\n",
      "N01783\tNPPA-NPR1 signaling\n",
      "N01784\tGlucose uptake and lactate release induced by efferocytosis\n",
      "N01785\tDon't eat me signal (CD47)\n",
      "N01786\tDon't eat me signal (CD24)\n",
      "N01787\tNPPC-NPR2 signaling\n",
      "N01788\tADIPOQ-ADIPOR signaling pathway\n",
      "N01789\tBetaine metabolism\n",
      "N01790\tTransport of dopamine into the neuron\n",
      "N01791\tGlycine metabolism, Ser to Gly\n",
      "N01792\tEDN-EDNR signaling pathway\n",
      "N01793\tGAL-GALR signaling pathway\n",
      "N01794\tHCRT-HCRTR signaling pathway\n",
      "N01796\tTNFSF4-TNFRSF4 signaling pathway\n",
      "N01797\tEDA-EDAR signaling pathway\n",
      "N01798\tTNFSF11-TNFRSF11A signaling pathway\n",
      "N01799\tCD70-CD27 signaling pathway\n",
      "N01800\tLEP-LEPR signaling pathway\n",
      "N01801\tTNFSF13-TNFRSF13B/C signaling pathway\n",
      "N01802\tDihydrotestosterone biosynthesis\n",
      "N01804\tIL3 family to Jak-STAT signaling pathway\n",
      "N01806\tCobalamin (Vitamin B12) absorption\n",
      "N01807\tTransfer of cobalamin to the portal blood\n",
      "N01808\tIntracellular processing of cobalamin (reduction)\n",
      "N01809\tMutation-caused epigenetic silencing of MMACHC\n",
      "N01810\tRegulation of MMACHC expression\n",
      "N01811\tMitochondrial adenocylation of cobalamin and loading onto MMUT\n",
      "N01812\tCobalamin loading and activation of MTR\n",
      "N01813\tEnhancement of NIPBL loading\n",
      "N01814\tExtracellular matrix - Basal lamina\n",
      "N01815\tVinculin-talin-integrin macromolecular complex\n",
      "N01816\tCostamere\n",
      "N01817\tMyosin thick filament\n",
      "N01818\tActin thin filament, muscle contraction\n",
      "N01819\tActin thin filament, length regulation\n",
      "N01820\tSarcomere, Z-disc\n",
      "N01821\tSarcomere, M-band\n",
      "N01822\tLinker of nucleoskeleton and cytoskeleton (LINC) complex\n",
      "N01823\tFGF23-NCC/NPT signaling pathway\n",
      "N01824\tSGK1-NHERF1+NPT signaling pathway\n",
      "N01831\tRegulation of VWF-GPIb-IX-V interaction, ADAMTS13\n",
      "N01832\tNTN1-MAP1B axon guidance signaling\n",
      "N01833\tDRAXIN-MAP1B axon guidance signaling\n",
      "N01834\tSEMA3A-MAP1B axon guidance signaling\n",
      "N01835\tSEMA3-CRMP2/MAPT axon guidance signaling\n",
      "N01836\tMicrotubule plus end regulation network\n",
      "N01837\tRegulation of neurite extension, NAV1-TRIO\n",
      "N01838\tRegulation of synaptic plasticity, p140Cap\n",
      "N01839\tSevering of microtubule, SPAST/KATN\n",
      "N01840\tSevering of microtubule, KIF2A\n",
      "N01841\tAnterograde axonal transport, Kinesin-2\n",
      "N01842\tAnterograde axonal/dendrite transport, Kinesin-3\n",
      "N01843\tAnterograde dendrite transport, Kinesin-4\n",
      "N01844\tAnterograde dendrite transport, Kinesin-6\n",
      "N01845\tAnterograde axonal/dendrite transport, Kinesin-12\n",
      "N01846\tRetrograde axonal/dendrite transport, Dynein\n",
      "N01847\tRegulation of dynein-mediated retrograde transport\n",
      "N01848\tMembrane-associated periodic skeleton (MPS)\n",
      "N01849\tAxonal actin ring structure\n",
      "N01850\tMYO5B-mediated vesicle transport\n",
      "N01851\tMYO5A-mediated vesicle transport\n",
      "N01852\tMYO6-mediated vesicle transport\n",
      "N01853\tNeurofilament structure\n",
      "N01854\tNeurofilament regulation, ubiqutination by TRIM2\n",
      "N01855\tNeurofilament regulation, ubiqutination by Gigaxonin\n",
      "N01856\tCytomatrix at the active zone (CAZ) protein complex\n",
      "N01857\tSEMA3A-DCX axon guidance signaling\n",
      "N01858\tEFNB1-MAPT axon guidance signaling\n",
      "N01859\tAnterograde axonal/dendrite transport, Kinesin-1\n",
      "N01860\tGPI-anchor remodeling\n",
      "N01867\tDemethylation of dimethylglycine\n",
      "N01868\tDemethylation of sarcosine\n",
      "N01869\tTHF conversion, THF to 5,10-MTHF\n",
      "N01870\tHIF-2A signaling pathway\n",
      "N01871\tHydroxylation of HIF\n",
      "N01872\tProteasomal degradation of HIF by VHL complex\n",
      "N01873\tVHL mutation to HIF-2 signaling pathway\n",
      "N01874\tNRG-ERBB2/ERBB3 pathway (RAS-ERK signaling)\n",
      "N01875\tNRG-ERBB2/ERBB3 pathway (P13K signaling)\n",
      "N01876\tNRG1 fusion to NRG-ERBB2/ERBB3 pathway\n",
      "N01877\tERBB4 mutation to GF-RTK-PI3K signaling pathway\n",
      "N01878\tGlutamate-GRM-GNAQ/S signaling pathway\n",
      "N01879\tGlutamate-GRM-GNAI/O signaling pathway\n",
      "N01880\tGRM1/5-interacting scaffold proteins\n",
      "N01881\tGRM1/5-interacting partners\n",
      "N01882\tTransport of natrium, KA receptor\n",
      "N01883\tTransport of natrium, AMPAR\n",
      "N01884\tTransport of glutamate, EAAT\n",
      "N01885\tTransport of glutamine, SNAT\n",
      "N01886\tGlutamate transport in synapse\n",
      "N01887\tTransport of chloride, GABAA receptor\n",
      "N01888\tGABA-GABBR-GNAI/O signaling pathway\n",
      "N01889\tGbeta/gamma-KCNJ signaling\n",
      "N01890\tGephyrin-containing complex at inhibitory synapse\n",
      "N01891\tGABAA receptor trafficking\n",
      "N01892\tGABA metabolism and transport in glia\n",
      "N01893\tGlutamine metabolism and transport in neuron\n",
      "N01894\tAcetylcholine-CHRM-GNAQ/11 signaling pathway\n",
      "N01895\tTransport of natrium/calcium, CHRN\n",
      "N01896\tAcetylcholine metabolism and transport in neuron\n",
      "N01897\tDopamine-DRD-GNAQ/S signaling pathway\n",
      "N01898\tDopamine-DRD-GNAI/O signaling pathway\n",
      "N01899\tGbeta/gamma-CACNA signaling\n",
      "N01900\tSerotonin-HTR2-GNAQ/11 signaling pathway\n",
      "N01901\tSerotonin-HTR1/5-GNAI/O signaling pathway\n",
      "N01902\tTransport of serotonin, SLC6A4\n",
      "N01903\tNorepinephrine-ADRA2-GNAI/O signaling pathway\n",
      "N01904\tNorepinephrine-ADRB-GNAS signaling pathway\n",
      "N01905\tAC-PKA-HCN signaling\n",
      "N01906\tGlycine transport in neuron\n",
      "N01907\tTransport of chloride, GLR\n",
      "N01908\tADP/UDP-glucose-P2RY-GNAI/O signaling pathway\n",
      "N01909\tTransport of calcium, P2RX\n",
      "N01910\tAdenine nucleotide conversion\n",
      "N01911\tTransport of ATP, SLC17A9\n",
      "N01912\tHistamine metabolism and transport in neuron\n",
      "N01913\tMelanocortin receptor signaling, MSH\n",
      "N01914\tMelanocortin receptor signaling, AgRP\n",
      "N01915\tTachykinin receptor signaling\n",
      "N01916\tPreprohormone cleavage, POMC\n",
      "N01917\tPreprohormone cleavage, PDYN\n",
      "N01918\tDopamine metabolism in astrocyte\n",
      "N01919\tDopamine/Adrenaline metabolism in presynaptic neuron\n",
      "N01920\tTransport of norepinephrine into neuron\n",
      "nt06031\tCitrate cycle and pyruvate metabolism\n",
      "nt06017\tGlycogen metabolism\n",
      "nt06023\tGalactose degradation\n",
      "nt06020\tbeta-Oxidation in mitochondria\n",
      "nt06021\tbeta-Oxidation in peroxisome\n",
      "nt06034\tCholesterol biosynthesis\n",
      "nt06019\tSteroid hormone biosynthesis\n",
      "nt06022\tBile acid biosynthesis\n",
      "nt06014\tSphingolipid degradation\n",
      "nt06027\tPurine salvage pathway\n",
      "nt06033\tGlycine, serine and arginine metabolism\n",
      "nt06030\tMethionine metabolism\n",
      "nt06024\tValine, leucine and isoleucine degradation\n",
      "nt06036\tLysine degradation\n",
      "nt06010\tUrea cycle\n",
      "nt06037\tHistidine metabolism\n",
      "nt06016\tPhenylalanine and tyrosine metabolism\n",
      "nt06028\tDopamine and serotonin metabolism\n",
      "nt06026\tGlutathione biosynthesis\n",
      "nt06015\tN-Glycan biosynthesis\n",
      "nt06013\tO-Glycan biosynthesis\n",
      "nt06029\tGlycosaminoglycan biosynthesis\n",
      "nt06012\tGlycosaminoglycan degradation\n",
      "nt06018\tGPI-anchor biosynthesis\n",
      "nt06035\tBlood group carbohydrate antigen biosynthesis\n",
      "nt06032\tLipoic acid metabolism\n",
      "nt06038\tFolate metabolism\n",
      "nt06025\tMolybdenum cofactor biosynthesis\n",
      "nt06011\tHeme biosynthesis\n",
      "nt06538\tCobalamin transport and metabolism\n",
      "nt06509\tDNA replication\n",
      "nt06510\tTelomere length regulation\n",
      "nt06504\tBase excision repair\n",
      "nt06502\tNucleotide excision repair\n",
      "nt06503\tMismatch repair\n",
      "nt06506\tDouble-strand break repair\n",
      "nt06508\tInterstrand crosslink repair\n",
      "nt06526\tMAPK signaling\n",
      "nt06530\tPI3K signaling\n",
      "nt06505\tWNT signaling\n",
      "nt06511\tNOTCH signaling\n",
      "nt06501\tHH signaling\n",
      "nt06507\tTGFB signaling\n",
      "nt06518\tJAK-STAT signaling\n",
      "nt06516\tTNF signaling\n",
      "nt06528\tCalcium signaling\n",
      "nt06522\tmTOR signaling\n",
      "nt06542\tHIF signaling\n",
      "nt06543\tNRG-ERBB signaling\n",
      "nt06523\tEpigenetic regulation by Polycomb complexes\n",
      "nt06512\tChromosome cohesion and segregation\n",
      "nt06515\tRegulation of kinetochore-microtubule interactions\n",
      "nt06534\tUnfolded protein response\n",
      "nt06532\tAutophagy\n",
      "nt06536\tMitophagy\n",
      "nt06535\tEfferocytosis\n",
      "nt06524\tApoptosis\n",
      "nt06525\tFerroptosis\n",
      "nt06527\tNecroptosis\n",
      "nt06529\tThermogenesis\n",
      "nt06539\tCytoskeleton in muscle cells\n",
      "nt06541\tCytoskeleton in neurons\n",
      "nt06544\tNeuroactive ligand signaling\n",
      "nt06513\tComplement cascade\n",
      "nt06514\tCoagulation cascade\n",
      "nt06517\tTLR signaling\n",
      "nt06521\tNLR signaling\n",
      "nt06519\tRLR signaling\n",
      "nt06520\tCGAS-STING signaling\n",
      "nt06537\tTCR/BCR signaling\n",
      "nt06533\tChemokine signaling\n",
      "nt06310\tCRH-ACTH-cortisol signaling\n",
      "nt06322\tTRH-TSH-TH signaling\n",
      "nt06323\tKISS1-GnRH-LH/FSH-E2 signaling\n",
      "nt06324\tGHRH-GH-IGF signaling\n",
      "nt06318\tCaSR-PTH signaling\n",
      "nt06316\tRenin-angiotensin-aldosterone signaling\n",
      "nt06325\tHormone/cytokine signaling\n",
      "nt06320\tAPOB-LDLR signaling\n",
      "nt06260\tColorectal cancer\n",
      "nt06261\tGastric cancer\n",
      "nt06262\tPancreatic cancer\n",
      "nt06263\tHepatocellular carcinoma\n",
      "nt06264\tRenal cell carcinoma\n",
      "nt06265\tBladder cancer\n",
      "nt06266\tNon-small cell lung cancer\n",
      "nt06267\tSmall cell lung cancer\n",
      "nt06268\tMelanoma\n",
      "nt06269\tBasal cell carcinoma\n",
      "nt06270\tBreast cancer\n",
      "nt06271\tEndometrial cancer\n",
      "nt06272\tProstate cancer\n",
      "nt06273\tGlioma\n",
      "nt06274\tThyroid cancer\n",
      "nt06275\tAcute myeloid leukemia\n",
      "nt06276\tChronic myeloid leukemia\n",
      "nt06210\tERK signaling (cancer)\n",
      "nt06214\tPI3K signaling (cancer)\n",
      "nt06213\tOther RAS signaling (cancer)\n",
      "nt06211\tOther MAPK signaling (cancer)\n",
      "nt06215\tWNT signaling (cancer)\n",
      "nt06216\tNOTCH signaling (cancer)\n",
      "nt06217\tHH signaling (cancer)\n",
      "nt06218\tTGFB signaling (cancer)\n",
      "nt06219\tJAK-STAT signaling (cancer)\n",
      "nt06220\tCalcium signaling (cancer)\n",
      "nt06234\tcAMP signaling (cancer)\n",
      "nt06222\tIFN signaling (cancer)\n",
      "nt06223\tTNF signaling (cancer)\n",
      "nt06224\tCXCR signaling (cancer)\n",
      "nt06225\tHIF-1 signaling (cancer)\n",
      "nt06226\tKEAP1-NRF2 signaling (cancer)\n",
      "nt06227\tNuclear receptor signaling (cancer)\n",
      "nt06229\tMHC presentation (cancer)\n",
      "nt06230\tCell cycle (cancer)\n",
      "nt06231\tApoptosis (cancer)\n",
      "nt06232\tTelomerase activity (cancer)\n",
      "nt06240\tTranscription (cancer)\n",
      "nt06250\tDNA adduct formation (cancer)\n",
      "nt06251\tCYP-mediated ROS formation (cancer)\n",
      "nt06252\tMitochondrial ROS formation (cancer)\n",
      "nt06253\tAntioxidant system (cancer)\n",
      "nt06460\tAlzheimer disease\n",
      "nt06463\tParkinson disease\n",
      "nt06464\tAmyotrophic lateral sclerosis\n",
      "nt06461\tHuntington disease\n",
      "nt06462\tSpinocerebellar ataxia\n",
      "nt06465\tPrion disease\n",
      "nt06466\tPathways of neurodegeneration\n",
      "nt06360\tCushing syndrome\n",
      "nt06160\tHuman T-cell leukemia virus 1 (HTLV-1)\n",
      "nt06161\tHuman immunodeficiency virus 1 (HIV-1)\n",
      "nt06162\tHepatitis B virus (HBV)\n",
      "nt06163\tHepatitis C virus (HCV)\n",
      "nt06171\tSARS coronavirus 2 (SARS-CoV-2)\n",
      "nt06170\tInfluenza A virus (IAV)\n",
      "nt06169\tMeasles virus (MV)\n",
      "nt06168\tHerpes simplex virus 1 (HSV-1)\n",
      "nt06167\tHuman cytomegalovirus (HCMV)\n",
      "nt06164\tKaposi sarcoma-associated herpesvirus (KSHV)\n",
      "nt06165\tEpstein-Barr virus (EBV)\n",
      "nt06166\tHuman papillomavirus (HPV)\n",
      "nt06180\tPathogenic Escherichia coli\n",
      "nt06181\tSalmonella\n",
      "nt06182\tShigella\n",
      "nt06183\tYersinia\n",
      "\n"
     ]
    }
   ],
   "source": [
    "kegg_pull rest list network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "8bc3095e-6122-46cd-a7ff-7f77cbbaf28f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pull every entry in the network database into the current working directory.\n",
    "# kegg_pull pull database network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "94ea7e25-deb4-4b13-8ca9-9e865f792ccd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "network          KEGG Network Database\n",
      "ne               Release 114.0+/04-11, Apr 25\n",
      "                 Kanehisa Laboratories\n",
      "                 1,637 entries\n",
      "\n",
      "linked db        pathway\n",
      "                 ko\n",
      "                 hsa\n",
      "                 compound\n",
      "                 variant\n",
      "                 disease\n",
      "                 pubmed\n",
      "\n"
     ]
    }
   ],
   "source": [
    "kegg_pull rest info network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "379395b3-8bb4-4282-9967-3b9305540771",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    1415\n"
     ]
    }
   ],
   "source": [
    "kegg_pull rest link network pathway | wc -l"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "d53d358e-7f55-4a49-b277-d6781cabf389",
   "metadata": {},
   "outputs": [],
   "source": [
    "kegg_pull rest link network pathway --output network_pathway.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "40c35d4e-4eee-4c23-97ba-e21410d74c4c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    1414 network_pathway.tsv\n"
     ]
    }
   ],
   "source": [
    "wc -l network_pathway.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "ead20644-6632-4002-aae7-2e73962dafe8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "path:hsa05225\tne:N00005\n",
      "path:hsa05211\tne:N00005\n",
      "path:hsa05223\tne:N00007\n",
      "path:hsa05216\tne:N00009\n",
      "path:hsa05210\tne:N00012\n",
      "path:hsa05212\tne:N00012\n",
      "path:hsa05226\tne:N00012\n",
      "path:hsa05216\tne:N00012\n",
      "path:hsa05221\tne:N00012\n",
      "path:hsa05213\tne:N00012\n"
     ]
    }
   ],
   "source": [
    "head network_pathway.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "62077ac1-2eb4-421c-8133-8da3610b0c3b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    1306\n"
     ]
    }
   ],
   "source": [
    "kegg_pull rest link network disease | wc -l"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "4ad19eb5-09fd-4f88-8d22-ab04b1b0d12f",
   "metadata": {},
   "outputs": [],
   "source": [
    "kegg_pull rest link network disease --output network_disease.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "460b3d26-0221-4347-abd4-860dd6b3e125",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    1305 network_disease.tsv\n"
     ]
    }
   ],
   "source": [
    "wc -l network_disease.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "a2841f84-d8fb-445d-8bef-47155a13cb3e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ds:H01489\tne:nt06018\n",
      "ds:H01486\tne:nt06018\n",
      "ds:H01488\tne:nt06018\n",
      "ds:H01487\tne:nt06018\n",
      "ds:H01127\tne:nt06018\n",
      "ds:H01485\tne:nt06018\n",
      "ds:H00216\tne:nt06019\n",
      "ds:H02314\tne:nt06019\n",
      "ds:H00259\tne:nt06019\n",
      "ds:H01111\tne:nt06019\n"
     ]
    }
   ],
   "source": [
    "head network_disease.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "3a86642c-9caa-4ba4-a493-1811bb060cd0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100%|█████████████████████████████████████████████| 1/1 [00:01<00:00,  1.37s/it]\n"
     ]
    }
   ],
   "source": [
    "kegg_pull pull entry-ids H01489"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "765b890a-bf17-43cc-8ac9-6e0b6b497e16",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      76\n"
     ]
    }
   ],
   "source": [
    "kegg_pull rest link disease pathway | wc -l"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "1744f449-033c-482d-9fd7-60f37a319fc5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "path:hsa05211\tds:H00021\n",
      "path:hsa05110\tds:H00110\n",
      "path:hsa05220\tds:H00004\n",
      "path:hsa05210\tds:H00020\n",
      "path:hsa05212\tds:H00019\n",
      "path:hsa05217\tds:H00039\n",
      "path:hsa05130\tds:H00277\n",
      "path:hsa05130\tds:H00278\n",
      "path:hsa05332\tds:H00084\n",
      "path:hsa05132\tds:H00111\n",
      "path:hsa05223\tds:H00014\n",
      "path:hsa05135\tds:H00298\n",
      "path:hsa05214\tds:H00042\n",
      "path:hsa05221\tds:H00003\n",
      "path:hsa05166\tds:H00009\n",
      "path:hsa05226\tds:H00018\n",
      "path:hsa05224\tds:H00031\n",
      "path:hsa05216\tds:H00032\n",
      "path:hsa05161\tds:H00412\n",
      "path:hsa05144\tds:H00361\n"
     ]
    }
   ],
   "source": [
    "kegg_pull rest link disease pathway | head -20"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "4931f2d0-e53e-4db3-b3ed-0b0d875d3384",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pathway          KEGG Pathway Database\n",
      "path             Release 114.0+/04-11, Apr 25\n",
      "                 Kanehisa Laboratories\n",
      "                 579 entries\n",
      "\n",
      "linked db        module\n",
      "                 ko\n",
      "                 <org>\n",
      "                 genome\n",
      "                 compound\n",
      "                 glycan\n",
      "                 reaction\n",
      "                 rclass\n",
      "                 enzyme\n",
      "                 network\n",
      "                 disease\n",
      "                 drug\n",
      "                 pubmed\n",
      "\n"
     ]
    }
   ],
   "source": [
    "kegg_pull rest info pathway"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "a12375f3-3bcb-4f00-8662-01e6b941cfbb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "disease          KEGG Disease Database\n",
      "ds               Release 114.0+/04-11, Apr 25\n",
      "                 Kanehisa Laboratories\n",
      "                 2,900 entries\n",
      "\n",
      "linked db        pathway\n",
      "                 brite\n",
      "                 ko\n",
      "                 hsa\n",
      "                 genome\n",
      "                 network\n",
      "                 variant\n",
      "                 drug\n",
      "                 pubmed\n",
      "\n"
     ]
    }
   ],
   "source": [
    "kegg_pull rest info disease"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "c6dff0fa-7a12-4e0e-8af6-76f7d37adb71",
   "metadata": {},
   "outputs": [],
   "source": [
    "kegg_pull rest list network --output kegg_network.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf0aa25e-dfae-4bef-aaaa-a45d3417c125",
   "metadata": {},
   "source": [
    "## Getting the number of reference vs disease networks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "488d4d96-4f3f-4725-930f-77efb4aac6a8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100%|███████████████████████████████████████| 1637/1637 [11:53<00:00,  2.30it/s]\n"
     ]
    }
   ],
   "source": [
    "kegg_pull pull database network --output kegg_network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "3c0adc9f-544a-4f52-9e67-232798caa6b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Output file\n",
    "output=\"kegg_network_types.tsv\"\n",
    "> \"$output\"  # Clear or create the file\n",
    "\n",
    "# Iterate over each .txt file in the kegg_network directory\n",
    "for file in kegg_network/*.txt; do\n",
    "    # Get the filename without path and extension\n",
    "    base=$(basename \"$file\" .txt)\n",
    "\n",
    "    # Take the first line starting with TYPE and strip the field name and padding\n",
    "    # ([[:space:]] is portable; not every sed treats \\t as a tab)\n",
    "    type_value=$(grep -m 1 \"^TYPE\" \"$file\" | sed 's/^TYPE[[:space:]]*//')\n",
    "\n",
    "    # Write to the output file\n",
    "    printf '%s\\t%s\\n' \"$base\" \"$type_value\" >> \"$output\"\n",
    "done"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "0ac320a9-9079-4fb6-81fd-975977a82db0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.17it/s]\n",
      "hsa_var:1950v1\n",
      "ENTRY       1950v1                      Variant\n",
      "NAME        EGF overexpression\n",
      "TYPE        Gain of function\n",
      "GENE        EGF  epidermal growth factor [KO:K04357]\n",
      "ORGANISM    hsa_var Human gene variants (Homo sapiens)\n",
      "VARIATION   overexpression\n",
      "NETWORK     nt06210  ERK signaling (cancer)\n",
      "            nt06214  PI3K signaling (cancer)\n",
      "            nt06260  Colorectal cancer\n",
      "            nt06526  MAPK signaling\n",
      "            nt06530  PI3K signaling\n",
      "DISEASE     H00020  Colorectal cancer\n",
      "REFERENCE   PMID:7912978\n",
      "  AUTHORS   Hayashi Y, Widjono YW, Ohta K, Hanioka K, Obayashi C, Itoh K, Imai Y, Itoh H\n",
      "  TITLE     Expression of EGF, EGF-receptor, p53, v-erb B and ras p21 in colorectal neoplasms by immunostaining paraffin-embedded tissues.\n",
      "  JOURNAL   Pathol Int 44:124-30 (1994)\n",
      "            DOI:10.1111/j.1440-1827.1994.tb01696.x\n",
      "REFERENCE   PMID:15668269\n",
      "  AUTHORS   Spano JP, Fagard R, Soria JC, Rixe O, Khayat D, Milano G\n",
      "  TITLE     Epidermal growth factor receptor signaling in colorectal cancer: preclinical data and therapeutic perspectives.\n",
      "  JOURNAL   Ann Oncol 16:189-94 (2005)\n",
      "            DOI:10.1093/annonc/mdi057\n",
      "///\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "kegg_pull pull entry-ids hsa_var:1950v1 --print"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "9dd433af-9634-4dd9-bdb1-e04d6f0610f1",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ENTRY       1950              CDS       T01001\n",
      "SYMBOL      EGF, HOMG4, URG\n",
      "NAME        (RefSeq) epidermal growth factor\n",
      "ORTHOLOGY   K04357  epidermal growth factor\n",
      "ORGANISM    hsa  Homo sapiens (human)\n",
      "PATHWAY     hsa01521  EGFR tyrosine kinase inhibitor resistance\n",
      "            hsa04010  MAPK signaling pathway\n",
      "            hsa04012  ErbB signaling pathway\n",
      "            hsa04014  Ras signaling pathway\n",
      "            hsa04015  Rap1 signaling pathway\n",
      "            hsa04020  Calcium signaling pathway\n",
      "            hsa04066  HIF-1 signaling pathway\n",
      "            hsa04068  FoxO signaling pathway\n",
      "            hsa04072  Phospholipase D signaling pathway\n",
      "            hsa04151  PI3K-Akt signaling pathway\n",
      "            hsa04510  Focal adhesion\n",
      "            hsa04540  Gap junction\n",
      "            hsa04630  JAK-STAT signaling pathway\n",
      "            hsa04810  Regulation of actin cytoskeleton\n",
      "            hsa05160  Hepatitis C\n",
      "            hsa05165  Human papillomavirus infection\n",
      "            hsa05200  Pathways in cancer\n",
      "            hsa05207  Chemical carcinogenesis - receptor activation\n",
      "            hsa05208  Chemical carcinogenesis - reactive oxygen species\n",
      "            hsa05210  Colorectal cancer\n",
      "            hsa05212  Pancreatic cancer\n",
      "            hsa05213  Endometrial cancer\n",
      "            hsa05214  Glioma\n",
      "            hsa05215  Prostate cancer\n",
      "            hsa05218  Melanoma\n",
      "            hsa05219  Bladder cancer\n",
      "            hsa05223  Non-small cell lung cancer\n",
      "            hsa05224  Breast cancer\n",
      "            hsa05226  Gastric cancer\n",
      "            hsa05231  Choline metabolism in cancer\n",
      "            hsa05235  PD-L1 expression and PD-1 checkpoint pathway in cancer\n",
      "NETWORK     nt06160  Human T-cell leukemia virus 1 (HTLV-1)\n",
      "            nt06162  Hepatitis B virus (HBV)\n",
      "            nt06163  Hepatitis C virus (HCV)\n",
      "            nt06164  Kaposi sarcoma-associated herpesvirus (KSHV)\n",
      "            nt06165  Epstein-Barr virus (EBV)\n",
      "            nt06166  Human papillomavirus (HPV)\n",
      "            nt06167  Human cytomegalovirus (HCMV)\n",
      "            nt06170  Influenza A virus (IAV)\n",
      "            nt06180  Pathogenic Escherichia coli\n",
      "            nt06182  Shigella\n",
      "            nt06210  ERK signaling (cancer)\n",
      "            nt06213  Other RAS signaling (cancer)\n",
      "            nt06214  PI3K signaling (cancer)\n",
      "            nt06219  JAK-STAT signaling (cancer)\n",
      "            nt06220  Calcium signaling (cancer)\n",
      "            nt06227  Nuclear receptor signaling (cancer)\n",
      "            nt06260  Colorectal cancer\n",
      "            nt06261  Gastric cancer\n",
      "            nt06262  Pancreatic cancer\n",
      "            nt06263  Hepatocellular carcinoma\n",
      "            nt06265  Bladder cancer\n",
      "            nt06266  Non-small cell lung cancer\n",
      "            nt06268  Melanoma\n",
      "            nt06270  Breast cancer\n",
      "            nt06271  Endometrial cancer\n",
      "            nt06273  Glioma\n",
      "            nt06274  Thyroid cancer\n",
      "            nt06276  Chronic myeloid leukemia\n",
      "            nt06526  MAPK signaling\n",
      "            nt06528  Calcium signaling\n",
      "            nt06530  PI3K signaling\n",
      "  ELEMENT   N00001  EGF-EGFR-RAS-ERK signaling pathway\n",
      "            N00021  EGF-ERBB2-RAS-ERK signaling pathway\n",
      "            N00022  ERBB2-overexpression to RAS-ERK signaling pathway\n",
      "            N00023  EGF-EGFR-PLCG-ERK signaling pathway\n",
      "            N00026  EGF-EGFR-PLCG-CAMK signaling pathway\n",
      "            N00030  EGF-EGFR-RAS-PI3K signaling pathway\n",
      "            N00033  EGF-EGFR-PI3K signaling pathway\n",
      "            N00034  ERBB2-overexpression to PI3K signaling pathway\n",
      "            N00094  EGF-Jak-STAT signaling pathway\n",
      "            N00095  ERBB2-overexpression to EGF-Jak-STAT signaling pathway\n",
      "            N00096  EGF-EGFR-RAS-RASSF1 signaling pathway\n",
      "            N00103  EGF-EGFR-RAS-RalGDS signaling pathway\n",
      "            N00147  EGF-EGFR-PLCG-calcineurin signaling pathway\n",
      "            N00252  Amplified ERBB2 to RAS-ERK signaling pathway\n",
      "            N00253  Amplified ERBB2 to PI3K signaling pathway\n",
      "            N00276  EGF-overexpression to RAS-ERK signaling pathway\n",
      "            N00281  EGF-overexpression to PI3K signaling pathway\n",
      "            N00390  EGF-EGFR-PI3K-NFKB signaling pathway\n",
      "            N00542  EGF-EGFR-RAS-JNK signaling pathway\n",
      "            N01078  EGF-EGFR-Actin signaling pathway\n",
      "            N01364  E2 to nuclear-initiated estrogen signaling pathway\n",
      "            N01592  GF-RTK-RAS-ERK signaling pathway\n",
      "            N01641  RTK-PLCG-ITPR signaling pathway\n",
      "            N01656  GF-RTK-PI3K signaling pathway\n",
      "            N01658  GF-RTK-RAS-PI3K signaling pathway\n",
      "DISEASE     H00020  Colorectal cancer\n",
      "            H01210  Hypomagnesemia\n",
      "BRITE       KEGG Orthology (KO) [BR:hsa00001]\n",
      "             09130 Environmental Information Processing\n",
      "              09132 Signal transduction\n",
      "               04014 Ras signaling pathway\n",
      "                1950 (EGF)\n",
      "               04015 Rap1 signaling pathway\n",
      "                1950 (EGF)\n",
      "               04630 JAK-STAT signaling pathway\n",
      "                1950 (EGF)\n",
      "               04066 HIF-1 signaling pathway\n",
      "                1950 (EGF)\n",
      "               04068 FoxO signaling pathway\n",
      "                1950 (EGF)\n",
      "               04072 Phospholipase D signaling pathway\n",
      "                1950 (EGF)\n",
      "               04151 PI3K-Akt signaling pathway\n",
      "                1950 (EGF)\n",
      "             09160 Human Diseases\n",
      "              09161 Cancer: overview\n",
      "               05200 Pathways in cancer\n",
      "                1950 (EGF)\n",
      "               05207 Chemical carcinogenesis - receptor activation\n",
      "                1950 (EGF)\n",
      "               05208 Chemical carcinogenesis - reactive oxygen species\n",
      "                1950 (EGF)\n",
      "               05231 Choline metabolism in cancer\n",
      "                1950 (EGF)\n",
      "               05235 PD-L1 expression and PD-1 checkpoint pathway in cancer\n",
      "                1950 (EGF)\n",
      "              09162 Cancer: specific types\n",
      "               05210 Colorectal cancer\n",
      "                1950 (EGF)\n",
      "               05212 Pancreatic cancer\n",
      "                1950 (EGF)\n",
      "               05226 Gastric cancer\n",
      "                1950 (EGF)\n",
      "               05214 Glioma\n",
      "                1950 (EGF)\n",
      "               05218 Melanoma\n",
      "                1950 (EGF)\n",
      "               05219 Bladder cancer\n",
      "                1950 (EGF)\n",
      "               05215 Prostate cancer\n",
      "                1950 (EGF)\n",
      "               05213 Endometrial cancer\n",
      "                1950 (EGF)\n",
      "               05224 Breast cancer\n",
      "                1950 (EGF)\n",
      "               05223 Non-small cell lung cancer\n",
      "                1950 (EGF)\n",
      "              09172 Infectious disease: viral\n",
      "               05160 Hepatitis C\n",
      "                1950 (EGF)\n",
      "               05165 Human papillomavirus infection\n",
      "                1950 (EGF)\n",
      "              09176 Drug resistance: antineoplastic\n",
      "               01521 EGFR tyrosine kinase inhibitor resistance\n",
      "                1950 (EGF)\n",
      "             09180 Brite Hierarchies\n",
      "              09183 Protein families: signaling and cellular processes\n",
      "               04052 Cytokines and neuropeptides [BR:hsa04052]\n",
      "                1950 (EGF)\n",
      "            Cytokines and neuropeptides [BR:hsa04052]\n",
      "             Cytokines\n",
      "              Growth factors (RTK binding)\n",
      "               1950 (EGF)\n",
      "POSITION    4:109912883..110013766\n",
      "MOTIF       Pfam: Ldl_recept_b FXa_inhibition cEGF EGF EGF_CA EGF_3 DUF5050 Vgb_lyase Plasmod_Pvs28\n",
      "DBLINKS     NCBI-GeneID: 1950\n",
      "            NCBI-ProteinID: NP_001954\n",
      "            OMIM: 131530\n",
      "            HGNC: 3229\n",
      "            Ensembl: ENSG00000138798\n",
      "            UniProt: P01133\n",
      "STRUCTURE   PDB\n",
      "AASEQ       1207\n",
      "            MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGPAPFLIFSHGNSIFRID\n",
      "            TEGTNYEQLVVDAGVSVIMDFHYNEKRIYWVDLERQLLQRVFLNGSRQERVCNIEKNVSG\n",
      "            MAINWINEEVIWSNQQEGIITVTDMKGNNSHILLSALKYPANVAVDPVERFIFWSSEVAG\n",
      "            SLYRADLDGVGVKALLETSEKITAVSLDVLDKRLFWIQYNREGSNSLICSCDYDGGSVHI\n",
      "            SKHPTQHNLFAMSLFGDRIFYSTWKMKTIWIANKHTGKDMVRINLHSSFVPLGELKVVHP\n",
      "            LAQPKAEDDTWEPEQKLCKLRKGNCSSTVCGQDLQSHLCMCAEGYALSRDRKYCEDVNEC\n",
      "            AFWNHGCTLGCKNTPGSYYCTCPVGFVLLPDGKRCHQLVSCPRNVSECSHDCVLTSEGPL\n",
      "            CFCPEGSVLERDGKTCSGCSSPDNGGCSQLCVPLSPVSWECDCFPGYDLQLDEKSCAASG\n",
      "            PQPFLLFANSQDIRHMHFDGTDYGTLLSQQMGMVYALDHDPVENKIYFAHTALKWIERAN\n",
      "            MDGSQRERLIEEGVDVPEGLAVDWIGRRFYWTDRGKSLIGRSDLNGKRSKIITKENISQP\n",
      "            RGIAVHPMAKRLFWTDTGINPRIESSSLQGLGRLVIASSDLIWPSGITIDFLTDKLYWCD\n",
      "            AKQSVIEMANLDGSKRRRLTQNDVGHPFAVAVFEDYVWFSDWAMPSVMRVNKRTGKDRVR\n",
      "            LQGSMLKPSSLVVVHPLAKPGADPCLYQNGGCEHICKKRLGTAWCSCREGFMKASDGKTC\n",
      "            LALDGHQLLAGGEVDLKNQVTPLDILSKTRVSEDNITESQHMLVAEIMVSDQDDCAPVGC\n",
      "            SMYARCISEGEDATCQCLKGFAGDGKLCSDIDECEMGVPVCPPASSKCINTEGGYVCRCS\n",
      "            EGYQGDGIHCLDIDECQLGEHSCGENASCTNTEGGYTCMCAGRLSEPGLICPDSTPPPHL\n",
      "            REDDHHYSVRNSDSECPLSHDGYCLHDGVCMYIEALDKYACNCVVGYIGERCQYRDLKWW\n",
      "            ELRHAGHGQQQKVIVVAVCVVVLVMLLLLSLWGAHYYRTQKLLSKNPKNPYEESSRDVRS\n",
      "            RRPADTEDGMSSCPQPWFVVIKEHQDLKNGGQPVAGEDGQAADGSMQPTSWRQEPQLCGM\n",
      "            GTEQGCWIPVSSDKGSCPQVMERSFHMPSYGTQTLEGGVEKPHSLLSANPLWQQRALDPP\n",
      "            HQMELTQ\n",
      "NTSEQ       3624\n",
      "            atgctgctcactcttatcattctgttgccagtagtttcaaaatttagttttgttagtctc\n",
      "            tcagcaccgcagcactggagctgtcctgaaggtactctcgcaggaaatgggaattctact\n",
      "            tgtgtgggtcctgcacccttcttaattttctcccatggaaatagtatctttaggattgac\n",
      "            acagaaggaaccaattatgagcaattggtggtggatgctggtgtctcagtgatcatggat\n",
      "            tttcattataatgagaaaagaatctattgggtggatttagaaagacaacttttgcaaaga\n",
      "            gtttttctgaatgggtcaaggcaagagagagtatgtaatatagagaaaaatgtttctgga\n",
      "            atggcaataaattggataaatgaagaagttatttggtcaaatcaacaggaaggaatcatt\n",
      "            acagtaacagatatgaaaggaaataattcccacattcttttaagtgctttaaaatatcct\n",
      "            gcaaatgtagcagttgatccagtagaaaggtttatattttggtcttcagaggtggctgga\n",
      "            agcctttatagagcagatctcgatggtgtgggagtgaaggctctgttggagacatcagag\n",
      "            aaaataacagctgtgtcattggatgtgcttgataagcggctgttttggattcagtacaac\n",
      "            agagaaggaagcaattctcttatttgctcctgtgattatgatggaggttctgtccacatt\n",
      "            agtaaacatccaacacagcataatttgtttgcaatgtccctttttggtgaccgtatcttc\n",
      "            tattcaacatggaaaatgaagacaatttggatagccaacaaacacactggaaaggacatg\n",
      "            gttagaattaacctccattcatcatttgtaccacttggtgaactgaaagtagtgcatcca\n",
      "            cttgcacaacccaaggcagaagatgacacttgggagcctgagcagaaactttgcaaattg\n",
      "            aggaaaggaaactgcagcagcactgtgtgtgggcaagacctccagtcacacttgtgcatg\n",
      "            tgtgcagagggatacgccctaagtcgagaccggaagtactgtgaagatgttaatgaatgt\n",
      "            gctttttggaatcatggctgtactcttgggtgtaaaaacacccctggatcctattactgc\n",
      "            acgtgccctgtaggatttgttctgcttcctgatgggaaacgatgtcatcaacttgtttcc\n",
      "            tgtccacgcaatgtgtctgaatgcagccatgactgtgttctgacatcagaaggtccctta\n",
      "            tgtttctgtcctgaaggctcagtgcttgagagagatgggaaaacatgtagcggttgttcc\n",
      "            tcacccgataatggtggatgtagccagctctgcgttcctcttagcccagtatcctgggaa\n",
      "            tgtgattgctttcctgggtatgacctacaactggatgaaaaaagctgtgcagcttcagga\n",
      "            ccacaaccatttttgctgtttgccaattctcaagatattcgacacatgcattttgatgga\n",
      "            acagactatggaactctgctcagccagcagatgggaatggtttatgccctagatcatgac\n",
      "            cctgtggaaaataagatatactttgcccatacagccctgaagtggatagagagagctaat\n",
      "            atggatggttcccagcgagaaaggcttattgaggaaggagtagatgtgccagaaggtctt\n",
      "            gctgtggactggattggccgtagattctattggacagacagagggaaatctctgattgga\n",
      "            aggagtgatttaaatgggaaacgttccaaaataatcactaaggagaacatctctcaacca\n",
      "            cgaggaattgctgttcatccaatggccaagagattattctggactgatacagggattaat\n",
      "            ccacgaattgaaagttcttccctccaaggccttggccgtctggttatagccagctctgat\n",
      "            ctaatctggcccagtggaataacgattgacttcttaactgacaagttgtactggtgcgat\n",
      "            gccaagcagtctgtgattgaaatggccaatctggatggttcaaaacgccgaagacttacc\n",
      "            cagaatgatgtaggtcacccatttgctgtagcagtgtttgaggattatgtgtggttctca\n",
      "            gattgggctatgccatcagtaatgagagtaaacaagaggactggcaaagatagagtacgt\n",
      "            ctccaaggcagcatgctgaagccctcatcactggttgtggttcatccattggcaaaacca\n",
      "            ggagcagatccctgcttatatcaaaacggaggctgtgaacatatttgcaaaaagaggctt\n",
      "            ggaactgcttggtgttcgtgtcgtgaaggttttatgaaagcctcagatgggaaaacgtgt\n",
      "            ctggctctggatggtcatcagctgttggcaggtggtgaagttgatctaaagaaccaagta\n",
      "            acaccattggacatcttgtccaagactagagtgtcagaagataacattacagaatctcaa\n",
      "            cacatgctagtggctgaaatcatggtgtcagatcaagatgactgtgctcctgtgggatgc\n",
      "            agcatgtatgctcggtgtatttcagagggagaggatgccacatgtcagtgtttgaaagga\n",
      "            tttgctggggatggaaaactatgttctgatatagatgaatgtgagatgggtgtcccagtg\n",
      "            tgcccccctgcctcctccaagtgcatcaacaccgaaggtggttatgtctgccggtgctca\n",
      "            gaaggctaccaaggagatgggattcactgtcttgatattgatgagtgccaactgggggag\n",
      "            cacagctgtggagagaatgccagctgcacaaatacagagggaggctatacctgcatgtgt\n",
      "            gctggacgcctgtctgaaccaggactgatttgccctgactctactccaccccctcacctc\n",
      "            agggaagatgaccaccactattccgtaagaaatagtgactctgaatgtcccctgtcccac\n",
      "            gatgggtactgcctccatgatggtgtgtgcatgtatattgaagcattggacaagtatgca\n",
      "            tgcaactgtgttgttggctacatcggggagcgatgtcagtaccgagacctgaagtggtgg\n",
      "            gaactgcgccacgctggccacgggcagcagcagaaggtcatcgtggtggctgtctgcgtg\n",
      "            gtggtgcttgtcatgctgctcctcctgagcctgtggggggcccactactacaggactcag\n",
      "            aagctgctatcgaaaaacccaaagaatccttatgaggagtcgagcagagatgtgaggagt\n",
      "            cgcaggcctgctgacactgaggatgggatgtcctcttgccctcaaccttggtttgtggtt\n",
      "            ataaaagaacaccaagacctcaagaatgggggtcaaccagtggctggtgaggatggccag\n",
      "            gcagcagatgggtcaatgcaaccaacttcatggaggcaggagccccagttatgtggaatg\n",
      "            ggcacagagcaaggctgctggattccagtatccagtgataagggctcctgtccccaggta\n",
      "            atggagcgaagctttcatatgccctcctatgggacacagacccttgaagggggtgtcgag\n",
      "            aagccccattctctcctatcagctaacccattatggcaacaaagggccctggacccacca\n",
      "            caccaaatggagctgactcagtga\n",
      "///\n",
      "\n"
     ]
    }
   ],
   "source": [
    "kegg_pull rest get hsa:1950"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "926ac49a-8cc3-472c-ae07-4ecd4b70f5aa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "T01001           Homo sapiens (human) KEGG Genes Database\n",
      "hsa              Release 114.0+/04-11, Apr 25\n",
      "                 Kanehisa Laboratories\n",
      "                 24,685 entries\n",
      "\n",
      "linked db        pathway\n",
      "                 brite\n",
      "                 module\n",
      "                 ko\n",
      "                 genome\n",
      "                 enzyme\n",
      "                 network\n",
      "                 disease\n",
      "                 drug\n",
      "                 ncbi-geneid\n",
      "                 ncbi-proteinid\n",
      "                 uniprot\n",
      "\n"
     ]
    }
   ],
   "source": [
    "kegg_pull rest info hsa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "b58ec9a4-dce3-4900-8021-9bc9155a925e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "variant          KEGG Variant Database\n",
      "hsa_var          Release 114.0+/04-12, Apr 25\n",
      "                 Kanehisa Laboratories\n",
      "                 1,536 entries\n",
      "\n",
      "linked db        network\n",
      "                 disease\n",
      "                 drug\n",
      "                 pubmed\n",
      "\n"
     ]
    }
   ],
   "source": [
    "kegg_pull rest info variant"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "cd56ab73-c7ab-4e9f-bf79-c1f2fe6c3b40",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10000v1\tAKT3 mutation\n",
      "10026v1\tPIGK deficiency\n",
      "10075v1\tHUWE1 mutation\n",
      "100v1\tADA deficiency\n",
      "10111v1\tRAD50 mutation\n",
      "10133v1\tOPTN mutation\n",
      "10133v2\tOPTN activating mutation\n",
      "10157v1\tAASS deficiency\n",
      "10195v1\tALG3 deficiency\n",
      "1019v1\tCDK4 amplification\n",
      "1019v2\tCDK4 mutation\n",
      "10243v1\tGPHN deficiency\n",
      "10274v1\tSTAG1 mutation\n",
      "1027v1\tCDKN1B loss\n",
      "1027v2\tCDKN1B reduced expression\n",
      "1027v3\tCDKN1B mutation\n",
      "10280v1\tSIGMAR1 mutation\n",
      "10293v1\tTRAIP mutation\n",
      "10297v1\tAPC2 mutation\n",
      "1029v1\tCDKN2A deletion\n"
     ]
    }
   ],
   "source": [
    "kegg_pull rest list variant | head -20"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "771ddf6d-dafc-4368-a5a3-0b6a2abdeeb3",
   "metadata": {},
   "source": [
    "## Subsetting data to the Variant set of the networks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "id": "bc2e85c1-71d8-4c68-b0be-00ec72217cc7",
   "metadata": {},
   "outputs": [],
   "source": [
    "mkdir -p network_variant"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "590d7a64-857a-44bc-9251-90cdbc1d4181",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copy every network listed in network_variants.txt (one ID per line)\n",
    "while IFS= read -r p; do\n",
    "    cp \"kegg_network/${p}.txt\" network_variant/\n",
    "done < network_variants.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "fce29a29-33ca-4a03-89e2-2ed4aafc929f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     298\n"
     ]
    }
   ],
   "source": [
    "ls network_variant/* | wc -l"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "id": "a339aed9-8ef3-421d-a454-5aea56b0124c",
   "metadata": {},
   "outputs": [],
   "source": [
    "#!/bin/bash\n",
    "\n",
    "output=\"gene_variants.tsv\"\n",
    "> \"$output\"  # Clear the output file\n",
    "\n",
    "for file in network_variant/*.txt; do\n",
    "    base=$(basename \"$file\" .txt)\n",
    "\n",
    "    # Find and extract all matches of digits-v-digits\n",
    "    grep -oE \"[0-9]+v[0-9]+\" \"$file\" | while read -r match; do\n",
    "        echo -e \"${base}\\t${match}\" >> \"$output\"\n",
    "    done\n",
    "done"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "id": "279cc34b-6bd9-46d2-be11-785e9793ecfe",
   "metadata": {},
   "outputs": [],
   "source": [
    "sort -u gene_variants.tsv -o gene_variants.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "id": "88ffb62a-be1a-4e05-854c-b061d596985c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     328 gene_variants.tsv\n"
     ]
    }
   ],
   "source": [
    "wc -l gene_variants.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "id": "e1895ec2-6791-4daf-bcbf-3817e2e3a963",
   "metadata": {},
   "outputs": [],
   "source": [
    "cut -f 2 gene_variants.tsv > gene_variants.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "id": "21e29f30-fecb-4360-ae18-278757bdbd0e",
   "metadata": {},
   "outputs": [],
   "source": [
    "sort -u gene_variants.txt -o gene_variants.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "id": "015831af-db64-4386-b595-b2ceab4369d2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     200 gene_variants.txt\n"
     ]
    }
   ],
   "source": [
    "wc -l gene_variants.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "id": "4ddf080a-1786-44ea-acdd-e7e474952ee3",
   "metadata": {},
   "outputs": [],
   "source": [
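    "# -i.bak works with both GNU and BSD/macOS sed (plain -i '' is BSD/macOS-only)\n",
    "sed -i.bak 's/^/hsa_var:/' gene_variants.txt && rm gene_variants.txt.bak"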
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "id": "5bb68dd7-08e6-4c3d-99c0-6131df914af6",
   "metadata": {},
   "outputs": [],
   "source": [
    "mkdir -p variant_info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "id": "f225e592-665e-4442-a385-73addd61b902",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100%|█████████████████████████████████████████| 200/200 [00:51<00:00,  3.85it/s]\n"
     ]
    }
   ],
   "source": [
    "cat gene_variants.txt | kegg_pull pull entry-ids - --output=variant_info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "679f5a1b-f4d1-4585-8a11-93bf81fcf795",
   "metadata": {},
   "outputs": [],
   "source": [
    "cat variant_info/* > all_variants.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd86096d-cf04-4117-b4c2-9736327dcc86",
   "metadata": {},
   "outputs": [],
   "source": [
    "cp all_variants.txt all_variants_filtered.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "810dd902-7dad-4fe7-b028-7a865c9d35d6",
   "metadata": {},
   "source": [
    "### Switching to Python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5bc079e4-bf59-404a-9f4b-c8b2b706a03a",
   "metadata": {},
   "outputs": [],
   "source": [
    "cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6851a309-c86f-477f-a355-feb3e959aa48",
   "metadata": {},
   "outputs": [],
   "source": [
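    "import re\n",
    "\n",
    "def remove_references(text):\n",
    "    \"\"\"Drop every REFERENCE block from KEGG flat-file text.\"\"\"\n",
    "    # Match a line starting with 'REFERENCE' plus all following continuation lines\n",
    "    # (those indented by at least two spaces: AUTHORS, TITLE, JOURNAL, DOI).\n",
    "    # Anchoring with ^ and not requiring a PMID also removes references without one.\n",
    "    cleaned_text = re.sub(r'^REFERENCE.*\\n(?: {2}.*\\n)*', '', text, flags=re.MULTILINE)\n",
    "    return cleaned_text"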
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "98db6b07-480f-42e7-8b45-6f710a33b6ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('all_variants_filtered.txt', 'r') as f:\n",
    "    original_text = f.read()\n",
    "\n",
    "cleaned_text = remove_references(original_text)\n",
    "\n",
    "with open('all_variants_filtered.txt', 'w') as f:\n",
    "    f.write(cleaned_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "dc011a9b-7c1b-40c8-86f8-1c1b28bae6b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def remove_network(text):\n",
    "    # Drop each NETWORK field: its header line plus the indented continuation lines\n",
    "    lines = text.split('\\n')\n",
    "    cleaned_lines = []\n",
    "    skip_block = False\n",
    "\n",
    "    for line in lines:\n",
    "        if line.startswith(\"NETWORK\"):\n",
    "            skip_block = True\n",
    "            continue\n",
    "        if skip_block:\n",
    "            if line.startswith(\" \") or line.startswith(\"\\t\"):\n",
    "                continue\n",
    "            else:\n",
    "                skip_block = False\n",
    "        if not skip_block:\n",
    "            cleaned_lines.append(line)\n",
    "\n",
    "    return '\\n'.join(cleaned_lines)\n",
    "\n",
    "\n",
    "with open('all_variants_filtered.txt', 'r') as f:\n",
    "    original_text = f.read()\n",
    "\n",
    "cleaned_text = remove_network(original_text)\n",
    "\n",
    "with open('all_variants_filtered.txt', 'w') as f:\n",
    "    f.write(cleaned_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "65f14020-a3b1-4c71-a529-421eb66c70f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def remove_disease(text):\n",
    "    # Drop each DISEASE field: its header line plus the indented continuation lines\n",
    "    lines = text.split('\\n')\n",
    "    cleaned_lines = []\n",
    "    skip_block = False\n",
    "\n",
    "    for line in lines:\n",
    "        if line.startswith(\"DISEASE\"):\n",
    "            skip_block = True\n",
    "            continue\n",
    "        if skip_block and line.startswith((\" \", \"\\t\")):\n",
    "            continue\n",
    "        skip_block = False\n",
    "        cleaned_lines.append(line)\n",
    "\n",
    "    return '\\n'.join(cleaned_lines)\n",
    "\n",
    "\n",
    "with open('all_variants_filtered.txt', 'r') as f:\n",
    "    original_text = f.read()\n",
    "\n",
    "cleaned_text = remove_disease(original_text)\n",
    "\n",
    "with open('all_variants_filtered.txt', 'w') as f:\n",
    "    f.write(cleaned_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "c17ee2ae-a9ae-4ff3-b8ae-6a8bbc40e758",
   "metadata": {},
   "outputs": [],
   "source": [
    "def remove_drug_target(text):\n",
    "    # Drop each DRUG_TARGET field: its header line plus the indented continuation lines\n",
    "    lines = text.split('\\n')\n",
    "    cleaned_lines = []\n",
    "    skip_block = False\n",
    "\n",
    "    for line in lines:\n",
    "        if line.startswith(\"DRUG_TARGET\"):\n",
    "            skip_block = True\n",
    "            continue\n",
    "        if skip_block and line.startswith((\" \", \"\\t\")):\n",
    "            continue\n",
    "        skip_block = False\n",
    "        cleaned_lines.append(line)\n",
    "\n",
    "    return '\\n'.join(cleaned_lines)\n",
    "\n",
    "\n",
    "with open('all_variants_filtered.txt', 'r') as f:\n",
    "    original_text = f.read()\n",
    "\n",
    "cleaned_text = remove_drug_target(original_text)\n",
    "\n",
    "with open('all_variants_filtered.txt', 'w') as f:\n",
    "    f.write(cleaned_text)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "114de6af-3c21-40a6-ad39-76c6c85adf38",
   "metadata": {},
   "source": [
    "ChatGPT was used to parse this file into a TSV with three columns: Entry, Source, and ID (the cells below read it as `parsed_variants.tsv`).\n",
    "\n",
    "Source is the variant database the ID comes from: OmimVar, ClinVar, dbSNP, COSM, dbVar, or COSF."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e397b34-94e0-4564-bb21-2a92d161b5af",
   "metadata": {},
   "source": [
    "### Switch back to bash"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8fcdd7d0-27de-45f0-b6d3-96f9ee39f183",
   "metadata": {},
   "source": [
    "# Downloading all Variant Information"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "115df1ef-4e1c-4f31-babf-cd85960e6fea",
   "metadata": {},
   "source": [
    "**dbVar is not used here, as it has been discontinued and most links to dbVar are broken.** ClinVar is the alternative and holds all of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "61487285-a1af-4c1a-8b20-a52c8f26951b",
   "metadata": {},
   "outputs": [],
   "source": [
    "cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6ea7baea-ae92-4a93-a524-f44608dbe6d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "rm all_variants.txt\n",
    "rm all_variants_filtered.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b66473bd-cebc-4fa2-bdd8-310b67e82aaf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      60\n",
      "     235\n",
      "     201\n",
      "     202\n",
      "      28\n",
      "      87\n"
     ]
    }
   ],
   "source": [
    "grep OmimVar parsed_variants.tsv | wc -l\n",
    "grep ClinVar  parsed_variants.tsv | wc -l\n",
    "grep dbSNP  parsed_variants.tsv | wc -l\n",
    "grep COSM  parsed_variants.tsv | wc -l\n",
    "grep dbVar  parsed_variants.tsv | wc -l\n",
    "grep COSF parsed_variants.tsv | wc -l"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0eaac99f-2166-43c7-9b16-90616b71272d",
   "metadata": {},
   "source": [
    "### OmimVar"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a135480-1dab-491c-b92b-03e9cc579c71",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "esearch -db clinvar -query \"601556[mim]\" | efetch -format docsum"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b364d25-e8dc-4815-8a62-0dd00554d875",
   "metadata": {},
   "source": [
    "In the returned output, find the variant ID, then fetch the document summary for that specific ID."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "cf01ffe2-79d1-4dd5-b130-2d1a07554b90",
   "metadata": {},
   "outputs": [],
   "source": [
    "mkdir -p Omim  # make sure the output directory exists\n",
    "grep OmimVar parsed_variants.tsv | cut -f3 > Omim/OmimVar_id.txt"
   ]
  },
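  {
   "cell_type": "markdown",
   "id": "f48e906b-70f7-420b-ba97-5105eda3c74d",
   "metadata": {},
   "source": [
    "Running this in a bash loop proved difficult, so each query is run manually below. A likely cause (not verified here) is that `esearch` reads standard input and consumes the rest of the loop's input; redirecting its input with `< /dev/null` is the usual workaround."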
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3697f89a-6e17-435d-906d-927947259177",
   "metadata": {},
   "outputs": [],
   "source": [
    "esearch -db clinvar -query \"601978[mim]\" | efetch -format docsum > Omim/601978.xml\n",
    "esearch -db clinvar -query \"602533[mim]\" | efetch -format docsum > Omim/602533.xml\n",
    "esearch -db clinvar -query \"609007[mim]\" | efetch -format docsum > Omim/609007.xml\n",
    "esearch -db clinvar -query \"111730[mim]\" | efetch -format docsum > Omim/111730.xml\n",
    "esearch -db clinvar -query \"603448[mim]\" | efetch -format docsum > Omim/603448.xml\n",
    "esearch -db clinvar -query \"608300[mim]\" | efetch -format docsum > Omim/608300.xml\n",
    "esearch -db clinvar -query \"601143[mim]\" | efetch -format docsum > Omim/601143.xml\n",
    "esearch -db clinvar -query \"614260[mim]\" | efetch -format docsum > Omim/614260.xml\n",
    "esearch -db clinvar -query \"600543[mim]\" | efetch -format docsum > Omim/600543.xml\n",
    "esearch -db clinvar -query \"605078[mim]\" | efetch -format docsum > Omim/605078.xml\n",
    "esearch -db clinvar -query \"137070[mim]\" | efetch -format docsum > Omim/137070.xml\n",
    "esearch -db clinvar -query \"211100[mim]\" | efetch -format docsum > Omim/211100.xml\n",
    "esearch -db clinvar -query \"182100[mim]\" | efetch -format docsum > Omim/182100.xml\n",
    "esearch -db clinvar -query \"111100[mim]\" | efetch -format docsum > Omim/111100.xml\n",
    "esearch -db clinvar -query \"189980[mim]\" | efetch -format docsum > Omim/189980.xml\n",
    "esearch -db clinvar -query \"606463[mim]\" | efetch -format docsum > Omim/606463.xml\n",
    "esearch -db clinvar -query \"600429[mim]\" | efetch -format docsum > Omim/600429.xml\n",
    "esearch -db clinvar -query \"603371[mim]\" | efetch -format docsum > Omim/603371.xml\n",
    "esearch -db clinvar -query \"613109[mim]\" | efetch -format docsum > Omim/613109.xml\n",
    "esearch -db clinvar -query \"604834[mim]\" | efetch -format docsum > Omim/604834.xml\n",
    "esearch -db clinvar -query \"604473[mim]\" | efetch -format docsum > Omim/604473.xml\n",
    "esearch -db clinvar -query \"300264[mim]\" | efetch -format docsum > Omim/300264.xml\n",
    "esearch -db clinvar -query \"613004[mim]\" | efetch -format docsum > Omim/613004.xml\n",
    "esearch -db clinvar -query \"308000[mim]\" | efetch -format docsum > Omim/308000.xml\n",
    "esearch -db clinvar -query \"104760[mim]\" | efetch -format docsum > Omim/104760.xml\n",
    "esearch -db clinvar -query \"102600[mim]\" | efetch -format docsum > Omim/102600.xml\n",
    "esearch -db clinvar -query \"176264[mim]\" | efetch -format docsum > Omim/176264.xml\n",
    "esearch -db clinvar -query \"605411[mim]\" | efetch -format docsum > Omim/605411.xml\n",
    "esearch -db clinvar -query \"600734[mim]\" | efetch -format docsum > Omim/600734.xml\n",
    "esearch -db clinvar -query \"607047[mim]\" | efetch -format docsum > Omim/607047.xml\n",
    "esearch -db clinvar -query \"176763[mim]\" | efetch -format docsum > Omim/176763.xml\n",
    "esearch -db clinvar -query \"602544[mim]\" | efetch -format docsum > Omim/602544.xml\n",
    "esearch -db clinvar -query \"131340[mim]\" | efetch -format docsum > Omim/131340.xml\n",
    "esearch -db clinvar -query \"176610[mim]\" | efetch -format docsum > Omim/176610.xml\n",
    "esearch -db clinvar -query \"607922[mim]\" | efetch -format docsum > Omim/607922.xml\n",
    "esearch -db clinvar -query \"176640[mim]\" | efetch -format docsum > Omim/176640.xml\n",
    "esearch -db clinvar -query \"176801[mim]\" | efetch -format docsum > Omim/176801.xml\n",
    "esearch -db clinvar -query \"104311[mim]\" | efetch -format docsum > Omim/104311.xml\n",
    "esearch -db clinvar -query \"600759[mim]\" | efetch -format docsum > Omim/600759.xml\n",
    "esearch -db clinvar -query \"601556[mim]\" | efetch -format docsum > Omim/601556.xml\n",
    "esearch -db clinvar -query \"601517[mim]\" | efetch -format docsum > Omim/601517.xml\n",
    "esearch -db clinvar -query \"612895[mim]\" | efetch -format docsum > Omim/612895.xml\n",
    "esearch -db clinvar -query \"608309[mim]\" | efetch -format docsum > Omim/608309.xml\n",
    "esearch -db clinvar -query \"163890[mim]\" | efetch -format docsum > Omim/163890.xml\n",
    "esearch -db clinvar -query \"147450[mim]\" | efetch -format docsum > Omim/147450.xml\n",
    "esearch -db clinvar -query \"604985[mim]\" | efetch -format docsum > Omim/604985.xml\n",
    "esearch -db clinvar -query \"606765[mim]\" | efetch -format docsum > Omim/606765.xml\n",
    "esearch -db clinvar -query \"602345[mim]\" | efetch -format docsum > Omim/602345.xml\n",
    "esearch -db clinvar -query \"191110[mim]\" | efetch -format docsum > Omim/191110.xml\n",
    "esearch -db clinvar -query \"191342[mim]\" | efetch -format docsum > Omim/191342.xml\n",
    "esearch -db clinvar -query \"601023[mim]\" | efetch -format docsum > Omim/601023.xml\n",
    "esearch -db clinvar -query \"608537[mim]\" | efetch -format docsum > Omim/608537.xml\n",
    "esearch -db clinvar -query \"601011[mim]\" | efetch -format docsum > Omim/601011.xml\n",
    "esearch -db clinvar -query \"114206[mim]\" | efetch -format docsum > Omim/114206.xml\n",
    "esearch -db clinvar -query \"603094[mim]\" | efetch -format docsum > Omim/603094.xml\n",
    "esearch -db clinvar -query \"601530[mim]\" | efetch -format docsum > Omim/601530.xml\n",
    "esearch -db clinvar -query \"607904[mim]\" | efetch -format docsum > Omim/607904.xml\n",
    "esearch -db clinvar -query \"605704[mim]\" | efetch -format docsum > Omim/605704.xml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "43c6599c-6ec6-48a5-8264-8e86c1869e63",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      58 Omim/OmimVar_id.txt\n"
     ]
    }
   ],
   "source": [
    "wc -l Omim/OmimVar_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "a56a9106-06c6-4d23-983f-227ca14f85a4",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "601978 exists.\n",
      "602533 exists.\n",
      "609007 exists.\n",
      "111730 exists.\n",
      "603448 exists.\n",
      "608300 exists.\n",
      "601143 exists.\n",
      "614260 exists.\n",
      "600543 exists.\n",
      "605078 exists.\n",
      "137070 exists.\n",
      "211100 exists.\n",
      "182100 exists.\n",
      "111100 exists.\n",
      "189980 exists.\n",
      "606463 exists.\n",
      "600429 exists.\n",
      "603371 exists.\n",
      "613109 exists.\n",
      "604834 exists.\n",
      "604473 exists.\n",
      "300264 exists.\n",
      "613004 exists.\n",
      "308000 exists.\n",
      "104760 exists.\n",
      "102600 exists.\n",
      "176264 exists.\n",
      "605411 exists.\n",
      "600734 exists.\n",
      "607047 exists.\n",
      "176763 exists.\n",
      "602544 exists.\n",
      "131340 exists.\n",
      "176610 exists.\n",
      "607922 exists.\n",
      "176640 exists.\n",
      "176801 exists.\n",
      "104311 exists.\n",
      "600759 exists.\n",
      "601556 exists.\n",
      "601517 exists.\n",
      "612895 exists.\n",
      "608309 exists.\n",
      "163890 exists.\n",
      "147450 exists.\n",
      "604985 exists.\n",
      "606765 exists.\n",
      "602345 exists.\n",
      "191110 exists.\n",
      "191342 exists.\n",
      "601023 exists.\n",
      "608537 exists.\n",
      "601011 exists.\n",
      "114206 exists.\n",
      "603094 exists.\n",
      "601530 exists.\n",
      "607904 exists.\n",
      "605704 exists.\n"
     ]
    }
   ],
   "source": [
    "while read -r p; do\n",
    "  if test -f \"Omim/$p.xml\"; then\n",
    "    echo \"$p exists.\"\n",
    "  else\n",
    "    echo \"$p does not exist.\"\n",
    "  fi\n",
    "done < Omim/OmimVar_id.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60f69460-27c4-4a92-995f-80a7540cb610",
   "metadata": {},
   "source": [
    "Switch to Python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eba16469-12ac-4d00-8888-d6d997ce29f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "420ee5bf-976b-4805-9895-93e202f20ba2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "e9e90379-be34-418e-ad34-1d78fca075f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "def extract_linked_ids(xml_path, target_omim_prefix, outfile):\n",
    "    \"\"\"Write the cross-references that share a <variation_xrefs> block with a matching OMIM ID.\n",
    "\n",
    "    For every block containing an OMIM db_id that starts with\n",
    "    target_omim_prefix, the OMIM ID and all non-OMIM source:id pairs\n",
    "    are written to outfile, followed by a blank line.\n",
    "    \"\"\"\n",
    "    tree = ET.parse(xml_path)\n",
    "    root = tree.getroot()\n",
    "\n",
    "    for variation_xrefs in root.iter('variation_xrefs'):\n",
    "        block = []  # every (source, id) pair in this block\n",
    "        matched_omim_id = None\n",
    "\n",
    "        for xref in variation_xrefs.findall('variation_xref'):\n",
    "            db_source = xref.findtext('db_source')\n",
    "            db_id = xref.findtext('db_id')\n",
    "\n",
    "            if db_source and db_id:\n",
    "                if db_source == \"OMIM\" and db_id.startswith(target_omim_prefix):\n",
    "                    matched_omim_id = db_id\n",
    "                block.append((db_source, db_id))\n",
    "\n",
    "        # Only report blocks that reference the requested OMIM entry\n",
    "        if matched_omim_id:\n",
    "            outfile.write(f\"OMIM ID found: {matched_omim_id}\\n\")\n",
    "            for source, id_ in block:\n",
    "                if source != \"OMIM\":\n",
    "                    outfile.write(f\"{source}:{id_}\\n\")\n",
    "            outfile.write(\"\\n\")  # Blank line between blocks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "f2fb49e1-feb3-41c3-81a2-a1c5e5f0c9bf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded OMIM IDs: ['601978', '602533', '609007', '111730', '603448']\n"
     ]
    }
   ],
   "source": [
    "# Load OMIM IDs from file into a list\n",
    "with open(\"Omim/OmimVar_id.txt\", \"r\") as f:\n",
    "    omim_ids = [line.strip() for line in f if line.strip()]\n",
    "\n",
    "# Optional: print first few IDs\n",
    "print(\"Loaded OMIM IDs:\", omim_ids[:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "45c074e5-9026-49d0-8a03-18537db41451",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Fixed: 609007 → saved to Omim_fixed/609007.xml\n",
      "✅ Fixed: 601143 → saved to Omim_fixed/601143.xml\n",
      "✅ Fixed: 604985 → saved to Omim_fixed/604985.xml\n",
      "✅ Fixed: 608537 → saved to Omim_fixed/608537.xml\n",
      "✅ Fixed: 601011 → saved to Omim_fixed/601011.xml\n",
      "✅ Fixed: 114206 → saved to Omim_fixed/114206.xml\n",
      "✅ Fixed: 607904 → saved to Omim_fixed/607904.xml\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import re\n",
    "\n",
    "# Some XML files were malformed, so wrap the problematic ones in a single common <root> element.\n",
    "problematic_ids = [\n",
    "    \"609007\", \"601143\", \"604985\", \"608537\", \"601011\", \"114206\", \"607904\"\n",
    "]\n",
    "\n",
    "input_folder = \"Omim\"\n",
    "output_folder = \"Omim_fixed\"\n",
    "os.makedirs(output_folder, exist_ok=True)\n",
    "\n",
    "for omim_id in problematic_ids:\n",
    "    input_file = os.path.join(input_folder, f\"{omim_id}.xml\")\n",
    "    output_file = os.path.join(output_folder, f\"{omim_id}.xml\")\n",
    "\n",
    "    with open(input_file, \"r\") as f:\n",
    "        xml_content = f.read()\n",
    "\n",
    "    # Remove leading/trailing whitespace\n",
    "    xml_content = xml_content.strip()\n",
    "\n",
    "    # Remove any existing XML declaration or DOCTYPE lines\n",
    "    xml_content = re.sub(r'<\\?xml[^>]+\\?>', '', xml_content)\n",
    "    xml_content = re.sub(r'<!DOCTYPE[^>]*>', '', xml_content)\n",
    "\n",
    "    # Wrap content in <root> and insert declarations at the top\n",
    "    fixed_xml = (\n",
    "        '<?xml version=\"1.0\" encoding=\"UTF-8\"?>\\n'\n",
    "        '<!DOCTYPE root>\\n'\n",
    "        '<root>\\n'\n",
    "        f'{xml_content.strip()}\\n'\n",
    "        '</root>'\n",
    "    )\n",
    "\n",
    "    # Write the fixed file\n",
    "    with open(output_file, \"w\") as f:\n",
    "        f.write(fixed_xml)\n",
    "\n",
    "    print(f\"✅ Fixed: {omim_id} → saved to {output_file}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e34e0e1-96c3-492b-ae4b-1b725de062c2",
   "metadata": {},
   "source": [
    "Iterating over all XMLs and parsing them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "21eaff4b-e351-466a-a13f-85b10da15803",
   "metadata": {},
   "outputs": [],
   "source": [
    "# IDs whose XML parsed without repair (avoid shadowing the built-in `id`)\n",
    "good_ids = [omim_id for omim_id in omim_ids if omim_id not in problematic_ids]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "d7075e77-a7ab-42c6-b4c5-91eedd698a05",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "51 7\n"
     ]
    }
   ],
   "source": [
    "print(len(good_ids), len(problematic_ids))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "4fdfe8dc-3826-407a-8472-7a63da9ca53c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parse the well-formed XMLs from Omim/ and the repaired ones from Omim_fixed/\n",
    "for folder, ids in ((\"Omim\", good_ids), (\"Omim_fixed\", problematic_ids)):\n",
    "    for omim_id in ids:\n",
    "        with open(f'Omim/{omim_id}_parsed.txt', \"w\") as f:\n",
    "            try:\n",
    "                extract_linked_ids(f'{folder}/{omim_id}.xml', omim_id, f)\n",
    "            except Exception as err:\n",
    "                # Report which file failed and why, then stop this group\n",
    "                print(f\"{omim_id}: {err}\")\n",
    "                break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48c5706f-14b9-4903-bf51-1df39bf700ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "while read -r p; do\n",
    "  if test -f \"Omim/${p}_parsed.txt\"; then\n",
    "    echo \"$p exists.\"\n",
    "  else\n",
    "    echo \"$p does not exist.\"\n",
    "  fi\n",
    "done < Omim/OmimVar_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "9c028c9b-6d83-4718-9f69-cfb3ec9b85de",
   "metadata": {},
   "outputs": [],
   "source": [
    "cat Omim/*_parsed.txt > Omim_parsed.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "d0634b17-0831-4db6-ae86-305c97735e8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop cross-references from sources that are not needed (one pass instead of six).\n",
    "# Note: `sed -i ''` is the BSD/macOS form; GNU sed takes `-i` without the empty argument.\n",
    "sed -i '' -E '/^(ClinGen|UniProtKB|ClinVar|dbVar|Genetic|LOVD)/d' Omim_parsed.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "2101ed00-feb4-4004-981a-e3c61976d339",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Parsed file into Omim_parsed.tsv\n"
     ]
    }
   ],
   "source": [
    "#!/bin/bash\n",
    "\n",
    "input_file=\"Omim_parsed.txt\"       # Your input file\n",
    "output_file=\"Omim_parsed.tsv\"     # Output TSV file\n",
    "\n",
    "# Write header\n",
    "echo -e \"omim_id\\tdbsnp_id\" > \"$output_file\"\n",
    "\n",
    "# Initialize variables\n",
    "omim_id=\"\"\n",
    "dbsnp_id=\"\"\n",
    "\n",
    "# Read the file line-by-line\n",
    "while IFS= read -r line || [ -n \"$line\" ]; do\n",
    "    # If it's an OMIM line\n",
    "    if [[ $line == OMIM\\ ID\\ found:* ]]; then\n",
    "        # Write out the previous record first (its dbSNP field stays empty if none was found)\n",
    "        if [[ -n $omim_id ]]; then\n",
    "            echo -e \"${omim_id}\\t${dbsnp_id}\" >> \"$output_file\"\n",
    "        fi\n",
    "        omim_id=\"${line#OMIM ID found: }\"\n",
    "        dbsnp_id=\"\"  # Reset dbSNP\n",
    "    elif [[ $line == dbSNP:* ]]; then\n",
    "        dbsnp_id=\"${line#dbSNP:}\"\n",
    "    fi\n",
    "done < \"$input_file\"\n",
    "\n",
    "# Write the last record\n",
    "if [[ -n $omim_id ]]; then\n",
    "    echo -e \"${omim_id}\\t${dbsnp_id}\" >> \"$output_file\"\n",
    "fi\n",
    "\n",
    "echo \"✅ Parsed file into $output_file\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a35a9986-62df-4893-93de-21733eb68404",
   "metadata": {},
   "source": [
    "Adding 624 dbSNP IDs to the dbSNP file for retrieval"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6acb6389-7b0e-4dc6-871c-952528646920",
   "metadata": {},
   "source": [
    "### ClinVar"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "29531b7b-33f1-4e15-b0b9-8090c4bde11f",
   "metadata": {},
   "outputs": [],
   "source": [
    "cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "2f98ba8b-3120-416d-a13e-a8f7235992bd",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n",
      "<!DOCTYPE DocumentSummarySet>\n",
      "<DocumentSummarySet status=\"OK\">\n",
      "  <DbBuild>Build250414-1300.1</DbBuild>\n",
      "  <DocumentSummary>\n",
      "    <Id>17584</Id>\n",
      "    <obj_type>single nucleotide variant</obj_type>\n",
      "    <accession>VCV000017584</accession>\n",
      "    <accession_version>VCV000017584.5</accession_version>\n",
      "    <title>NM_001904.4(CTNNB1):c.101G&gt;A (p.Gly34Glu)</title>\n",
      "    <variation_set>\n",
      "      <variation>\n",
      "        <measure_id>32623</measure_id>\n",
      "        <variation_xrefs>\n",
      "          <variation_xref>\n",
      "            <db_source>ClinGen</db_source>\n",
      "            <db_id>CA127277</db_id>\n",
      "          </variation_xref>\n",
      "          <variation_xref>\n",
      "            <db_source>UniProtKB</db_source>\n",
      "            <db_id>P35222#VAR_017620</db_id>\n",
      "          </variation_xref>\n",
      "          <variation_xref>\n",
      "            <db_source>OMIM</db_source>\n",
      "            <db_id>116806.0008</db_id>\n",
      "          </variation_xref>\n",
      "          <variation_xref>\n",
      "            <db_source>dbSNP</db_source>\n",
      "            <db_id>28931589</db_id>\n",
      "          </variation_xref>\n",
      "        </variation_xrefs>\n",
      "        <variation_name>NM_001904.4(CTNNB1):c.101G&gt;A (p.Gly34Glu)</variation_name>\n",
      "        <cdna_change>c.101G&gt;A</cdna_change>\n",
      "        <variation_loc>\n",
      "          <assembly_set>\n",
      "            <status>current</status>\n",
      "            <assembly_name>GRCh38</assembly_name>\n",
      "            <chr>3</chr>\n",
      "            <band>3p22.1</band>\n",
      "            <start>41224613</start>\n",
      "            <stop>41224613</stop>\n",
      "            <display_start>41224613</display_start>\n",
      "            <display_stop>41224613</display_stop>\n",
      "            <assembly_acc_ver>GCF_000001405.38</assembly_acc_ver>\n",
      "          </assembly_set>\n",
      "          <assembly_set>\n",
      "            <status>previous</status>\n",
      "            <assembly_name>GRCh37</assembly_name>\n",
      "            <chr>3</chr>\n",
      "            <band>3p22.1</band>\n",
      "            <start>41266104</start>\n",
      "            <stop>41266104</stop>\n",
      "            <display_start>41266104</display_start>\n",
      "            <display_stop>41266104</display_stop>\n",
      "            <assembly_acc_ver>GCF_000001405.25</assembly_acc_ver>\n",
      "          </assembly_set>\n",
      "        </variation_loc>\n",
      "        <allele_freq_set>\n",
      "          <allele_freq>\n",
      "            <source>Exome Aggregation Consortium (ExAC)</source>\n",
      "            <value>0.00001</value>\n",
      "          </allele_freq>\n",
      "        </allele_freq_set>\n",
      "        <variant_type>single nucleotide variant</variant_type>\n",
      "        <canonical_spdi>NC_000003.12:41224612:G:A</canonical_spdi>\n",
      "      </variation>\n",
      "    </variation_set>\n",
      "    <supporting_submissions>\n",
      "      <scv>\n",
      "        <string>SCV000039437</string>\n",
      "        <string>SCV000599908</string>\n",
      "      </scv>\n",
      "      <rcv>\n",
      "        <string>RCV000019149</string>\n",
      "        <string>RCV000443977</string>\n",
      "      </rcv>\n",
      "    </supporting_submissions>\n",
      "    <germline_classification>\n",
      "      <description>Pathogenic; other</description>\n",
      "      <last_evaluated>2016/05/01 00:00</last_evaluated>\n",
      "      <review_status>no assertion criteria provided</review_status>\n",
      "      <trait_set>\n",
      "        <trait>\n",
      "          <trait_xrefs>\n",
      "            <trait_xref>\n",
      "              <db_source>Orphanet</db_source>\n",
      "              <db_id>616</db_id>\n",
      "            </trait_xref>\n",
      "            <trait_xref>\n",
      "              <db_source>MedGen</db_source>\n",
      "              <db_id>C0025149</db_id>\n",
      "            </trait_xref>\n",
      "            <trait_xref>\n",
      "              <db_source>MeSH</db_source>\n",
      "              <db_id>D008527</db_id>\n",
      "            </trait_xref>\n",
      "            <trait_xref>\n",
      "              <db_source>MONDO</db_source>\n",
      "              <db_id>MONDO:0007959</db_id>\n",
      "            </trait_xref>\n",
      "            <trait_xref>\n",
      "              <db_source>OMIM</db_source>\n",
      "              <db_id>155255</db_id>\n",
      "            </trait_xref>\n",
      "            <trait_xref>\n",
      "              <db_source>Human Phenotype Ontology</db_source>\n",
      "              <db_id>HP:0002885</db_id>\n",
      "            </trait_xref>\n",
      "          </trait_xrefs>\n",
      "          <trait_name>Medulloblastoma</trait_name>\n",
      "        </trait>\n",
      "        <trait>\n",
      "          <trait_xrefs>\n",
      "            <trait_xref>\n",
      "              <db_source>Orphanet</db_source>\n",
      "              <db_id>91414</db_id>\n",
      "            </trait_xref>\n",
      "            <trait_xref>\n",
      "              <db_source>MedGen</db_source>\n",
      "              <db_id>C0206711</db_id>\n",
      "            </trait_xref>\n",
      "            <trait_xref>\n",
      "              <db_source>MeSH</db_source>\n",
      "              <db_id>D018296</db_id>\n",
      "            </trait_xref>\n",
      "            <trait_xref>\n",
      "              <db_source>MONDO</db_source>\n",
      "              <db_id>MONDO:0007564</db_id>\n",
      "            </trait_xref>\n",
      "            <trait_xref>\n",
      "              <db_source>OMIM</db_source>\n",
      "              <db_id>132600</db_id>\n",
      "            </trait_xref>\n",
      "            <trait_xref>\n",
      "              <db_source>Human Phenotype Ontology</db_source>\n",
      "              <db_id>HP:0030434</db_id>\n",
      "            </trait_xref>\n",
      "          </trait_xrefs>\n",
      "          <trait_name>Pilomatrixoma</trait_name>\n",
      "        </trait>\n",
      "      </trait_set>\n",
      "    </germline_classification>\n",
      "    <clinical_impact_classification>\n",
      "      <last_evaluated>1/01/01 00:00</last_evaluated>\n",
      "    </clinical_impact_classification>\n",
      "    <oncogenicity_classification>\n",
      "      <last_evaluated>1/01/01 00:00</last_evaluated>\n",
      "    </oncogenicity_classification>\n",
      "    <gene_sort>CTNNB1</gene_sort>\n",
      "    <chr_sort>03</chr_sort>\n",
      "    <location_sort>00000000000041224613</location_sort>\n",
      "    <genes>\n",
      "      <gene>\n",
      "        <symbol>CTNNB1</symbol>\n",
      "        <GeneID>1499</GeneID>\n",
      "        <strand>+</strand>\n",
      "        <source>submitted</source>\n",
      "      </gene>\n",
      "      <gene>\n",
      "        <symbol>LOC126806658</symbol>\n",
      "        <GeneID>126806658</GeneID>\n",
      "        <strand>+</strand>\n",
      "        <source>submitted</source>\n",
      "      </gene>\n",
      "    </genes>\n",
      "    <molecular_consequence_list>\n",
      "      <string>missense variant</string>\n",
      "    </molecular_consequence_list>\n",
      "    <protein_change>G34E, G27E</protein_change>\n",
      "  </DocumentSummary>\n",
      "</DocumentSummarySet>\n"
     ]
    }
   ],
   "source": [
    "esearch -db clinvar -query 17584 | efetch -format docsum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "04a62ef6-5c7d-4644-bc8f-26148b470dbf",
   "metadata": {},
   "outputs": [],
   "source": [
    "mkdir -p ClinVar  # make sure the output directory exists\n",
    "grep ClinVar parsed_variants.tsv | cut -f3 > ClinVar/ClinVar_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "29542800-1697-481f-a0b7-5236dee9752e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     232 ClinVar/ClinVar_id.txt\n"
     ]
    }
   ],
   "source": [
    "wc -l ClinVar/ClinVar_id.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e37a7a54-6049-4a86-808d-37a3f64721ac",
   "metadata": {},
   "source": [
    "Saved all 232 esearch queries to `ClinVar/clinvar_esearch.sh`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "6c56c9f6-a6f5-487f-b474-a0dcf4b0f763",
   "metadata": {},
   "outputs": [],
   "source": [
    "chmod +x ClinVar/clinvar_esearch.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "78286511-609e-4f02-9717-4d094e9feebb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     232 ClinVar/clinvar_esearch.sh\n"
     ]
    }
   ],
   "source": [
    "wc -l ClinVar/clinvar_esearch.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "321b9cb0-0cc4-4442-bcbc-01e0d7a2f50c",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "./ClinVar/clinvar_esearch.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "d6ae0dde-45c0-4853-b5dc-11ed6d195eec",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "376308 is empty\n",
      "376242 is empty\n",
      "376235 is empty\n",
      "376233 is empty\n",
      "375895 is empty\n",
      "376282 is empty\n",
      "376280 is empty\n",
      "396706 is empty\n",
      "375971 is empty\n",
      "376068 is empty\n",
      "376728 is empty\n",
      "160870 is empty\n",
      "376464 is empty\n",
      "376461 is empty\n",
      "375873 is empty\n",
      "376220 is empty\n",
      "375871 is empty\n",
      "375872 is empty\n",
      "376221 is empty\n",
      "376069 is empty\n"
     ]
    }
   ],
   "source": [
    "while read -r p; do\n",
    "  [ -s \"ClinVar/$p.xml\" ] || echo \"$p is empty\"\n",
    "done < ClinVar/ClinVar_id.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db67cf54-0487-4736-a323-5b0417b50295",
   "metadata": {},
   "source": [
    "The 20 ClinVar records listed above have been deleted, so their XML files are empty and cannot be retrieved."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "61f507ed-7ccf-4a43-aee9-0f70aea28791",
   "metadata": {},
   "outputs": [],
   "source": [
    "esearch -db clinvar -query 177620 | efetch -format docsum > ClinVar/177620.xml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "f3b18da5-4f8f-46be-8a51-f20fb15a40fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove the empty XML files; -f avoids an error if a file is already gone\n",
    "while read -r p; do\n",
    "  [ -s \"ClinVar/$p.xml\" ] || rm -f \"ClinVar/$p.xml\"\n",
    "done < ClinVar/ClinVar_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "fd9671a2-9c21-46be-ab39-37a871ffbebd",
   "metadata": {},
   "outputs": [],
   "source": [
    "sed -i '' '/^376308$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^376242$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^376235$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^376233$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^375895$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^376282$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^376280$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^396706$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^375971$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^376068$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^376728$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^160870$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^376464$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^376461$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^375873$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^376220$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^375871$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^375872$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^376221$/d' ClinVar/ClinVar_id.txt\n",
    "sed -i '' '/^376069$/d' ClinVar/ClinVar_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "88d4c572-be3d-4b9c-a00e-d194f2b46351",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     212 ClinVar/ClinVar_id.txt\n"
     ]
    }
   ],
   "source": [
    "wc -l ClinVar/ClinVar_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "36907abc-374c-4803-bc4f-bac8890246b8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     214\n"
     ]
    }
   ],
   "source": [
    "ls ClinVar | wc -l"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90173878-a854-4453-83a4-f32465f9425a",
   "metadata": {},
   "source": [
    "214 is good and checks out. 214 - 2 = 212 which is how many ids we have"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1371ee83-ef5a-48fa-b43d-f880a744a5ae",
   "metadata": {},
   "outputs": [],
   "source": [
    "cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "d4e6c38d-a397-423a-ac26-02b1240cab25",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "import os\n",
    "\n",
    "# Paths\n",
    "id_file = \"ClinVar/ClinVar_id.txt\"\n",
    "input_folder = \"ClinVar\"\n",
    "output_file = \"ClinVar_parsed_output.tsv\"\n",
    "\n",
    "# Read all IDs from the input file\n",
    "with open(id_file, \"r\") as f:\n",
    "    clinvar_ids = [line.strip() for line in f if line.strip()]\n",
    "\n",
    "# Prepare output file\n",
    "with open(output_file, \"w\") as out:\n",
    "    # Write header\n",
    "    out.write(\"ClinVar_ID\\tseq_id\\tposition\\tref\\talt\\n\")\n",
    "\n",
    "    for cid in clinvar_ids:\n",
    "        xml_path = os.path.join(input_folder, f\"{cid}.xml\")\n",
    "        if not os.path.exists(xml_path):\n",
    "            print(f\"⚠️ File not found: {xml_path}\")\n",
    "            continue\n",
    "\n",
    "        try:\n",
    "            # Parse XML\n",
    "            tree = ET.parse(xml_path)\n",
    "            root = tree.getroot()\n",
    "\n",
    "            # Find all canonical_spdi tags\n",
    "            for spdi in root.iter(\"canonical_spdi\"):\n",
    "                text = spdi.text\n",
    "                if text and \":\" in text:\n",
    "                    parts = text.split(\":\")\n",
    "                    if len(parts) == 4:\n",
    "                        seq_id, pos, ref, alt = parts\n",
    "                        out.write(f\"{cid}\\t{seq_id}\\t{pos}\\t{ref}\\t{alt}\\n\")\n",
    "\n",
    "        except ET.ParseError as e:\n",
    "            print(f\"❌ Parse error in {cid}.xml: {e}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ad902e02-e988-4b69-ac36-435bfd317c9f",
   "metadata": {},
   "outputs": [],
   "source": [
    "parsed = ['16928','16929','183391','183393','183395','8823','420108','9409','376307','220711','376310','376305','376303','77637','376384','233484','127526','182409','376306','182423','17577','17576','17580','17587','17588','17579','17583','17582','376231','17584','376232','17589','17578','376228','177620','16609','45263','16613','16339','16359','16332','16333','16342','16348','16276','16273','16272','375972','16274','15933','15934','15935','15936','801','184937','802','12602','12613','35554','180848','160364','376033','219296','9834','9381','39571','39572','14801','13860','13863','13852','12582','12583','12580','12578','16677','16685','16686','16688','186141','13881','13882','13886','13883','376126','13888','13889','13890','162466','162468','162465','375876','13901','13900','373003','39648','73058','375874','177778','162469','162470','5286','225431','225433','225434','225432','31944','13655','13652','13653','13659','91945','12674','164995','13244','13245','13246','13247','13251','13250','13249','409162','418436','7829','427590','187657','7814','7837','7836','7838','7833','189486','428256','186396','404151','375958','7813','189403','185200','189484','7815','92828','189448','9511','9512','13087','428681','13919','13911','38629','37102','13951','8117','8118','13961','375941','12511','213936','217016','12374','12356','12366','12347','12365','43594','12364','12355','127819','376570','12372','2216','43604','93326','2223','417961','14464','6390','41166','41209','4893','4886','4892','161992','161993','161995']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "ae292770-d8b9-4d11-a20a-b8d336d2aed3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "185"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(parsed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "57ccde64-cc40-4c96-b334-455be7752d3f",
   "metadata": {},
   "outputs": [],
   "source": [
    "remaining = [id for id in clinvar_ids if id not in parsed]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "49d32e6a-58f7-4325-b779-592fdb9addc6",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir ClinVar_remaining"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "28893df1-1863-4442-8d3c-3c944cba9244",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Copied: 268075.xml\n",
      "✅ Copied: 150740.xml\n",
      "✅ Copied: 59680.xml\n",
      "✅ Copied: 59682.xml\n",
      "✅ Copied: 148363.xml\n",
      "✅ Copied: 58696.xml\n",
      "✅ Copied: 57282.xml\n",
      "✅ Copied: 59782.xml\n",
      "✅ Copied: 153718.xml\n",
      "✅ Copied: 148679.xml\n",
      "✅ Copied: 16270.xml\n",
      "✅ Copied: 59715.xml\n",
      "✅ Copied: 394884.xml\n",
      "✅ Copied: 153231.xml\n",
      "✅ Copied: 151754.xml\n",
      "✅ Copied: 149554.xml\n",
      "✅ Copied: 153441.xml\n",
      "✅ Copied: 148269.xml\n",
      "✅ Copied: 57074.xml\n",
      "✅ Copied: 394609.xml\n",
      "✅ Copied: 58030.xml\n",
      "✅ Copied: 58029.xml\n",
      "✅ Copied: 58028.xml\n",
      "✅ Copied: 441904.xml\n",
      "✅ Copied: 146814.xml\n",
      "✅ Copied: 144406.xml\n",
      "✅ Copied: 57042.xml\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import shutil \n",
    "# Paths\n",
    "source_dir = \"ClinVar\"\n",
    "dest_dir = \"ClinVar_remaining\"\n",
    "\n",
    "# Ensure destination folder exists\n",
    "os.makedirs(dest_dir, exist_ok=True)\n",
    "\n",
    "# Iterate and copy files\n",
    "for clinvar_id in remaining:\n",
    "    src = os.path.join(source_dir, f\"{clinvar_id}.xml\")\n",
    "    dst = os.path.join(dest_dir, f\"{clinvar_id}.xml\")\n",
    "\n",
    "    if os.path.exists(src):\n",
    "        shutil.copy(src, dst)\n",
    "        print(f\"✅ Copied: {clinvar_id}.xml\")\n",
    "    else:\n",
    "        print(f\"⚠️ Missing: {clinvar_id}.xml\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "8cd41b62-cb7b-4380-9d79-de1befc1637c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      27\n"
     ]
    }
   ],
   "source": [
    "!ls Clinvar_remaining | wc -l"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "eb58056e-71ad-4602-847f-457366f2b963",
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat Clinvar_remaining/* > Clinvar_remaining/all_remaining_variants.xml"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b779d44-a620-42c8-8f0c-5ab96dcec165",
   "metadata": {},
   "source": [
    "They are all copy number gain variations. Nothing that I can do for this project. So we will stick with our 185 parsed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "29e34b3a-8a05-448a-a22f-6f9a0712f221",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     186 ClinVar_parsed_output.tsv\n"
     ]
    }
   ],
   "source": [
    "!wc -l ClinVar_parsed_output.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb3f65c6-6c01-4aab-ba6b-b46c7834ff1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "rm -r Clinvar_remaining"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db6bf93c-404b-43fd-8699-5cd8de1ae03b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "a0610722-b83d-4d72-b0cf-75135eaa7141",
   "metadata": {},
   "source": [
    "### dbSNP"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b650b97-61c1-4db3-8e0c-9c4df6d4aac6",
   "metadata": {},
   "source": [
    "Have to get the variants from OmimVar"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "470b0539-8f7f-49f0-8f95-e9293e3872d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "751381ad-54e6-4063-a1f2-71c4ffb2e617",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n",
      "<!DOCTYPE DocumentSummarySet>\n",
      "<DocumentSummarySet status=\"OK\">\n",
      "  <DbBuild>Build250306-1408.1</DbBuild>\n",
      "  <DocumentSummary>\n",
      "    <Id>1131690863</Id>\n",
      "    <SNP_ID>1131690863</SNP_ID>\n",
      "    <GLOBAL_SAMPLESIZE>0</GLOBAL_SAMPLESIZE>\n",
      "    <CLINICAL_SIGNIFICANCE>uncertain-significance,pathogenic</CLINICAL_SIGNIFICANCE>\n",
      "    <GENES>\n",
      "      <GENE_E>\n",
      "        <NAME>RB1</NAME>\n",
      "        <GENE_ID>5925</GENE_ID>\n",
      "      </GENE_E>\n",
      "      <GENE_E>\n",
      "        <NAME>LOC112268118</NAME>\n",
      "        <GENE_ID>112268118</GENE_ID>\n",
      "      </GENE_E>\n",
      "    </GENES>\n",
      "    <ACC>NC_000013.11</ACC>\n",
      "    <CHR>13</CHR>\n",
      "    <HANDLE>EVA,CSS-BFX,CLINVAR</HANDLE>\n",
      "    <SPDI>NC_000013.11:48362846:C:A,NC_000013.11:48362846:C:G,NC_000013.11:48362846:C:T</SPDI>\n",
      "    <FXN_CLASS>coding_sequence_variant,stop_gained,500B_downstream_variant,synonymous_variant,missense_variant,downstream_transcript_variant</FXN_CLASS>\n",
      "    <VALIDATED>by-cluster</VALIDATED>\n",
      "    <DOCSUM>HGVS=NC_000013.11:g.48362847C&gt;A,NC_000013.11:g.48362847C&gt;G,NC_000013.11:g.48362847C&gt;T,NC_000013.10:g.48936983C&gt;A,NC_000013.10:g.48936983C&gt;G,NC_000013.10:g.48936983C&gt;T,NG_009009.1:g.64101C&gt;A,NG_009009.1:g.64101C&gt;G,NG_009009.1:g.64101C&gt;T,NM_000321.3:c.751C&gt;A,NM_000321.3:c.751C&gt;G,NM_000321.3:c.751C&gt;T,NM_000321.2:c.751C&gt;A,NM_000321.2:c.751C&gt;G,NM_000321.2:c.751C&gt;T,NM_001407166.1:c.751C&gt;A,NM_001407166.1:c.751C&gt;G,NM_001407166.1:c.751C&gt;T,NM_001407165.1:c.751C&gt;A,NM_001407165.1:c.751C&gt;G,NM_001407165.1:c.751C&gt;T,NP_000312.2:p.Arg251Gly,NP_000312.2:p.Arg251Ter|SEQ=[C/A/G/T]|LEN=1|GENE=RB1:5925,LOC112268118:112268118</DOCSUM>\n",
      "    <TAX_ID>9606</TAX_ID>\n",
      "    <ORIG_BUILD>150</ORIG_BUILD>\n",
      "    <UPD_BUILD>157</UPD_BUILD>\n",
      "    <CREATEDATE>2017/07/17 11:16</CREATEDATE>\n",
      "    <UPDATEDATE>2024/11/03 17:09</UPDATEDATE>\n",
      "    <SS>2137537937,6403986513,8442109874,8936184886</SS>\n",
      "    <ALLELE>N</ALLELE>\n",
      "    <SNP_CLASS>snv</SNP_CLASS>\n",
      "    <CHRPOS>13:48362847</CHRPOS>\n",
      "    <CHRPOS_PREV_ASSM>13:48936983</CHRPOS_PREV_ASSM>\n",
      "    <SNP_ID_SORT>1131690863</SNP_ID_SORT>\n",
      "    <CLINICAL_SORT>1</CLINICAL_SORT>\n",
      "    <CHRPOS_SORT>0048362847</CHRPOS_SORT>\n",
      "    <MERGED_SORT>0</MERGED_SORT>\n",
      "  </DocumentSummary>\n",
      "</DocumentSummarySet>\n"
     ]
    }
   ],
   "source": [
    "esearch -db snp -query rs1131690863 | efetch -format docsum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5cc99d6c-79f3-49fe-8bfa-ddda5821870a",
   "metadata": {},
   "outputs": [],
   "source": [
    "mkdir dbSNP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "0fe72bad-ed7c-4785-a475-2b39cc31974b",
   "metadata": {},
   "outputs": [],
   "source": [
    "grep dbSNP parsed_variants.tsv | cut -f3 > dbSNP/dbSNP_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "6f19e412-c93a-498e-8995-e8ec7b2a2398",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     201 dbSNP/dbSNP_id.txt\n"
     ]
    }
   ],
   "source": [
    "wc -l dbSNP/dbSNP_id.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9f1615b-81eb-48a7-adfd-b6f5a0161e79",
   "metadata": {},
   "source": [
    "Added from Omim and removed repeats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "4bb0fcc0-8f66-4fc2-9b7a-02b321194636",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     761 dbSNP/dbSNP_id.txt\n"
     ]
    }
   ],
   "source": [
    "wc -l dbSNP/dbSNP_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "e2e729f7-4bec-4f40-986c-fd0103186030",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Saved the scripts to download all 761\n",
    "chmod +x dbSNP/dbSNP_search.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "2b271743-e43d-46ef-b689-2f928007cb4f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b(B\u001b[m\u001b[31m\u001b[1m\u001b[7m ERROR: \u001b(B\u001b[m\u001b[31m\u001b[1m Missing -db argument\u001b(B\u001b[m\n",
      "\u001b(B\u001b[m\u001b[31m\u001b[1m\u001b[7m ERROR: \u001b(B\u001b[m\u001b[31m\u001b[1m Missing -db argument\u001b(B\u001b[m\n"
     ]
    }
   ],
   "source": [
    "dbSNP/./dbSNP_search.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "1bf2df42-f562-4a29-85b2-7caea7919d58",
   "metadata": {},
   "outputs": [],
   "source": [
    "while read p; do\n",
    "if ! test -f dbSNP/\"$p\".xml; then\n",
    "    echo \"$p does not exist.\"\n",
    "fi\n",
    "done < dbSNP/dbSNP_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "1b8d5d2d-83b1-4e23-8595-00c9ad01148b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rs121908237 is empty\n",
      "rs137852480 is empty\n",
      "rs13785281 is empty\n"
     ]
    }
   ],
   "source": [
    "while read p; do\n",
    "[ -s dbSNP/\"$p\".xml ] || echo \"$p is empty\"\n",
    "done < dbSNP/dbSNP_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "8a5de403-f6da-402b-a97a-32127db07014",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "esearch -db snp -query rs121908237 | efetch -format docsum > dbSNP/rs121908237.xml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "e3dd144a-4aaa-4d72-84ee-2c43ce48f627",
   "metadata": {},
   "outputs": [],
   "source": [
    "esearch -db snp -query rs137852480 | efetch -format docsum > dbSNP/rs137852480.xml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "93b02ebc-fe12-47a5-aa35-1c51445fcdcf",
   "metadata": {},
   "outputs": [],
   "source": [
    "esearch -db snp -query rs13785281 | efetch -format docsum > dbSNP/rs13785281.xml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "5bc843db-1b32-4ec7-82f6-d1dec6589415",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rs13785281 is empty\n"
     ]
    }
   ],
   "source": [
    "while read p; do\n",
    "[ -s dbSNP/\"$p\".xml ] || echo \"$p is empty\"\n",
    "done < dbSNP/dbSNP_id.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e55ec094-ee54-459d-92b0-2d767c1c428e",
   "metadata": {},
   "source": [
    "rs13785281 is not found and is removed from the id file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "0558f5d9-0627-4187-a451-90111fa2b1d9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     760 dbSNP/dbSNP_id.txt\n"
     ]
    }
   ],
   "source": [
    "wc -l dbSNP/dbSNP_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "1f09c627-0e45-4632-9918-0ee00e34350b",
   "metadata": {},
   "outputs": [],
   "source": [
    "while read -r p; do\n",
    "    if ! grep -q SPDI \"dbSNP/$p.xml\"; then\n",
    "        echo \"$p\"\n",
    "    fi\n",
    "done < dbSNP/dbSNP_id.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1873b4d1-96bd-431a-98ad-b47c361bbefb",
   "metadata": {},
   "source": [
    "No output means every file has SPDI. yay"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8bfd640d-765a-4cb8-a459-27c6ff897572",
   "metadata": {},
   "outputs": [],
   "source": [
    "cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "9a487229-2a65-4437-9c58-639774197373",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "# File paths\n",
    "input_ids_file = \"dbSNP/dbSNP_id.txt\"\n",
    "input_folder = \"dbSNP\"\n",
    "output_file = \"dbSNP_output.tsv\"\n",
    "\n",
    "# Read SNP IDs\n",
    "with open(input_ids_file, \"r\") as f:\n",
    "    dbsnp_ids = [line.strip() for line in f if line.strip()]\n",
    "\n",
    "# Open TSV output file\n",
    "with open(output_file, \"w\") as out:\n",
    "    out.write(\"dbsnp_id\\tsequence_id\\tposition\\tref\\talt\\n\")\n",
    "\n",
    "    for dbsnp_id in dbsnp_ids:\n",
    "        xml_path = os.path.join(input_folder, f\"{dbsnp_id}.xml\")\n",
    "\n",
    "        if not os.path.exists(xml_path):\n",
    "            print(f\"⚠️ Missing: {xml_path}\")\n",
    "            continue\n",
    "\n",
    "        try:\n",
    "            tree = ET.parse(xml_path)\n",
    "            root = tree.getroot()\n",
    "\n",
    "            for spdi in root.iter(\"SPDI\"):\n",
    "                if spdi.text:\n",
    "                    spdi_items = spdi.text.strip().split(\",\")\n",
    "                    for item in spdi_items:\n",
    "                        parts = item.strip().split(\":\")\n",
    "                        if len(parts) == 4:\n",
    "                            seq_id, pos, ref, alt = parts\n",
    "                            out.write(f\"{dbsnp_id}\\t{seq_id}\\t{pos}\\t{ref}\\t{alt}\\n\")\n",
    "                        else:\n",
    "                            print(f\"⚠️ Invalid SPDI format in {dbsnp_id}: {item}\")\n",
    "\n",
    "        except ET.ParseError as e:\n",
    "            print(f\"❌ Parse error in {dbsnp_id}.xml: {e}\")\n",
    "        except Exception as e:\n",
    "            print(f\"❌ Unexpected error in {dbsnp_id}.xml: {e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2cbf7146-7d94-4905-adc0-f7d1e6443074",
   "metadata": {},
   "source": [
    "Removed the duplicate lines and am left with 1408 mutations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "7727dcc2-ed83-4f8d-808e-d8e24643a53a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    1409 dbSNP_output.tsv\n"
     ]
    }
   ],
   "source": [
    "!wc -l dbSNP_output.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9687bc9-97e0-41e9-82a1-5cfb986ae13b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "c4bd83bb-deea-41c0-87dc-d0982b0cc00b",
   "metadata": {},
   "source": [
    "### COSM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "65fb0b92-2e7b-4080-935a-e74b58bf0329",
   "metadata": {},
   "outputs": [],
   "source": [
    "cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "645e17ee-fe7c-4b48-bece-3c17de3dbd9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "mkdir COSM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "d04ca4ce-29ab-446d-8df9-95984f3c403f",
   "metadata": {},
   "outputs": [],
   "source": [
    "grep COSM parsed_variants.tsv | cut -f 3 > COSM/COSM_ids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "12380e56-6f2a-4f8d-9bcf-f97275e6e39b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     202 COSM/COSM_ids.txt\n"
     ]
    }
   ],
   "source": [
    "wc -l COSM/COSM_ids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "3c63290a-2d6d-4cd7-949d-ae3fe77f0136",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1677139\n",
      "1989836\n",
      "12523\n",
      "13800\n",
      "12475\n",
      "12504\n",
      "12506\n",
      "13281\n",
      "12512\n",
      "12476\n"
     ]
    }
   ],
   "source": [
    "head COSM/COSM_ids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "29147778-754b-4ab6-a4ef-49cd5f314503",
   "metadata": {},
   "outputs": [],
   "source": [
    "while read id; do\n",
    "curl --silent \"https://rest.ensembl.org/variation/human/\"$id\"?content-type=application/json\" > COSM/\"$id\".txt\n",
    "done < COSM/COSM_ids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "df9c5163-8841-4279-940a-86d74b69f6a3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     203\n"
     ]
    }
   ],
   "source": [
    "ls COSM/* | wc -l"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcc33132-cb25-49a7-8615-d2dc32278e4d",
   "metadata": {},
   "source": [
    "download the COSM database from here https://cancer.sanger.ac.uk/cosmic/download/cosmic/v101/completetargetedscreensmutanttsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "c95fce22-ab4d-44d7-9321-d10bf1dfb368",
   "metadata": {},
   "outputs": [],
   "source": [
    "sed 's/$/\\t/' COSM/COSM_ids.txt > COSM/COSM_ids_tab.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "65045648-4b11-4264-a11a-3b47d58e0bc6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "grep -F -f COSM/COSM_ids_tab.txt Cosmic_CompleteTargetedScreensMutant_v101_GRCh38.tsv > COSM_matched.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "d2c438e0-9110-4f04-b155-71bf04e399f4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  372391 COSM_matched.tsv\n"
     ]
    }
   ],
   "source": [
    "wc -l COSM_matched.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "b7a458ce-3980-4ba8-a74f-e16f4534a6cc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     160 COSM_matched_id_unique.txt\n"
     ]
    }
   ],
   "source": [
    "cut -f 8 COSM_matched.tsv > COSM_matched_id.txt\n",
    "sort -u COSM_matched_id.txt > COSM_matched_id_unique.txt\n",
    "wc -l COSM_matched_id_unique.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "22e104b8-587c-4b11-bb40-ef05a4fd1899",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "COSM12475\n",
      "COSM12506\n",
      "COSM12512\n",
      "COSM13766\n",
      "COSM13786\n",
      "COSM13675\n",
      "COSM13224\n",
      "COSM13723\n",
      "COSM13474\n",
      "COSM12505\n",
      "COSM785\n",
      "COSM238553\n",
      "COSM5564006\n",
      "COSM5015793\n",
      "COSM1673476\n",
      "COSM6196669\n",
      "COSM878\n",
      "COSM965\n",
      "COSM4766182\n"
     ]
    }
   ],
   "source": [
    "while read -r p; do\n",
    "    if ! grep -q $p COSM_matched_id_unique.txt; then\n",
    "        echo \"$p\"\n",
    "    fi\n",
    "done < COSM/COSM_ids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "22d892ef-d82b-49e7-a870-f909c3b4bce6",
   "metadata": {},
   "outputs": [],
   "source": [
    "echo 'COSM12475\n",
    "COSM12506\n",
    "COSM12512\n",
    "COSM13766\n",
    "COSM13786\n",
    "COSM13675\n",
    "COSM13224\n",
    "COSM13723\n",
    "COSM13474\n",
    "COSM12505\n",
    "COSM785\n",
    "COSM238553\n",
    "COSM5564006\n",
    "COSM5015793\n",
    "COSM1673476\n",
    "COSM6196669\n",
    "COSM878\n",
    "COSM965\n",
    "COSM4766182' > COSM_unmatched_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "de79b3a9-2aae-46c1-b5e9-3d5e6c6ea7ff",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      19 COSM_unmatched_id.txt\n"
     ]
    }
   ],
   "source": [
    "wc -l COSM_unmatched_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "46f5194d-d9db-4896-8d9b-5145e77b95ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "sed 's/$/\\t/' COSM_unmatched_id.txt > COSM_unmatched_tab_id.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "c1980527-71c4-460f-8798-7b75da61dab4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "grep -F -f COSM_unmatched_tab_id.txt Cosmic_CompleteTargetedScreensMutant_v101_GRCh37.tsv > COSM_unmatched.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "id": "e30fa788-2015-4eb7-bf66-5b37a50531f3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     207 COSM_unmatched.tsv\n"
     ]
    }
   ],
   "source": [
    "wc -l COSM_unmatched.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "e942fe15-a77f-49a1-be18-7f11f8b97cfd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       5 COSM_unmatched_id_unique.txt\n"
     ]
    }
   ],
   "source": [
    "cut -f 8 COSM_unmatched.tsv > COSM_unmatched_id_parsed.txt\n",
    "sort -u COSM_unmatched_id_parsed.txt > COSM_unmatched_id_unique.txt\n",
    "wc -l COSM_unmatched_id_unique.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee9c11d0-0ea2-4c3d-b25b-12cffff1d877",
   "metadata": {},
   "source": [
    "**Removing the COSM unmatched IDs from the text file**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "id": "5986b990-3b24-40ac-a8c9-4f2eca0c5203",
   "metadata": {},
   "outputs": [],
   "source": [
    "rm COSM/COSM_total_parsed.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "id": "c80b60e7-34e2-46bb-b6a4-7ad4732f7737",
   "metadata": {},
   "outputs": [],
   "source": [
    "cat COSM/COSM_matched.tsv >> COSM/COSM_total_parsed.tsv\n",
    "cat COSM/COSM_unmatched.tsv >> COSM/COSM_total_parsed.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "id": "6daae3a8-092c-4652-ac4c-d78ed3c0bdea",
   "metadata": {},
   "outputs": [],
   "source": [
    "cp COSM/COSM_ids.txt COSM/COSM_ids_final.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "id": "9f91b602-8beb-4556-9605-893d92117faa",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "COSM12475\n",
      "COSM12506\n",
      "COSM12512\n",
      "COSM13766\n",
      "COSM13675\n",
      "COSM13224\n",
      "COSM13474\n",
      "COSM238553\n",
      "COSM5564006\n",
      "COSM5015793\n",
      "COSM1673476\n",
      "COSM6196669\n",
      "COSM5159\n",
      "COSM5313\n",
      "COSM5154\n",
      "COSM5105\n",
      "COSM5204\n",
      "COSM5141\n",
      "COSM5283\n",
      "COSM5079\n",
      "COSM5046\n",
      "COSM86063\n",
      "COSM5142\n",
      "COSM5322\n",
      "COSM23625\n",
      "COSM3736941\n",
      "COSM5052\n",
      "COSM1167954\n",
      "COSM5143\n",
      "COSM5119\n",
      "COSM5148\n",
      "COSM861\n",
      "COSM878\n",
      "COSM859\n",
      "COSM860\n",
      "COSM862\n",
      "COSM864\n",
      "COSM965\n",
      "COSM1237919\n",
      "COSM13152\n",
      "COSM33076\n",
      "COSM17983\n",
      "COSM25676\n",
      "COSM17855\n",
      "COSM142849\n",
      "COSM4387483\n",
      "COSM4766182\n"
     ]
    }
   ],
   "source": [
    "while read -r p; do\n",
    "    if ! grep -q $p COSM/COSM_total_parsed.tsv; then\n",
    "        echo $p\n",
    "        sed -i '' '/'$p'/d' COSM/COSM_ids_final.txt\n",
    "    fi\n",
    "done < COSM/COSM_ids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "id": "98b263d1-0b58-4389-b043-accaf7b300db",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     132 COSM/COSM_ids_final.txt\n"
     ]
    }
   ],
   "source": [
    "wc -l COSM/COSM_ids_final.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "id": "d92b5492-3048-46e1-a32a-6a7c0137aa19",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     166 COSM/COSM_total_parsed_id_unique.txt\n"
     ]
    }
   ],
   "source": [
    "cut -f 8 COSM/COSM_total_parsed.tsv > COSM/COSM_total_parsed_id.txt\n",
    "sort -u COSM/COSM_total_parsed_id.txt > COSM/COSM_total_parsed_id_unique.txt\n",
    "wc -l COSM/COSM_total_parsed_id_unique.txt\n",
    "\n",
    "rm COSM/COSM_total_parsed_id.txt\n",
    "rm COSM/COSM_total_parsed_id_unique.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e6a0d3b-bc90-4164-8363-98f0eded180e",
   "metadata": {},
   "source": [
    "### Parsing the Matched TSV File"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8febdfca-ae58-4874-a2ef-6446deb91273",
   "metadata": {},
   "source": [
    "Got it into excel and deleting columns that don't matter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "id": "e6a403e1-75be-48b6-b4f0-4d454cba047e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    1140 COSM/COSM_total_parsed.tsv\n"
     ]
    }
   ],
   "source": [
    "wc -l COSM/COSM_total_parsed.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6bca1dde-ae4d-461f-b340-32807c35f8b3",
   "metadata": {},
   "source": [
    "### COSF"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5ec170c-a7f4-4216-b6ef-f7836232481f",
   "metadata": {},
   "source": [
    "Download the COSMIC fusion (COSF) data from https://cancer.sanger.ac.uk/cosmic/download/cosmic/v101/fusion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "id": "0b12424c-c36c-4c52-b395-e93edec5d983",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "grep COSF parsed_variants.tsv | cut -f 3 > COSF/cosf_ids_temp.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "id": "7b480d0f-3d6d-4ff5-98bb-1a9f4e570c4f",
   "metadata": {},
   "outputs": [],
   "source": [
    "sort -u COSF/cosf_ids_temp.txt > COSF/cosf_ids_temp_uniq.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "id": "09e3e382-150b-40da-93ab-71d6aff06fdf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove any earlier ID list; -f avoids an error if the file does not exist yet\n",
    "rm -f COSF/cosf_ids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "id": "c0b89452-303c-4ec3-8f77-88d9576d8173",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prefix each ID with COSF, one per line\n",
    "while read -r p; do\n",
    "    echo \"COSF$p\" >> COSF/cosf_ids.txt\n",
    "done < COSF/cosf_ids_temp_uniq.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "id": "7778fc82-53fe-4ebd-a234-47434cc6bb3f",
   "metadata": {},
   "outputs": [],
   "source": [
    "rm COSF/cosf_ids_temp.txt\n",
    "rm COSF/cosf_ids_temp_uniq.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "id": "e8241f09-30a9-43e8-a4c3-0d1a340aad2f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "COSF121\n",
      "COSF1216\n",
      "COSF1220\n",
      "COSF1224\n",
      "COSF1231\n",
      "COSF125\n",
      "COSF1271\n",
      "COSF128\n",
      "COSF1319\n",
      "COSF1320\n"
     ]
    }
   ],
   "source": [
    "head COSF/cosf_ids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "id": "ab74bdbe-65f6-4596-a425-54868ed6859c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      65 COSF/cosf_ids.txt\n"
     ]
    }
   ],
   "source": [
    "wc -l COSF/cosf_ids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f0fc6fc-41d2-4a64-b214-91d3359f0db0",
   "metadata": {},
   "outputs": [],
   "source": [
    "cat Cosmic_Fusion_v101_GRCh38.tsv >> COSF/Cosmic_Fusion.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "id": "01d86304-9e9a-4e83-a5cc-e8f221c8c36c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 18M\tCOSF/Cosmic_Fusion.tsv\n"
     ]
    }
   ],
   "source": [
    "du -h COSF/Cosmic_Fusion.tsv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "id": "e848ea6f-cec3-482d-8ff4-923dbbf6ce3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "sed 's/$/\\t/' COSF/cosf_ids.txt > COSF/cosf_ids_tab.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "id": "d2d8a7b5-6c29-459c-8ff4-94abed7250e0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Extracted COSF entries saved to: COSF/kegg_data_cosf.tsv\n"
     ]
    }
   ],
   "source": [
    "#!/bin/bash\n",
    "\n",
    "# Paths (edit these as needed)\n",
    "COSF_ID_FILE=\"COSF/cosf_ids_tab.txt\"\n",
    "COSMIC_TSV=\"COSF/Cosmic_Fusion.tsv\"\n",
    "OUTPUT_TSV=\"COSF/kegg_data_cosf.tsv\"\n",
    "\n",
    "# Header based on README\n",
    "HEADER=\"COSMIC_SAMPLE_ID\\tSAMPLE_NAME\\tCOSMIC_PHENOTYPE_ID\\tCOSMIC_FUSION_ID\\tFUSION_SYNTAX\\tFIVE_PRIME_CHROMOSOME\\tFIVE_PRIME_STRAND\\tFIVE_PRIME_TRANSCRIPT_ID\\tFIVE_PRIME_GENE_SYMBOL\\tFIVE_PRIME_LAST_OBSERVE_EXON\\tFIVE_PRIME_GENOME_START_FROM\\tFIVE_PRIME_GENOME_START_TO\\tFIVE_PRIME_GENOME_STOP_FROM\\tFIVE_PRIME_GENOME_STOP_TO\\tTHREE_PRIME_CHROMOSOME\\tTHREE_PRIME_STRAND\\tTHREE_PRIME_TRANSCRIPT_ID\\tTHREE_PRIME_GENE_SYMBOL\\tTHREE_PRIME_FIRST_OBSERVE_EXON\\tTHREE_PRIME_GENOME_START_FROM\\tTHREE_PRIME_GENOME_START_TO\\tTHREE_PRIME_GENOME_STOP_FROM\\tTHREE_PRIME_GENOME_STOP_TO\\tFUSION_TYPE\\tPUBMED_PMID\"\n",
    "\n",
    "# Write header to output\n",
    "echo -e \"$HEADER\" > \"$OUTPUT_TSV\"\n",
    "\n",
    "# Append every line that contains one of the tab-terminated COSF IDs\n",
    "grep -F -f \"$COSF_ID_FILE\" \"$COSMIC_TSV\" >> \"$OUTPUT_TSV\"\n",
    "\n",
    "echo \"✅ Extracted COSF entries saved to: $OUTPUT_TSV\""
   ]
  },
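  {
   "cell_type": "markdown",
   "id": "added-cosf-column-filter-note",
   "metadata": {},
   "source": [
    "The `grep -F -f` step above matches the ID strings anywhere in a line. As a rough alternative, the sketch below filters on the `COSMIC_FUSION_ID` column only. It runs on a small in-memory table instead of the real COSMIC file; the helper name `filter_fusions`, the toy rows, and the `FUSION_TYPE` values are assumptions made for illustration and are not part of the original pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "added-cosf-column-filter-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "def filter_fusions(fusions, wanted_ids):\n",
    "    # Keep rows whose COSMIC_FUSION_ID is exactly one of wanted_ids\n",
    "    wanted = set(wanted_ids)\n",
    "    mask = fusions['COSMIC_FUSION_ID'].isin(wanted)\n",
    "    return fusions[mask].reset_index(drop=True)\n",
    "\n",
    "\n",
    "# Toy stand-in for COSF/Cosmic_Fusion.tsv (hypothetical rows)\n",
    "toy_fusions = pd.DataFrame({\n",
    "    'COSMIC_FUSION_ID': ['COSF121', 'COSF1216', 'COSF9999'],\n",
    "    'FUSION_TYPE': ['toy_a', 'toy_b', 'toy_c'],\n",
    "})\n",
    "\n",
    "# Only the exact ID is kept; COSF1216 is not treated as a match for COSF121\n",
    "toy_hits = filter_fusions(toy_fusions, ['COSF121'])\n",
    "print(toy_hits)"
   ]
  },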
  {
   "cell_type": "code",
   "execution_count": 112,
   "id": "a1175ebc-4bcb-499d-bd07-8b0b77df9954",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      29 COSF/kegg_data_cosf_parsed_uniq.txt\n"
     ]
    }
   ],
   "source": [
    "cut -f 4 COSF/kegg_data_cosf.tsv > COSF/kegg_data_cosf_parsed.txt\n",
    "sort -u COSF/kegg_data_cosf_parsed.txt > COSF/kegg_data_cosf_parsed_uniq.txt\n",
    "wc -l COSF/kegg_data_cosf_parsed_uniq.txt\n",
    "\n",
    "rm COSF/kegg_data_cosf_parsed.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "id": "60e6f9d9-b4d9-4a90-9bfd-8523a26f2d85",
   "metadata": {},
   "outputs": [],
   "source": [
    "cp COSF/cosf_ids.txt COSF/cosf_ids_final.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "id": "0673d326-99a6-4957-82e4-2151b4a5f2aa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "COSF1220\n",
      "COSF1224\n",
      "COSF125\n",
      "COSF128\n",
      "COSF1330\n",
      "COSF1490\n",
      "COSF154\n",
      "COSF155\n",
      "COSF166\n",
      "COSF168\n",
      "COSF1756\n",
      "COSF1758\n",
      "COSF1805\n",
      "COSF187\n",
      "COSF189\n",
      "COSF1949\n",
      "COSF1960\n",
      "COSF2067\n",
      "COSF2124\n",
      "COSF218\n",
      "COSF220\n",
      "COSF2246\n",
      "COSF2248\n",
      "COSF248\n",
      "COSF300\n",
      "COSF302\n",
      "COSF355\n",
      "COSF356\n",
      "COSF394\n",
      "COSF396\n",
      "COSF463\n",
      "COSF501\n",
      "COSF504\n",
      "COSF528\n",
      "COSF806\n",
      "COSF808\n"
     ]
    }
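   ],
   "source": [
    "while read -r p; do\n",
    "    # -F: literal string, -x: whole-line match, so COSF121 does not match COSF1216\n",
    "    if ! grep -qFx -- \"$p\" COSF/kegg_data_cosf_parsed_uniq.txt; then\n",
    "        echo \"$p\"\n",
    "        # Delete only the line that is exactly this ID (BSD/macOS sed syntax)\n",
    "        sed -i '' '/^'\"$p\"'$/d' COSF/cosf_ids_final.txt\n",
    "    fi\n",
    "done < COSF/cosf_ids.txt"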
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "id": "e1b77653-09e4-40e6-8ac3-2cf3be1442cf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      29 COSF/cosf_ids_final.txt\n"
     ]
    }
   ],
   "source": [
    "wc -l COSF/cosf_ids_final.txt"
   ]
  },
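  {
   "cell_type": "markdown",
   "id": "added-id-split-note",
   "metadata": {},
   "source": [
    "The filtering loops above decide, ID by ID, whether an identifier was found in the COSMIC data. Whole-ID comparison matters here because one ID can be a prefix of another (for example `COSF121` and `COSF1216` in the list above). A minimal sketch of the same split using set membership follows; the function name `split_found_missing` and the toy lists are hypothetical and only illustrate the idea."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "added-id-split-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "def split_found_missing(all_ids, found_ids):\n",
    "    # Compare whole IDs through a set, so 'COSF121' never matches 'COSF1216'\n",
    "    found = set(found_ids)\n",
    "    kept = [i for i in all_ids if i in found]\n",
    "    missing = [i for i in all_ids if i not in found]\n",
    "    return kept, missing\n",
    "\n",
    "\n",
    "# Toy lists (hypothetical): toy_found plays the role of the IDs found in COSMIC\n",
    "toy_ids = ['COSF121', 'COSF1216', 'COSF125']\n",
    "toy_found = ['COSF1216', 'COSF1250']\n",
    "\n",
    "kept, missing = split_found_missing(toy_ids, toy_found)\n",
    "print('kept:', kept)\n",
    "print('missing:', missing)"
   ]
  },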
  {
   "cell_type": "markdown",
   "id": "670cec3f-237f-4789-a725-7d4d5a366815",
   "metadata": {},
   "source": [
    "The COSMIC fusion data gives no reliable way to recover the exact nucleotide sequence of each fusion, so the COSF variants are left out of the following steps."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e18961b9-735c-47c2-bd68-0b6184c05375",
   "metadata": {},
   "source": [
    "# Matching Variants and Nucleotide Sequences to Each Network/Pathway"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be8b5cfb-3309-4abf-a7ae-250efae122a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "429d5f3b-b9e2-4ac4-9992-e6755a578bf6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "c8cee289-0c1c-4091-81d8-37121fcd8644",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ENTRY</th>\n",
       "      <th>Source</th>\n",
       "      <th>ID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10133v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>10133</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1019v1</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>268075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1019v1</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>150740</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1019v1</td>\n",
       "      <td>dbVar</td>\n",
       "      <td>nsv917029</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16928</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>783</th>\n",
       "      <td>9817v1</td>\n",
       "      <td>COSM</td>\n",
       "      <td>6196638</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>784</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>785</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1379150</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>786</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766211</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>787</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>788 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       ENTRY   Source         ID\n",
       "0    10133v1  OmimVar      10133\n",
       "1     1019v1  ClinVar     268075\n",
       "2     1019v1  ClinVar     150740\n",
       "3     1019v1    dbVar  nsv917029\n",
       "4     1019v2  ClinVar      16928\n",
       "..       ...      ...        ...\n",
       "783   9817v1     COSM    6196638\n",
       "784    999v2     COSM    4766182\n",
       "785    999v2     COSM    1379150\n",
       "786    999v2     COSM    4766211\n",
       "787    999v2     COSM    4766271\n",
       "\n",
       "[788 rows x 3 columns]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "parsed_variants = pd.read_csv(\"parsed_variants.tsv\", sep='\\t')\n",
    "parsed_variants"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f0329361-f9df-471f-8ff0-a7265ada0ad2",
   "metadata": {},
   "source": [
    "### ClinVar"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "7003f600-8b18-4f50-b041-3dd71af940e2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>seq_id</th>\n",
       "      <th>position</th>\n",
       "      <th>ref</th>\n",
       "      <th>alt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>16928</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>16929</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>183391</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12717896</td>\n",
       "      <td>CAGGCGGAGCACCCCAAGCC</td>\n",
       "      <td>CAGGCGGAGCACCCCAAGCCAGGCGGAGCACCCCAAGCC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>183393</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12718044</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>183395</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12718210</td>\n",
       "      <td>CTCT</td>\n",
       "      <td>CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>180</th>\n",
       "      <td>4886</td>\n",
       "      <td>NC_000011.10</td>\n",
       "      <td>67483197</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>181</th>\n",
       "      <td>4892</td>\n",
       "      <td>NC_000011.10</td>\n",
       "      <td>67490803</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>182</th>\n",
       "      <td>161992</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490442</td>\n",
       "      <td>T</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>183</th>\n",
       "      <td>161993</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490443</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>184</th>\n",
       "      <td>161995</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>185 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         ID        seq_id  position                   ref  \\\n",
       "0     16928  NC_000012.12  57751647                     G   \n",
       "1     16929  NC_000012.12  57751646                     C   \n",
       "2    183391  NC_000012.12  12717896  CAGGCGGAGCACCCCAAGCC   \n",
       "3    183393  NC_000012.12  12718044                     C   \n",
       "4    183395  NC_000012.12  12718210                  CTCT   \n",
       "..      ...           ...       ...                   ...   \n",
       "180    4886  NC_000011.10  67483197                     C   \n",
       "181    4892  NC_000011.10  67490803                     C   \n",
       "182  161992  NC_000015.10  50490442                     T   \n",
       "183  161993  NC_000015.10  50490443                     C   \n",
       "184  161995  NC_000015.10  50490449                     C   \n",
       "\n",
       "                                         alt  \n",
       "0                                          A  \n",
       "1                                          T  \n",
       "2    CAGGCGGAGCACCCCAAGCCAGGCGGAGCACCCCAAGCC  \n",
       "3                                          T  \n",
       "4                                         CT  \n",
       "..                                       ...  \n",
       "180                                        T  \n",
       "181                                        A  \n",
       "182                                        C  \n",
       "183                                        G  \n",
       "184                                        G  \n",
       "\n",
       "[185 rows x 5 columns]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clinvar_data = pd.read_csv(\"ClinVar_parsed_output.tsv\",sep='\\t')\n",
    "clinvar_data = clinvar_data.rename(columns={\"ClinVar_ID\": \"ID\"})\n",
    "clinvar_data['ID'] = clinvar_data['ID'].astype('string')\n",
    "clinvar_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "f0319623-fcb7-4d14-a4ec-cbb9f117af50",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of missing ClinVar variant is 49\n"
     ]
    }
   ],
   "source": [
    "# Compare IDs as strings to avoid a dtype mismatch\n",
    "clinvar_ids = set(clinvar_data[\"ID\"].astype(str))\n",
    "\n",
    "# Count ClinVar rows in parsed_variants whose ID is absent from the ClinVar table\n",
    "is_clinvar = parsed_variants[\"Source\"] == \"ClinVar\"\n",
    "is_known = parsed_variants[\"ID\"].astype(str).isin(clinvar_ids)\n",
    "missing_num = int((is_clinvar & ~is_known).sum())\n",
    "print(f'Number of missing ClinVar variant is {missing_num}')"
   ]
  },
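  {
   "cell_type": "markdown",
   "id": "added-missing-ids-note",
   "metadata": {},
   "source": [
    "The count above says how many ClinVar IDs are missing, but not which ones. One way to list them, sketched here on toy data, is a left merge with `indicator=True`, which adds a `_merge` column marking rows that were found only on the left side. The frames `toy_variants` and `toy_clinvar` and all of their IDs are made up for illustration; they are not taken from the real tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "added-missing-ids-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy stand-ins for parsed_variants and clinvar_data (all IDs are made up)\n",
    "toy_variants = pd.DataFrame({\n",
    "    'ENTRY': ['e1', 'e2', 'e2'],\n",
    "    'Source': ['ClinVar', 'ClinVar', 'dbSNP'],\n",
    "    'ID': ['1001', '1002', 'rs1'],\n",
    "})\n",
    "toy_clinvar = pd.DataFrame({'ID': ['1002'], 'seq_id': ['chr_toy']})\n",
    "\n",
    "# Left-merge the ClinVar rows against the known IDs; _merge records where each row was found\n",
    "clinvar_rows = toy_variants[toy_variants['Source'] == 'ClinVar']\n",
    "known_ids = toy_clinvar[['ID']].drop_duplicates()\n",
    "checked = clinvar_rows.merge(known_ids, on='ID', how='left', indicator=True)\n",
    "\n",
    "# Rows marked left_only have no match in the ClinVar table\n",
    "toy_missing = checked.loc[checked['_merge'] == 'left_only', ['ENTRY', 'ID']]\n",
    "print(toy_missing)"
   ]
  },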
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "0d757a26-2a7d-41a8-9914-4bcbd8f166f3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ENTRY</th>\n",
       "      <th>Source</th>\n",
       "      <th>ID</th>\n",
       "      <th>seq_id</th>\n",
       "      <th>position</th>\n",
       "      <th>ref</th>\n",
       "      <th>alt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16928</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16929</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1027v3</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>183391</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12717896</td>\n",
       "      <td>CAGGCGGAGCACCCCAAGCC</td>\n",
       "      <td>CAGGCGGAGCACCCCAAGCCAGGCGGAGCACCCCAAGCC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1027v3</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>183393</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12718044</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1027v3</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>183395</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12718210</td>\n",
       "      <td>CTCT</td>\n",
       "      <td>CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>180</th>\n",
       "      <td>9049v1</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>4886</td>\n",
       "      <td>NC_000011.10</td>\n",
       "      <td>67483197</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>181</th>\n",
       "      <td>9049v1</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>4892</td>\n",
       "      <td>NC_000011.10</td>\n",
       "      <td>67490803</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>182</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>161992</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490442</td>\n",
       "      <td>T</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>183</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>161993</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490443</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>184</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>161995</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>185 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      ENTRY   Source      ID        seq_id  position                   ref  \\\n",
       "0    1019v2  ClinVar   16928  NC_000012.12  57751647                     G   \n",
       "1    1019v2  ClinVar   16929  NC_000012.12  57751646                     C   \n",
       "2    1027v3  ClinVar  183391  NC_000012.12  12717896  CAGGCGGAGCACCCCAAGCC   \n",
       "3    1027v3  ClinVar  183393  NC_000012.12  12718044                     C   \n",
       "4    1027v3  ClinVar  183395  NC_000012.12  12718210                  CTCT   \n",
       "..      ...      ...     ...           ...       ...                   ...   \n",
       "180  9049v1  ClinVar    4886  NC_000011.10  67483197                     C   \n",
       "181  9049v1  ClinVar    4892  NC_000011.10  67490803                     C   \n",
       "182  9101v1  ClinVar  161992  NC_000015.10  50490442                     T   \n",
       "183  9101v1  ClinVar  161993  NC_000015.10  50490443                     C   \n",
       "184  9101v1  ClinVar  161995  NC_000015.10  50490449                     C   \n",
       "\n",
       "                                         alt  \n",
       "0                                          A  \n",
       "1                                          T  \n",
       "2    CAGGCGGAGCACCCCAAGCCAGGCGGAGCACCCCAAGCC  \n",
       "3                                          T  \n",
       "4                                         CT  \n",
       "..                                       ...  \n",
       "180                                        T  \n",
       "181                                        A  \n",
       "182                                        C  \n",
       "183                                        G  \n",
       "184                                        G  \n",
       "\n",
       "[185 rows x 7 columns]"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clinvar_final = parsed_variants.merge(clinvar_data, on='ID')\n",
    "clinvar_final"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0942ebc8-dfae-4f96-9762-2b83c01b5e29",
   "metadata": {},
   "source": [
    "### dbSNP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "33f3303c-51a4-46d2-9941-8331aee362c6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>dbsnp_id</th>\n",
       "      <th>seq_id</th>\n",
       "      <th>position</th>\n",
       "      <th>ref</th>\n",
       "      <th>alt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>104311</td>\n",
       "      <td>rs661</td>\n",
       "      <td>NC_000014.9</td>\n",
       "      <td>73217224</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>104311</td>\n",
       "      <td>rs661</td>\n",
       "      <td>NC_000014.9</td>\n",
       "      <td>73217224</td>\n",
       "      <td>G</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>606463</td>\n",
       "      <td>rs364897</td>\n",
       "      <td>NC_000001.11</td>\n",
       "      <td>155238214</td>\n",
       "      <td>T</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>606463</td>\n",
       "      <td>rs364897</td>\n",
       "      <td>NC_000001.11</td>\n",
       "      <td>155238214</td>\n",
       "      <td>T</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>606463</td>\n",
       "      <td>rs368060</td>\n",
       "      <td>NC_000001.11</td>\n",
       "      <td>155235216</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1403</th>\n",
       "      <td>rs672601307</td>\n",
       "      <td>rs672601307</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490442</td>\n",
       "      <td>T</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1404</th>\n",
       "      <td>rs672601308</td>\n",
       "      <td>rs672601308</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490443</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1405</th>\n",
       "      <td>rs672601308</td>\n",
       "      <td>rs672601308</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490443</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1406</th>\n",
       "      <td>rs672601311</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1407</th>\n",
       "      <td>rs672601311</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1408 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "               ID     dbsnp_id        seq_id   position ref alt\n",
       "0          104311        rs661   NC_000014.9   73217224   G   A\n",
       "1          104311        rs661   NC_000014.9   73217224   G   T\n",
       "2          606463     rs364897  NC_000001.11  155238214   T   A\n",
       "3          606463     rs364897  NC_000001.11  155238214   T   C\n",
       "4          606463     rs368060  NC_000001.11  155235216   C   G\n",
       "...           ...          ...           ...        ...  ..  ..\n",
       "1403  rs672601307  rs672601307  NC_000015.10   50490442   T   C\n",
       "1404  rs672601308  rs672601308  NC_000015.10   50490443   C   G\n",
       "1405  rs672601308  rs672601308  NC_000015.10   50490443   C   T\n",
       "1406  rs672601311  rs672601311  NC_000015.10   50490449   C   G\n",
       "1407  rs672601311  rs672601311  NC_000015.10   50490449   C   T\n",
       "\n",
       "[1408 rows x 6 columns]"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dbsnp_data = pd.read_csv(\"dbSNP_output.tsv\",sep='\\t')\n",
    "dbsnp_data = dbsnp_data.rename(columns={\"True Id\": \"ID\",\"sequence_id\":'seq_id'})\n",
    "dbsnp_data['ID'] = dbsnp_data['ID'].astype('string')\n",
    "dbsnp_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "0223dcf2-3eba-43a1-847a-709f17b069a1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of missing dbSNP and OmimVar variant is 244\n"
     ]
    }
   ],
   "source": [
    "# Ensure ClinVar_ID is treated as string to avoid dtype mismatch\n",
    "dbsnp_data_ids = dbsnp_data[\"ID\"].astype(str).unique()\n",
    "\n",
    "missing_num = 0\n",
    "\n",
    "# Iterate and print missing ClinVar IDs\n",
    "for _, row in parsed_variants.iterrows():\n",
    "    if (row[\"Source\"] == \"dbSNP\" or row[\"Source\"] == \"OmimVar\") and str(row[\"ID\"]) not in clinvar_ids:\n",
    "        missing_num+=1\n",
    "print(f'Number of missing dbSNP and OmimVar variant is {missing_num}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "b146e274-ef72-42d0-ba06-c859786bda89",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ENTRY</th>\n",
       "      <th>Source</th>\n",
       "      <th>ID</th>\n",
       "      <th>dbsnp_id</th>\n",
       "      <th>seq_id</th>\n",
       "      <th>position</th>\n",
       "      <th>ref</th>\n",
       "      <th>alt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs11547328</td>\n",
       "      <td>rs11547328</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs11547328</td>\n",
       "      <td>rs11547328</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs11547328</td>\n",
       "      <td>rs11547328</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs104894340</td>\n",
       "      <td>rs104894340</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs104894340</td>\n",
       "      <td>rs104894340</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1417</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1418</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1419</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>rs74315431</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>58418317</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1420</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>rs281875284</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>58418288</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1421</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>rs281875284</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>58418288</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1422 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       ENTRY   Source           ID     dbsnp_id        seq_id  position ref  \\\n",
       "0     1019v2    dbSNP   rs11547328   rs11547328  NC_000012.12  57751647   G   \n",
       "1     1019v2    dbSNP   rs11547328   rs11547328  NC_000012.12  57751647   G   \n",
       "2     1019v2    dbSNP   rs11547328   rs11547328  NC_000012.12  57751647   G   \n",
       "3     1019v2    dbSNP  rs104894340  rs104894340  NC_000012.12  57751646   C   \n",
       "4     1019v2    dbSNP  rs104894340  rs104894340  NC_000012.12  57751646   C   \n",
       "...      ...      ...          ...          ...           ...       ...  ..   \n",
       "1417  9101v1    dbSNP  rs672601311  rs672601311  NC_000015.10  50490449   C   \n",
       "1418  9101v1    dbSNP  rs672601311  rs672601311  NC_000015.10  50490449   C   \n",
       "1419  9217v1  OmimVar       605704   rs74315431  NC_000020.11  58418317   C   \n",
       "1420  9217v1  OmimVar       605704  rs281875284  NC_000020.11  58418288   C   \n",
       "1421  9217v1  OmimVar       605704  rs281875284  NC_000020.11  58418288   C   \n",
       "\n",
       "     alt  \n",
       "0      A  \n",
       "1      C  \n",
       "2      T  \n",
       "3      A  \n",
       "4      G  \n",
       "...   ..  \n",
       "1417   G  \n",
       "1418   T  \n",
       "1419   T  \n",
       "1420   G  \n",
       "1421   T  \n",
       "\n",
       "[1422 rows x 8 columns]"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dbsnp_final = parsed_variants.merge(dbsnp_data, on='ID')\n",
    "dbsnp_final"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f41d1e1-6e7e-49c7-81b8-456875e0f40a",
   "metadata": {},
   "source": [
    "### COSM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "99aa0a39-ce5d-4c4e-90a9-7b8f36672843",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Gene</th>\n",
       "      <th>TranscriptID</th>\n",
       "      <th>COSMID</th>\n",
       "      <th>NucChange</th>\n",
       "      <th>AAChange</th>\n",
       "      <th>Chr</th>\n",
       "      <th>Start</th>\n",
       "      <th>End</th>\n",
       "      <th>Strand</th>\n",
       "      <th>RefAllele</th>\n",
       "      <th>AltAllele</th>\n",
       "      <th>ID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CTNNB1</td>\n",
       "      <td>ENST00000643031.1</td>\n",
       "      <td>COSM5692</td>\n",
       "      <td>c.134C&gt;A</td>\n",
       "      <td>p.S45Y</td>\n",
       "      <td>3</td>\n",
       "      <td>41224646</td>\n",
       "      <td>41224646</td>\n",
       "      <td>+</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "      <td>5692</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CTNNB1</td>\n",
       "      <td>ENST00000642248.1</td>\n",
       "      <td>COSM5689</td>\n",
       "      <td>c.134C&gt;G</td>\n",
       "      <td>p.S45C</td>\n",
       "      <td>3</td>\n",
       "      <td>41224646</td>\n",
       "      <td>41224646</td>\n",
       "      <td>+</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "      <td>5689</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CDKN2A</td>\n",
       "      <td>ENST00000579755.1</td>\n",
       "      <td>COSM13508</td>\n",
       "      <td>c.375G&gt;A</td>\n",
       "      <td>p.G125=</td>\n",
       "      <td>9</td>\n",
       "      <td>21971027</td>\n",
       "      <td>21971027</td>\n",
       "      <td>-</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>13508</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CTNNB1</td>\n",
       "      <td>ENST00000396183.7</td>\n",
       "      <td>COSM5681</td>\n",
       "      <td>c.95A&gt;G</td>\n",
       "      <td>p.D32G</td>\n",
       "      <td>3</td>\n",
       "      <td>41224607</td>\n",
       "      <td>41224607</td>\n",
       "      <td>+</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "      <td>5681</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>CDKN2A</td>\n",
       "      <td>ENST00000530628.2</td>\n",
       "      <td>COSM13807</td>\n",
       "      <td>c.389G&gt;T</td>\n",
       "      <td>p.G130V</td>\n",
       "      <td>9</td>\n",
       "      <td>21971013</td>\n",
       "      <td>21971013</td>\n",
       "      <td>-</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "      <td>13807</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1134</th>\n",
       "      <td>CDKN2A</td>\n",
       "      <td>ENST00000579755.1</td>\n",
       "      <td>COSM13723</td>\n",
       "      <td>c.308G&gt;A</td>\n",
       "      <td>p.G103E</td>\n",
       "      <td>9</td>\n",
       "      <td>21971093</td>\n",
       "      <td>21971093</td>\n",
       "      <td>-</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>13723</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1135</th>\n",
       "      <td>CDKN2A</td>\n",
       "      <td>ENST00000578845.2</td>\n",
       "      <td>COSM13723</td>\n",
       "      <td>c.112G&gt;A</td>\n",
       "      <td>p.G38S</td>\n",
       "      <td>9</td>\n",
       "      <td>21971093</td>\n",
       "      <td>21971093</td>\n",
       "      <td>-</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>13723</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1136</th>\n",
       "      <td>CDKN2A</td>\n",
       "      <td>ENST00000579122.1</td>\n",
       "      <td>COSM12505</td>\n",
       "      <td>c.59C&gt;A</td>\n",
       "      <td>p.A20E</td>\n",
       "      <td>9</td>\n",
       "      <td>21974768</td>\n",
       "      <td>21974768</td>\n",
       "      <td>-</td>\n",
       "      <td>G</td>\n",
       "      <td>T</td>\n",
       "      <td>12505</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1137</th>\n",
       "      <td>FLT3</td>\n",
       "      <td>ENST00000380982.4</td>\n",
       "      <td>COSM785</td>\n",
       "      <td>c.2503G&gt;C</td>\n",
       "      <td>p.D835H</td>\n",
       "      <td>13</td>\n",
       "      <td>28592642</td>\n",
       "      <td>28592642</td>\n",
       "      <td>-</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "      <td>785</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1138</th>\n",
       "      <td>CDKN2A</td>\n",
       "      <td>ENST00000579122.1</td>\n",
       "      <td>COSM13723</td>\n",
       "      <td>c.265G&gt;A</td>\n",
       "      <td>p.G89S</td>\n",
       "      <td>9</td>\n",
       "      <td>21971093</td>\n",
       "      <td>21971093</td>\n",
       "      <td>-</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>13723</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1139 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        Gene       TranscriptID     COSMID  NucChange AAChange Chr     Start  \\\n",
       "0     CTNNB1  ENST00000643031.1   COSM5692   c.134C>A   p.S45Y   3  41224646   \n",
       "1     CTNNB1  ENST00000642248.1   COSM5689   c.134C>G   p.S45C   3  41224646   \n",
       "2     CDKN2A  ENST00000579755.1  COSM13508   c.375G>A  p.G125=   9  21971027   \n",
       "3     CTNNB1  ENST00000396183.7   COSM5681    c.95A>G   p.D32G   3  41224607   \n",
       "4     CDKN2A  ENST00000530628.2  COSM13807   c.389G>T  p.G130V   9  21971013   \n",
       "...      ...                ...        ...        ...      ...  ..       ...   \n",
       "1134  CDKN2A  ENST00000579755.1  COSM13723   c.308G>A  p.G103E   9  21971093   \n",
       "1135  CDKN2A  ENST00000578845.2  COSM13723   c.112G>A   p.G38S   9  21971093   \n",
       "1136  CDKN2A  ENST00000579122.1  COSM12505    c.59C>A   p.A20E   9  21974768   \n",
       "1137    FLT3  ENST00000380982.4    COSM785  c.2503G>C  p.D835H  13  28592642   \n",
       "1138  CDKN2A  ENST00000579122.1  COSM13723   c.265G>A   p.G89S   9  21971093   \n",
       "\n",
       "           End Strand RefAllele AltAllele     ID  \n",
       "0     41224646      +         C         A   5692  \n",
       "1     41224646      +         C         G   5689  \n",
       "2     21971027      -         C         T  13508  \n",
       "3     41224607      +         A         G   5681  \n",
       "4     21971013      -         C         A  13807  \n",
       "...        ...    ...       ...       ...    ...  \n",
       "1134  21971093      -         C         T  13723  \n",
       "1135  21971093      -         C         T  13723  \n",
       "1136  21974768      -         G         T  12505  \n",
       "1137  28592642      -         C         G    785  \n",
       "1138  21971093      -         C         T  13723  \n",
       "\n",
       "[1139 rows x 12 columns]"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cosm_data = pd.read_csv(\"COSM/COSM_total_parsed.tsv\",sep='\\t')\n",
    "cosm_data['ID'] = cosm_data['COSMID'].str[4:]\n",
    "cosm_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "cc7b32a9-6742-4cfd-a075-cf61fc098cc4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of missing COSM variant is 202\n"
     ]
    }
   ],
   "source": [
    "# Ensure ClinVar_ID is treated as string to avoid dtype mismatch\n",
    "cosm_data_ids = cosm_data[\"ID\"].astype(str).unique()\n",
    "\n",
    "missing_num = 0\n",
    "\n",
    "# Iterate and print missing ClinVar IDs\n",
    "for _, row in parsed_variants.iterrows():\n",
    "    if row[\"Source\"] == \"COSM\"  and str(row[\"ID\"]) not in clinvar_ids:\n",
    "        missing_num+=1\n",
    "print(f'Number of missing COSM variant is {missing_num}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "322fb3ef-866a-4f5b-8a2d-3ee5d33e6276",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ENTRY</th>\n",
       "      <th>Source</th>\n",
       "      <th>ID</th>\n",
       "      <th>Gene</th>\n",
       "      <th>TranscriptID</th>\n",
       "      <th>COSMID</th>\n",
       "      <th>NucChange</th>\n",
       "      <th>AAChange</th>\n",
       "      <th>Chr</th>\n",
       "      <th>Start</th>\n",
       "      <th>End</th>\n",
       "      <th>Strand</th>\n",
       "      <th>RefAllele</th>\n",
       "      <th>AltAllele</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1677139</td>\n",
       "      <td>CDK4</td>\n",
       "      <td>ENST00000312990.10</td>\n",
       "      <td>COSM1677139</td>\n",
       "      <td>c.70C&gt;T</td>\n",
       "      <td>p.R24C</td>\n",
       "      <td>12</td>\n",
       "      <td>57751648</td>\n",
       "      <td>57751648</td>\n",
       "      <td>-</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1677139</td>\n",
       "      <td>CDK4</td>\n",
       "      <td>ENST00000549606.5</td>\n",
       "      <td>COSM1677139</td>\n",
       "      <td>c.-158+527C&gt;T</td>\n",
       "      <td>p.?</td>\n",
       "      <td>12</td>\n",
       "      <td>57751648</td>\n",
       "      <td>57751648</td>\n",
       "      <td>-</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1677139</td>\n",
       "      <td>CDK4</td>\n",
       "      <td>ENST00000257904.10</td>\n",
       "      <td>COSM1677139</td>\n",
       "      <td>c.70C&gt;T</td>\n",
       "      <td>p.R24C</td>\n",
       "      <td>12</td>\n",
       "      <td>57751648</td>\n",
       "      <td>57751648</td>\n",
       "      <td>-</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1989836</td>\n",
       "      <td>CDK4</td>\n",
       "      <td>ENST00000312990.10</td>\n",
       "      <td>COSM1989836</td>\n",
       "      <td>c.71G&gt;A</td>\n",
       "      <td>p.R24H</td>\n",
       "      <td>12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>57751647</td>\n",
       "      <td>-</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1989836</td>\n",
       "      <td>CDK4</td>\n",
       "      <td>ENST00000549606.5</td>\n",
       "      <td>COSM1989836</td>\n",
       "      <td>c.-158+528G&gt;A</td>\n",
       "      <td>p.?</td>\n",
       "      <td>12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>57751647</td>\n",
       "      <td>-</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1134</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "      <td>CDH1</td>\n",
       "      <td>ENST00000612417.4</td>\n",
       "      <td>COSM4766271</td>\n",
       "      <td>c.662A&gt;G</td>\n",
       "      <td>p.D221G</td>\n",
       "      <td>16</td>\n",
       "      <td>68808823</td>\n",
       "      <td>68808823</td>\n",
       "      <td>+</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1135</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "      <td>CDH1</td>\n",
       "      <td>ENST00000611625.4</td>\n",
       "      <td>COSM4766271</td>\n",
       "      <td>c.662A&gt;G</td>\n",
       "      <td>p.D221G</td>\n",
       "      <td>16</td>\n",
       "      <td>68808823</td>\n",
       "      <td>68808823</td>\n",
       "      <td>+</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1136</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "      <td>CDH1</td>\n",
       "      <td>ENST00000422392.6</td>\n",
       "      <td>COSM4766271</td>\n",
       "      <td>c.662A&gt;G</td>\n",
       "      <td>p.D221G</td>\n",
       "      <td>16</td>\n",
       "      <td>68808823</td>\n",
       "      <td>68808823</td>\n",
       "      <td>+</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1137</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "      <td>CDH1</td>\n",
       "      <td>ENST00000621016.4</td>\n",
       "      <td>COSM4766271</td>\n",
       "      <td>c.662A&gt;G</td>\n",
       "      <td>p.D221G</td>\n",
       "      <td>16</td>\n",
       "      <td>68808823</td>\n",
       "      <td>68808823</td>\n",
       "      <td>+</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1138</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "      <td>CDH1</td>\n",
       "      <td>ENST00000261769.9</td>\n",
       "      <td>COSM4766271</td>\n",
       "      <td>c.662A&gt;G</td>\n",
       "      <td>p.D221G</td>\n",
       "      <td>16</td>\n",
       "      <td>68808823</td>\n",
       "      <td>68808823</td>\n",
       "      <td>+</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1139 rows × 14 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       ENTRY Source       ID  Gene        TranscriptID       COSMID  \\\n",
       "0     1019v2   COSM  1677139  CDK4  ENST00000312990.10  COSM1677139   \n",
       "1     1019v2   COSM  1677139  CDK4   ENST00000549606.5  COSM1677139   \n",
       "2     1019v2   COSM  1677139  CDK4  ENST00000257904.10  COSM1677139   \n",
       "3     1019v2   COSM  1989836  CDK4  ENST00000312990.10  COSM1989836   \n",
       "4     1019v2   COSM  1989836  CDK4   ENST00000549606.5  COSM1989836   \n",
       "...      ...    ...      ...   ...                 ...          ...   \n",
       "1134   999v2   COSM  4766271  CDH1   ENST00000612417.4  COSM4766271   \n",
       "1135   999v2   COSM  4766271  CDH1   ENST00000611625.4  COSM4766271   \n",
       "1136   999v2   COSM  4766271  CDH1   ENST00000422392.6  COSM4766271   \n",
       "1137   999v2   COSM  4766271  CDH1   ENST00000621016.4  COSM4766271   \n",
       "1138   999v2   COSM  4766271  CDH1   ENST00000261769.9  COSM4766271   \n",
       "\n",
       "          NucChange AAChange Chr     Start       End Strand RefAllele  \\\n",
       "0           c.70C>T   p.R24C  12  57751648  57751648      -         G   \n",
       "1     c.-158+527C>T      p.?  12  57751648  57751648      -         G   \n",
       "2           c.70C>T   p.R24C  12  57751648  57751648      -         G   \n",
       "3           c.71G>A   p.R24H  12  57751647  57751647      -         C   \n",
       "4     c.-158+528G>A      p.?  12  57751647  57751647      -         C   \n",
       "...             ...      ...  ..       ...       ...    ...       ...   \n",
       "1134       c.662A>G  p.D221G  16  68808823  68808823      +         A   \n",
       "1135       c.662A>G  p.D221G  16  68808823  68808823      +         A   \n",
       "1136       c.662A>G  p.D221G  16  68808823  68808823      +         A   \n",
       "1137       c.662A>G  p.D221G  16  68808823  68808823      +         A   \n",
       "1138       c.662A>G  p.D221G  16  68808823  68808823      +         A   \n",
       "\n",
       "     AltAllele  \n",
       "0            A  \n",
       "1            A  \n",
       "2            A  \n",
       "3            T  \n",
       "4            T  \n",
       "...        ...  \n",
       "1134         G  \n",
       "1135         G  \n",
       "1136         G  \n",
       "1137         G  \n",
       "1138         G  \n",
       "\n",
       "[1139 rows x 14 columns]"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cosm_final = parsed_variants.merge(cosm_data, on='ID')\n",
    "cosm_final"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83db14d7-1688-4b9d-a47b-0afec2f57a10",
   "metadata": {},
   "source": [
    "## Combining them together"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "468461b0-1b06-4993-9950-e5ba26b11aa0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ENTRY</th>\n",
       "      <th>Source</th>\n",
       "      <th>ID</th>\n",
       "      <th>seq_id</th>\n",
       "      <th>position</th>\n",
       "      <th>ref</th>\n",
       "      <th>alt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16928</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16929</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1027v3</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>183391</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12717896</td>\n",
       "      <td>CAGGCGGAGCACCCCAAGCC</td>\n",
       "      <td>CAGGCGGAGCACCCCAAGCCAGGCGGAGCACCCCAAGCC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1027v3</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>183393</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12718044</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1027v3</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>183395</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12718210</td>\n",
       "      <td>CTCT</td>\n",
       "      <td>CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>180</th>\n",
       "      <td>9049v1</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>4886</td>\n",
       "      <td>NC_000011.10</td>\n",
       "      <td>67483197</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>181</th>\n",
       "      <td>9049v1</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>4892</td>\n",
       "      <td>NC_000011.10</td>\n",
       "      <td>67490803</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>182</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>161992</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490442</td>\n",
       "      <td>T</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>183</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>161993</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490443</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>184</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>161995</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>185 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      ENTRY   Source      ID        seq_id  position                   ref  \\\n",
       "0    1019v2  ClinVar   16928  NC_000012.12  57751647                     G   \n",
       "1    1019v2  ClinVar   16929  NC_000012.12  57751646                     C   \n",
       "2    1027v3  ClinVar  183391  NC_000012.12  12717896  CAGGCGGAGCACCCCAAGCC   \n",
       "3    1027v3  ClinVar  183393  NC_000012.12  12718044                     C   \n",
       "4    1027v3  ClinVar  183395  NC_000012.12  12718210                  CTCT   \n",
       "..      ...      ...     ...           ...       ...                   ...   \n",
       "180  9049v1  ClinVar    4886  NC_000011.10  67483197                     C   \n",
       "181  9049v1  ClinVar    4892  NC_000011.10  67490803                     C   \n",
       "182  9101v1  ClinVar  161992  NC_000015.10  50490442                     T   \n",
       "183  9101v1  ClinVar  161993  NC_000015.10  50490443                     C   \n",
       "184  9101v1  ClinVar  161995  NC_000015.10  50490449                     C   \n",
       "\n",
       "                                         alt  \n",
       "0                                          A  \n",
       "1                                          T  \n",
       "2    CAGGCGGAGCACCCCAAGCCAGGCGGAGCACCCCAAGCC  \n",
       "3                                          T  \n",
       "4                                         CT  \n",
       "..                                       ...  \n",
       "180                                        T  \n",
       "181                                        A  \n",
       "182                                        C  \n",
       "183                                        G  \n",
       "184                                        G  \n",
       "\n",
       "[185 rows x 7 columns]"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clinvar_final"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "6cda9996-9725-4993-ab9e-ca8b74ced30a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ENTRY</th>\n",
       "      <th>Source</th>\n",
       "      <th>ID</th>\n",
       "      <th>seq_id</th>\n",
       "      <th>position</th>\n",
       "      <th>ref</th>\n",
       "      <th>alt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs11547328</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs11547328</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs11547328</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs104894340</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs104894340</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1417</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1418</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1419</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>58418317</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1420</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>58418288</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1421</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>58418288</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1422 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       ENTRY   Source           ID        seq_id  position ref alt\n",
       "0     1019v2    dbSNP   rs11547328  NC_000012.12  57751647   G   A\n",
       "1     1019v2    dbSNP   rs11547328  NC_000012.12  57751647   G   C\n",
       "2     1019v2    dbSNP   rs11547328  NC_000012.12  57751647   G   T\n",
       "3     1019v2    dbSNP  rs104894340  NC_000012.12  57751646   C   A\n",
       "4     1019v2    dbSNP  rs104894340  NC_000012.12  57751646   C   G\n",
       "...      ...      ...          ...           ...       ...  ..  ..\n",
       "1417  9101v1    dbSNP  rs672601311  NC_000015.10  50490449   C   G\n",
       "1418  9101v1    dbSNP  rs672601311  NC_000015.10  50490449   C   T\n",
       "1419  9217v1  OmimVar       605704  NC_000020.11  58418317   C   T\n",
       "1420  9217v1  OmimVar       605704  NC_000020.11  58418288   C   G\n",
       "1421  9217v1  OmimVar       605704  NC_000020.11  58418288   C   T\n",
       "\n",
       "[1422 rows x 7 columns]"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dbsnp_final = dbsnp_final.drop(columns=['dbsnp_id'])\n",
    "dbsnp_final"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "fac7aacc-79c6-4cef-8d4e-7be27bca34e2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ENTRY</th>\n",
       "      <th>Source</th>\n",
       "      <th>ID</th>\n",
       "      <th>seq_id</th>\n",
       "      <th>position</th>\n",
       "      <th>ref</th>\n",
       "      <th>alt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16928</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16929</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1027v3</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>183391</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12717896</td>\n",
       "      <td>CAGGCGGAGCACCCCAAGCC</td>\n",
       "      <td>CAGGCGGAGCACCCCAAGCCAGGCGGAGCACCCCAAGCC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1027v3</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>183393</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12718044</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1027v3</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>183395</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12718210</td>\n",
       "      <td>CTCT</td>\n",
       "      <td>CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1417</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1418</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1419</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>58418317</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1420</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>58418288</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1421</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>58418288</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1607 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       ENTRY   Source           ID        seq_id  position  \\\n",
       "0     1019v2  ClinVar        16928  NC_000012.12  57751647   \n",
       "1     1019v2  ClinVar        16929  NC_000012.12  57751646   \n",
       "2     1027v3  ClinVar       183391  NC_000012.12  12717896   \n",
       "3     1027v3  ClinVar       183393  NC_000012.12  12718044   \n",
       "4     1027v3  ClinVar       183395  NC_000012.12  12718210   \n",
       "...      ...      ...          ...           ...       ...   \n",
       "1417  9101v1    dbSNP  rs672601311  NC_000015.10  50490449   \n",
       "1418  9101v1    dbSNP  rs672601311  NC_000015.10  50490449   \n",
       "1419  9217v1  OmimVar       605704  NC_000020.11  58418317   \n",
       "1420  9217v1  OmimVar       605704  NC_000020.11  58418288   \n",
       "1421  9217v1  OmimVar       605704  NC_000020.11  58418288   \n",
       "\n",
       "                       ref                                      alt  \n",
       "0                        G                                        A  \n",
       "1                        C                                        T  \n",
       "2     CAGGCGGAGCACCCCAAGCC  CAGGCGGAGCACCCCAAGCCAGGCGGAGCACCCCAAGCC  \n",
       "3                        C                                        T  \n",
       "4                     CTCT                                       CT  \n",
       "...                    ...                                      ...  \n",
       "1417                     C                                        G  \n",
       "1418                     C                                        T  \n",
       "1419                     C                                        T  \n",
       "1420                     C                                        G  \n",
       "1421                     C                                        T  \n",
       "\n",
       "[1607 rows x 7 columns]"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
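   ],
   "source": [
    "# Stack the ClinVar rows on top of the dbSNP/OmimVar rows; row labels are kept as-is, so index values repeat\n",
    "clinvar_dbsnp = pd.concat([clinvar_final, dbsnp_final])\n",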
    "clinvar_dbsnp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "6956d533-6f6e-4526-9350-f4db614e14da",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Align column names with the COSMIC table; seq_id holds RefSeq chromosome accessions (NC_...), not transcript IDs\n",
    "clinvar_dbsnp = clinvar_dbsnp.rename(columns={\"seq_id\": \"TranscriptID\", \"position\": \"Start\", \"ref\": \"RefAllele\", \"alt\": \"AltAllele\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "343b56f9-f078-47e6-9c88-5ceff0a8b537",
   "metadata": {},
   "outputs": [],
   "source": [
    "# End mirrors Start for every row, including rows whose RefAllele spans several bases\n",
    "clinvar_dbsnp[\"End\"] = clinvar_dbsnp[\"Start\"]"
   ]
  },
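  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "8717fee6-a548-487b-87e9-f1fef6d2429a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Chromosome number = string positions 7-8 (0-based) of the RefSeq accession, e.g. NC_000012.12 -> 12\n",
    "clinvar_dbsnp['Chr'] = clinvar_dbsnp['TranscriptID'].str[7:9].astype(int)"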
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "id": "238baf78-d5f4-4c2b-80b4-ce2b4df67cde",
   "metadata": {},
   "outputs": [],
   "source": [
    "clinvar_dbsnp = clinvar_dbsnp[['ENTRY', 'Source', 'ID', 'TranscriptID', 'Chr', 'Start', 'End', 'RefAllele', 'AltAllele']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "id": "a1482c60-3453-4f16-89bd-9fc73ba6b622",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ENTRY</th>\n",
       "      <th>Source</th>\n",
       "      <th>ID</th>\n",
       "      <th>TranscriptID</th>\n",
       "      <th>Chr</th>\n",
       "      <th>Start</th>\n",
       "      <th>End</th>\n",
       "      <th>RefAllele</th>\n",
       "      <th>AltAllele</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16928</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16929</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1027v3</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>183391</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12</td>\n",
       "      <td>12717896</td>\n",
       "      <td>12717896</td>\n",
       "      <td>CAGGCGGAGCACCCCAAGCC</td>\n",
       "      <td>CAGGCGGAGCACCCCAAGCCAGGCGGAGCACCCCAAGCC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1027v3</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>183393</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12</td>\n",
       "      <td>12718044</td>\n",
       "      <td>12718044</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1027v3</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>183395</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>12</td>\n",
       "      <td>12718210</td>\n",
       "      <td>12718210</td>\n",
       "      <td>CTCT</td>\n",
       "      <td>CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1417</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>15</td>\n",
       "      <td>50490449</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1418</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>15</td>\n",
       "      <td>50490449</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1419</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>20</td>\n",
       "      <td>58418317</td>\n",
       "      <td>58418317</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1420</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>20</td>\n",
       "      <td>58418288</td>\n",
       "      <td>58418288</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1421</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>20</td>\n",
       "      <td>58418288</td>\n",
       "      <td>58418288</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1607 rows × 9 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       ENTRY   Source           ID  TranscriptID  Chr     Start       End  \\\n",
       "0     1019v2  ClinVar        16928  NC_000012.12   12  57751647  57751647   \n",
       "1     1019v2  ClinVar        16929  NC_000012.12   12  57751646  57751646   \n",
       "2     1027v3  ClinVar       183391  NC_000012.12   12  12717896  12717896   \n",
       "3     1027v3  ClinVar       183393  NC_000012.12   12  12718044  12718044   \n",
       "4     1027v3  ClinVar       183395  NC_000012.12   12  12718210  12718210   \n",
       "...      ...      ...          ...           ...  ...       ...       ...   \n",
       "1417  9101v1    dbSNP  rs672601311  NC_000015.10   15  50490449  50490449   \n",
       "1418  9101v1    dbSNP  rs672601311  NC_000015.10   15  50490449  50490449   \n",
       "1419  9217v1  OmimVar       605704  NC_000020.11   20  58418317  58418317   \n",
       "1420  9217v1  OmimVar       605704  NC_000020.11   20  58418288  58418288   \n",
       "1421  9217v1  OmimVar       605704  NC_000020.11   20  58418288  58418288   \n",
       "\n",
       "                 RefAllele                                AltAllele  \n",
       "0                        G                                        A  \n",
       "1                        C                                        T  \n",
       "2     CAGGCGGAGCACCCCAAGCC  CAGGCGGAGCACCCCAAGCCAGGCGGAGCACCCCAAGCC  \n",
       "3                        C                                        T  \n",
       "4                     CTCT                                       CT  \n",
       "...                    ...                                      ...  \n",
       "1417                     C                                        G  \n",
       "1418                     C                                        T  \n",
       "1419                     C                                        T  \n",
       "1420                     C                                        G  \n",
       "1421                     C                                        T  \n",
       "\n",
       "[1607 rows x 9 columns]"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clinvar_dbsnp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "id": "8207bc04-9787-4621-818e-7e9cc17770ce",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ENTRY</th>\n",
       "      <th>Source</th>\n",
       "      <th>ID</th>\n",
       "      <th>TranscriptID</th>\n",
       "      <th>NucChange</th>\n",
       "      <th>Chr</th>\n",
       "      <th>Start</th>\n",
       "      <th>End</th>\n",
       "      <th>RefAllele</th>\n",
       "      <th>AltAllele</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1677139</td>\n",
       "      <td>ENST00000312990.10</td>\n",
       "      <td>c.70C&gt;T</td>\n",
       "      <td>12</td>\n",
       "      <td>57751648</td>\n",
       "      <td>57751648</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1677139</td>\n",
       "      <td>ENST00000549606.5</td>\n",
       "      <td>c.-158+527C&gt;T</td>\n",
       "      <td>12</td>\n",
       "      <td>57751648</td>\n",
       "      <td>57751648</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1677139</td>\n",
       "      <td>ENST00000257904.10</td>\n",
       "      <td>c.70C&gt;T</td>\n",
       "      <td>12</td>\n",
       "      <td>57751648</td>\n",
       "      <td>57751648</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1989836</td>\n",
       "      <td>ENST00000312990.10</td>\n",
       "      <td>c.71G&gt;A</td>\n",
       "      <td>12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>57751647</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1989836</td>\n",
       "      <td>ENST00000549606.5</td>\n",
       "      <td>c.-158+528G&gt;A</td>\n",
       "      <td>12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>57751647</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1134</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "      <td>ENST00000612417.4</td>\n",
       "      <td>c.662A&gt;G</td>\n",
       "      <td>16</td>\n",
       "      <td>68808823</td>\n",
       "      <td>68808823</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1135</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "      <td>ENST00000611625.4</td>\n",
       "      <td>c.662A&gt;G</td>\n",
       "      <td>16</td>\n",
       "      <td>68808823</td>\n",
       "      <td>68808823</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1136</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "      <td>ENST00000422392.6</td>\n",
       "      <td>c.662A&gt;G</td>\n",
       "      <td>16</td>\n",
       "      <td>68808823</td>\n",
       "      <td>68808823</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1137</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "      <td>ENST00000621016.4</td>\n",
       "      <td>c.662A&gt;G</td>\n",
       "      <td>16</td>\n",
       "      <td>68808823</td>\n",
       "      <td>68808823</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1138</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "      <td>ENST00000261769.9</td>\n",
       "      <td>c.662A&gt;G</td>\n",
       "      <td>16</td>\n",
       "      <td>68808823</td>\n",
       "      <td>68808823</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1139 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       ENTRY Source       ID        TranscriptID      NucChange Chr     Start  \\\n",
       "0     1019v2   COSM  1677139  ENST00000312990.10        c.70C>T  12  57751648   \n",
       "1     1019v2   COSM  1677139   ENST00000549606.5  c.-158+527C>T  12  57751648   \n",
       "2     1019v2   COSM  1677139  ENST00000257904.10        c.70C>T  12  57751648   \n",
       "3     1019v2   COSM  1989836  ENST00000312990.10        c.71G>A  12  57751647   \n",
       "4     1019v2   COSM  1989836   ENST00000549606.5  c.-158+528G>A  12  57751647   \n",
       "...      ...    ...      ...                 ...            ...  ..       ...   \n",
       "1134   999v2   COSM  4766271   ENST00000612417.4       c.662A>G  16  68808823   \n",
       "1135   999v2   COSM  4766271   ENST00000611625.4       c.662A>G  16  68808823   \n",
       "1136   999v2   COSM  4766271   ENST00000422392.6       c.662A>G  16  68808823   \n",
       "1137   999v2   COSM  4766271   ENST00000621016.4       c.662A>G  16  68808823   \n",
       "1138   999v2   COSM  4766271   ENST00000261769.9       c.662A>G  16  68808823   \n",
       "\n",
       "           End RefAllele AltAllele  \n",
       "0     57751648         G         A  \n",
       "1     57751648         G         A  \n",
       "2     57751648         G         A  \n",
       "3     57751647         C         T  \n",
       "4     57751647         C         T  \n",
       "...        ...       ...       ...  \n",
       "1134  68808823         A         G  \n",
       "1135  68808823         A         G  \n",
       "1136  68808823         A         G  \n",
       "1137  68808823         A         G  \n",
       "1138  68808823         A         G  \n",
       "\n",
       "[1139 rows x 10 columns]"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cosm_final = cosm_final.drop(columns=[\"Gene\", \"COSMID\", \"AAChange\", \"Strand\"])\n",
    "cosm_final"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9c0fd4d-14e1-4018-b848-678717d265f0",
   "metadata": {},
   "source": [
    "**Final Concatenation**"
   ]
  },
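  {
   "cell_type": "code",
   "execution_count": 80,
   "id": "ced3d93a-e5a8-4283-8925-8257540a5e99",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Combine COSMIC rows with ClinVar/dbSNP rows; NucChange exists only in the COSMIC table, so it is NaN for the other sources\n",
    "final_data = pd.concat([cosm_final, clinvar_dbsnp])"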
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "id": "8e2d6e3e-38ca-4aef-bc32-7f7ea3f45126",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ENTRY</th>\n",
       "      <th>Source</th>\n",
       "      <th>ID</th>\n",
       "      <th>TranscriptID</th>\n",
       "      <th>NucChange</th>\n",
       "      <th>Chr</th>\n",
       "      <th>Start</th>\n",
       "      <th>End</th>\n",
       "      <th>RefAllele</th>\n",
       "      <th>AltAllele</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1677139</td>\n",
       "      <td>ENST00000312990.10</td>\n",
       "      <td>c.70C&gt;T</td>\n",
       "      <td>12</td>\n",
       "      <td>57751648</td>\n",
       "      <td>57751648</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1677139</td>\n",
       "      <td>ENST00000549606.5</td>\n",
       "      <td>c.-158+527C&gt;T</td>\n",
       "      <td>12</td>\n",
       "      <td>57751648</td>\n",
       "      <td>57751648</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1677139</td>\n",
       "      <td>ENST00000257904.10</td>\n",
       "      <td>c.70C&gt;T</td>\n",
       "      <td>12</td>\n",
       "      <td>57751648</td>\n",
       "      <td>57751648</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1989836</td>\n",
       "      <td>ENST00000312990.10</td>\n",
       "      <td>c.71G&gt;A</td>\n",
       "      <td>12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>57751647</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1989836</td>\n",
       "      <td>ENST00000549606.5</td>\n",
       "      <td>c.-158+528G&gt;A</td>\n",
       "      <td>12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>57751647</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1417</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>NaN</td>\n",
       "      <td>15</td>\n",
       "      <td>50490449</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1418</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs672601311</td>\n",
       "      <td>NC_000015.10</td>\n",
       "      <td>NaN</td>\n",
       "      <td>15</td>\n",
       "      <td>50490449</td>\n",
       "      <td>50490449</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1419</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20</td>\n",
       "      <td>58418317</td>\n",
       "      <td>58418317</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1420</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20</td>\n",
       "      <td>58418288</td>\n",
       "      <td>58418288</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1421</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>OmimVar</td>\n",
       "      <td>605704</td>\n",
       "      <td>NC_000020.11</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20</td>\n",
       "      <td>58418288</td>\n",
       "      <td>58418288</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2746 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       ENTRY   Source           ID        TranscriptID      NucChange Chr  \\\n",
       "0     1019v2     COSM      1677139  ENST00000312990.10        c.70C>T  12   \n",
       "1     1019v2     COSM      1677139   ENST00000549606.5  c.-158+527C>T  12   \n",
       "2     1019v2     COSM      1677139  ENST00000257904.10        c.70C>T  12   \n",
       "3     1019v2     COSM      1989836  ENST00000312990.10        c.71G>A  12   \n",
       "4     1019v2     COSM      1989836   ENST00000549606.5  c.-158+528G>A  12   \n",
       "...      ...      ...          ...                 ...            ...  ..   \n",
       "1417  9101v1    dbSNP  rs672601311        NC_000015.10            NaN  15   \n",
       "1418  9101v1    dbSNP  rs672601311        NC_000015.10            NaN  15   \n",
       "1419  9217v1  OmimVar       605704        NC_000020.11            NaN  20   \n",
       "1420  9217v1  OmimVar       605704        NC_000020.11            NaN  20   \n",
       "1421  9217v1  OmimVar       605704        NC_000020.11            NaN  20   \n",
       "\n",
       "         Start       End RefAllele AltAllele  \n",
       "0     57751648  57751648         G         A  \n",
       "1     57751648  57751648         G         A  \n",
       "2     57751648  57751648         G         A  \n",
       "3     57751647  57751647         C         T  \n",
       "4     57751647  57751647         C         T  \n",
       "...        ...       ...       ...       ...  \n",
       "1417  50490449  50490449         C         G  \n",
       "1418  50490449  50490449         C         T  \n",
       "1419  58418317  58418317         C         T  \n",
       "1420  58418288  58418288         C         G  \n",
       "1421  58418288  58418288         C         T  \n",
       "\n",
       "[2746 rows x 10 columns]"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "final_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "id": "3bec1451-dfd8-4597-b7a7-6b1ec5d70b13",
   "metadata": {},
   "outputs": [],
   "source": [
    "final_data.to_csv(\"all_variant_data.tsv\",sep='\\t',index=False, header=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4723ff8-9848-4f96-8461-2175e986a8f2",
   "metadata": {},
   "source": [
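    "Duplicate rows were removed in Excel, keeping one row per combination of variant ID, chromosome number, reference allele and alternate allele.\n",
    "\n",
    "After also removing 1 line during manual inspection, 761 variants and their associated variant IDs remain. This cleaned table is the `all_variant_data.csv` file loaded in the next section."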
   ]
  },
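  {
   "cell_type": "markdown",
   "id": "sketch-pandas-dedup",
   "metadata": {},
   "source": [
    "The Excel de-duplication could also be expressed in pandas, which would make the step reproducible. The block below is only a minimal sketch, not the procedure that was actually run: it assumes that 'Variant ID' maps to the `ID` column and 'Chromosome number' to `Chr`, and it works on a small made-up table (`toy`) rather than on `final_data`. The single line dropped by manual inspection is not reproduced.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy rows using the same column names as final_data (values are illustrative only)\n",
    "toy = pd.DataFrame({\n",
    "    'ENTRY': ['1019v2', '1019v2', '1019v2', '9101v1'],\n",
    "    'ID': ['1677139', '1677139', '1989836', 'rs672601311'],\n",
    "    'Chr': ['12', '12', '12', '15'],\n",
    "    'RefAllele': ['G', 'G', 'C', 'C'],\n",
    "    'AltAllele': ['A', 'A', 'T', 'G'],\n",
    "})\n",
    "\n",
    "# Keep the first row for each (ID, Chr, RefAllele, AltAllele) combination\n",
    "dedup_keys = ['ID', 'Chr', 'RefAllele', 'AltAllele']\n",
    "deduplicated = toy.drop_duplicates(subset=dedup_keys, keep='first').reset_index(drop=True)\n",
    "print(len(toy), 'rows before,', len(deduplicated), 'rows after')\n",
    "```\n",
    "\n",
    "With `keep='first'` the surviving row is the first one in table order, so for COSMIC variants that appear once per transcript the retained `TranscriptID` depends on how the rows are sorted beforehand."
   ]
  },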
  {
   "cell_type": "markdown",
   "id": "1f3c322f-17b8-4ab9-8a8f-a7b60ea73ab0",
   "metadata": {},
   "source": [
    "# Variant ID to Network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "id": "48fff44f-22ab-4c30-a24b-2bc029e72463",
   "metadata": {},
   "outputs": [],
   "source": [
    "gene_variant = pd.read_csv(\"gene_variants.tsv\", sep='\\t', names=['Network','ENTRY'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "id": "bba198c9-a63e-466b-871e-b0ee30f84e56",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Network</th>\n",
       "      <th>ENTRY</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>N00002</td>\n",
       "      <td>25v1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>N00002</td>\n",
       "      <td>25v2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>N00003</td>\n",
       "      <td>3815v1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>N00004</td>\n",
       "      <td>2322v1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>N00004</td>\n",
       "      <td>2322v2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>323</th>\n",
       "      <td>N01714</td>\n",
       "      <td>2760v1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>324</th>\n",
       "      <td>N01809</td>\n",
       "      <td>5052v1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>325</th>\n",
       "      <td>N01873</td>\n",
       "      <td>7428v3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>326</th>\n",
       "      <td>N01876</td>\n",
       "      <td>3084v1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>327</th>\n",
       "      <td>N01877</td>\n",
       "      <td>2066v1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>328 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    Network   ENTRY\n",
       "0    N00002    25v1\n",
       "1    N00002    25v2\n",
       "2    N00003  3815v1\n",
       "3    N00004  2322v1\n",
       "4    N00004  2322v2\n",
       "..      ...     ...\n",
       "323  N01714  2760v1\n",
       "324  N01809  5052v1\n",
       "325  N01873  7428v3\n",
       "326  N01876  3084v1\n",
       "327  N01877  2066v1\n",
       "\n",
       "[328 rows x 2 columns]"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gene_variant"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "id": "8d2f02f0-bd56-4693-88c7-2f0124a12fa4",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_variant_data = pd.read_csv(\"all_variant_data.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "id": "707e21dc-85b0-48da-9d2e-b20c1351d035",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ENTRY</th>\n",
       "      <th>Source</th>\n",
       "      <th>ID</th>\n",
       "      <th>TranscriptID</th>\n",
       "      <th>NucChange</th>\n",
       "      <th>Chr</th>\n",
       "      <th>Start</th>\n",
       "      <th>End</th>\n",
       "      <th>RefAllele</th>\n",
       "      <th>AltAllele</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16929</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs104894340</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs104894340</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16928</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs11547328</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>756</th>\n",
       "      <td>9817v1</td>\n",
       "      <td>COSM</td>\n",
       "      <td>6196635</td>\n",
       "      <td>ENST00000393623.6</td>\n",
       "      <td>c.706G&gt;T</td>\n",
       "      <td>19</td>\n",
       "      <td>10492196</td>\n",
       "      <td>10492196</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>757</th>\n",
       "      <td>9817v1</td>\n",
       "      <td>COSM</td>\n",
       "      <td>6196637</td>\n",
       "      <td>ENST00000393623.6</td>\n",
       "      <td>c.548A&gt;G</td>\n",
       "      <td>19</td>\n",
       "      <td>10499486</td>\n",
       "      <td>10499486</td>\n",
       "      <td>T</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>758</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "      <td>ENST00000621016.4</td>\n",
       "      <td>c.662A&gt;G</td>\n",
       "      <td>16</td>\n",
       "      <td>68808823</td>\n",
       "      <td>68808823</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>759</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766211</td>\n",
       "      <td>ENST00000621016.4</td>\n",
       "      <td>c.755T&gt;G</td>\n",
       "      <td>16</td>\n",
       "      <td>68810264</td>\n",
       "      <td>68810264</td>\n",
       "      <td>T</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>760</th>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1379150</td>\n",
       "      <td>ENST00000621016.4</td>\n",
       "      <td>c.769G&gt;A</td>\n",
       "      <td>16</td>\n",
       "      <td>68810278</td>\n",
       "      <td>68810278</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>761 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      ENTRY   Source           ID       TranscriptID NucChange  Chr     Start  \\\n",
       "0    1019v2  ClinVar        16929       NC_000012.12       NaN   12  57751646   \n",
       "1    1019v2    dbSNP  rs104894340       NC_000012.12       NaN   12  57751646   \n",
       "2    1019v2    dbSNP  rs104894340       NC_000012.12       NaN   12  57751646   \n",
       "3    1019v2  ClinVar        16928       NC_000012.12       NaN   12  57751647   \n",
       "4    1019v2    dbSNP   rs11547328       NC_000012.12       NaN   12  57751647   \n",
       "..      ...      ...          ...                ...       ...  ...       ...   \n",
       "756  9817v1     COSM      6196635  ENST00000393623.6  c.706G>T   19  10492196   \n",
       "757  9817v1     COSM      6196637  ENST00000393623.6  c.548A>G   19  10499486   \n",
       "758   999v2     COSM      4766271  ENST00000621016.4  c.662A>G   16  68808823   \n",
       "759   999v2     COSM      4766211  ENST00000621016.4  c.755T>G   16  68810264   \n",
       "760   999v2     COSM      1379150  ENST00000621016.4  c.769G>A   16  68810278   \n",
       "\n",
       "          End RefAllele AltAllele  \n",
       "0    57751646         C         T  \n",
       "1    57751646         C         A  \n",
       "2    57751646         C         G  \n",
       "3    57751647         G         A  \n",
       "4    57751647         G         C  \n",
       "..        ...       ...       ...  \n",
       "756  10492196         C         A  \n",
       "757  10499486         T         C  \n",
       "758  68808823         A         G  \n",
       "759  68810264         T         G  \n",
       "760  68810278         G         A  \n",
       "\n",
       "[761 rows x 10 columns]"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_variant_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "id": "fcc506c3-c957-4e8a-acbd-bdb0c9dc6318",
   "metadata": {},
   "outputs": [],
   "source": [
    "variant_data_together_wo_nt = all_variant_data.merge(gene_variant, on=\"ENTRY\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "id": "e679f511-77da-40c1-9f5e-25162fd7f714",
   "metadata": {},
   "outputs": [],
   "source": [
    "variant_data_together_wo_nt.to_csv(\"variant_data_together_wo_nt.tsv\", sep='\\t',index=False, header=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1cf263d2-a41b-422c-b095-4a18184158c6",
   "metadata": {},
   "source": [
    "# Parsing Unique Networks and Getting Gene Pathway"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4586fd55-9de0-4d1c-b81f-92bdcce839ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "d8f896ab-5859-438b-97f9-392c6f7c837b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     182 network_variant_data_unique.txt\n"
     ]
    }
   ],
   "source": [
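    "# Collect the unique Network IDs (first column of the TSV) and drop the header line\n",
    "cut -f 1 variant_data_together_wo_nt.tsv > network_variant_data.txt\n",
    "sort -u network_variant_data.txt > network_variant_data_unique.txt\n",
    "# Note: -i '' is the BSD/macOS form of in-place editing; GNU sed uses plain -i\n",
    "sed -i '' '/Network/d' network_variant_data_unique.txt\n",
    "wc -l network_variant_data_unique.txt"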
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "9a1fa0c9-94a0-40f7-831a-557532512878",
   "metadata": {},
   "outputs": [],
   "source": [
    "while read p; do\n",
    "    if ! grep -q ENTRY network_variant/$p.txt; then\n",
    "        echo \"$p\"\n",
    "    fi\n",
    "done < network_variant_data_unique.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "3bd437dc-7caa-4b30-9587-84397015be0f",
   "metadata": {},
   "outputs": [],
   "source": [
    "while read p; do\n",
    "    if ! grep -q NAME network_variant/$p.txt; then\n",
    "        echo \"$p\"\n",
    "    fi\n",
    "done < network_variant_data_unique.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "94f49e6f-eea7-4eb3-b495-23bc0593633d",
   "metadata": {},
   "outputs": [],
   "source": [
    "while read p; do\n",
    "    if ! grep -q DEFINITION network_variant/$p.txt; then\n",
    "        echo \"$p\"\n",
    "    fi\n",
    "done < network_variant_data_unique.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "baf9f804-fc45-4560-8bbc-fe9e43cebb09",
   "metadata": {},
   "outputs": [],
   "source": [
    "while read p; do\n",
    "    if ! grep -q EXPANDED network_variant/$p.txt; then\n",
    "        echo \"$p\"\n",
    "    fi\n",
    "done < network_variant_data_unique.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "931168b5-9a73-4dcb-ad78-fcc41a911503",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "N00302\n",
      "N00303\n",
      "N00304\n",
      "N00305\n",
      "N00600\n",
      "N00643\n",
      "N00679\n",
      "N00789\n",
      "N01064\n",
      "N01065\n",
      "N01419\n",
      "N01422\n",
      "N01444\n",
      "N01714\n"
     ]
    }
   ],
   "source": [
    "while read p; do\n",
    "    if ! grep -q PATHWAY network_variant/$p.txt; then\n",
    "        echo \"$p\"\n",
    "    fi\n",
    "done < network_variant_data_unique.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "dd3b8bb6-0b8d-43f9-8241-f093e6b7a063",
   "metadata": {},
   "outputs": [],
   "source": [
    "while read p; do\n",
    "    if ! grep -q CLASS network_variant/$p.txt; then\n",
    "        echo \"$p\"\n",
    "    fi\n",
    "done < network_variant_data_unique.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "e2d1d5ad-9662-4c7a-abe2-bf155a6e0257",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "N01683\n",
      "N01689\n",
      "N01697\n",
      "N01698\n",
      "N01699\n",
      "N01700\n",
      "N01702\n",
      "N01704\n",
      "N01714\n"
     ]
    }
   ],
   "source": [
    "while read p; do\n",
    "    if ! grep -q DISEASE network_variant/$p.txt; then\n",
    "        echo \"$p\"\n",
    "    fi\n",
    "done < network_variant_data_unique.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d8eeded0-fda2-4a90-9dcb-b4cb841d77b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "while read p; do\n",
    "    if ! grep -q GENE network_variant/$p.txt; then\n",
    "        echo \"$p\"\n",
    "    fi\n",
    "done < network_variant_data_unique.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "bad6343d-6c08-4738-9902-ee17f3832b40",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop the nine networks reported above as lacking a DISEASE field\n",
    "for id in N01683 N01689 N01697 N01698 N01699 N01700 N01702 N01704 N01714; do\n",
    "    sed -i '' \"/$id/d\" network_variant_data_unique.txt\n",
    "done"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eab3e1bd-3725-4037-839d-ed06e02eff4c",
   "metadata": {},
   "source": [
    "The nine networks removed above lack a DISEASE tag, so no ground-truth paragraph is available for them. They are excluded, leaving 173 networks."
   ]
  },
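  {
   "cell_type": "markdown",
   "id": "sketch-missing-field-check",
   "metadata": {},
   "source": [
    "The shell loops above test every network file for one field at a time. The same check can be sketched in Python as shown below. This is an illustration under stated assumptions, not the code that produced the lists above: `networks_missing_field` is a hypothetical helper, it looks for the field name at the start of a line (the `grep -q` calls match it anywhere in the file), and the demo writes two made-up files to a temporary directory instead of reading `network_variant/`.\n",
    "\n",
    "```python\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "def networks_missing_field(network_ids, directory, field):\n",
    "    # Return the IDs whose file in `directory` has no line starting with `field`\n",
    "    missing = []\n",
    "    for network_id in network_ids:\n",
    "        text = Path(directory, network_id + '.txt').read_text()\n",
    "        if not any(line.startswith(field) for line in text.splitlines()):\n",
    "            missing.append(network_id)\n",
    "    return missing\n",
    "\n",
    "# Demo on two made-up files (IDs and contents are placeholders, not KEGG records)\n",
    "demo_files = {'NX0001': ['ENTRY', 'DISEASE'], 'NX0002': ['ENTRY']}\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    for network_id, fields in demo_files.items():\n",
    "        with open(Path(tmp, network_id + '.txt'), 'w') as handle:\n",
    "            for field in fields:\n",
    "                print(field + '       placeholder', file=handle)\n",
    "    ids = list(demo_files)\n",
    "    no_disease = networks_missing_field(ids, tmp, 'DISEASE')\n",
    "    kept = [network_id for network_id in ids if network_id not in no_disease]\n",
    "\n",
    "print('missing DISEASE:', no_disease)\n",
    "print('kept:', kept)\n",
    "```\n",
    "\n",
    "Calling the helper once per field name would cover all eight checks, and filtering the ID list in memory avoids editing `network_variant_data_unique.txt` in place."
   ]
  },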
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "3a05997d-80bf-4f89-9630-7adf8b6b2866",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     173 network_variant_data_unique.txt\n"
     ]
    }
   ],
   "source": [
    "wc -l network_variant_data_unique.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30578490-ad11-4bed-b683-80fa41f8c41e",
   "metadata": {},
   "source": [
    "**Switch to Python**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b58d998-6919-4961-a18e-89a5dfb96d1c",
   "metadata": {},
   "outputs": [],
   "source": [
    "%cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6a0904c9-366a-48b5-9291-efc09039478f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "id": "85bd0c1f-cc3a-4fed-94f0-e2dd7d5bb598",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define column structure\n",
    "network_info = pd.DataFrame(columns=[\"Entry\", \"Name\", \"Definition\", \"Expanded\", \"Pathway\", \"Class\", \"Disease\", \"Gene\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "id": "f568e7e9-d28c-44b5-8224-fefbd31735bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the network IDs retained after filtering (one ID per line)\n",
    "with open('network_variant_data_unique.txt', 'r') as f:\n",
    "    network_var_id = [line.strip() for line in f if line.strip()]\n",
    "\n",
    "# Return the text after `key` on the first line that starts with it\n",
    "# (leading whitespace is ignored); empty string if the key is absent\n",
    "def get_single_line_value(lines, key):\n",
    "    for line in lines:\n",
    "        if line.lstrip().startswith(key):\n",
    "            return line.split(key, 1)[-1].strip()\n",
    "    return \"\"\n",
    "\n",
    "# Collect a multi-line field: the text on the key line plus the indented\n",
    "# continuation lines that follow it, joined with \"| \"\n",
    "def get_multiline_values(lines, key):\n",
    "    values = []\n",
    "    recording = False\n",
    "    for line in lines:\n",
    "        if line.startswith(key):\n",
    "            # Capture first line's content after the key\n",
    "            initial_value = line[len(key):].strip()\n",
    "            if initial_value:\n",
    "                values.append(initial_value)\n",
    "            recording = True\n",
    "            continue\n",
    "        if recording:\n",
    "            if re.match(r'^\\s{2,}', line):  # line starts with 2+ whitespace characters\n",
    "                values.append(line.strip())\n",
    "            else:\n",
    "                break  # stop when indentation breaks\n",
    "    return \"| \".join(values)\n",
    "\n",
    "# Parse each network file into one row; rows are appended to the table once at the end\n",
    "rows = []\n",
    "for variant_id in network_var_id:  # note: these are network IDs (e.g. N00002)\n",
    "    file_path = f'network_variant/{variant_id}.txt'\n",
    "\n",
    "    try:\n",
    "        with open(file_path, 'r') as f:\n",
    "            lines = f.readlines()\n",
    "\n",
    "        rows.append({\n",
    "            \"Entry\": variant_id,\n",
    "            \"Name\": get_single_line_value(lines, \"NAME\"),\n",
    "            \"Definition\": get_single_line_value(lines, \"DEFINITION\"),\n",
    "            \"Expanded\": get_single_line_value(lines, \"EXPANDED\"),\n",
    "            \"Pathway\": get_multiline_values(lines, \"PATHWAY\"),\n",
    "            \"Class\": get_multiline_values(lines, \"CLASS\"),\n",
    "            \"Disease\": get_multiline_values(lines, \"DISEASE\"),\n",
    "            \"Gene\": get_multiline_values(lines, \"GENE\")\n",
    "        })\n",
    "\n",
    "    except FileNotFoundError:\n",
    "        print(f\"[Warning] File not found: {file_path}\")\n",
    "\n",
    "# A single concat avoids copying the growing table on every iteration\n",
    "network_info = pd.concat([network_info, pd.DataFrame(rows)], ignore_index=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "id": "ed1fef16-cf95-44ef-bfd7-552c631b725e",
   "metadata": {},
   "outputs": [],
   "source": [
    "network_info = network_info.set_index('Entry')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "id": "a2188680-0f9d-4e7f-89f1-9dc5aa95094f",
   "metadata": {},
   "outputs": [],
   "source": [
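    "# Networks whose file has no PATHWAY field (from the PATHWAY check in the shell section;\n",
    "# N01714 is omitted because it was already removed for lacking DISEASE)\n",
    "no_pathway = [\"N00302\",\"N00303\",\"N00304\",\"N00305\",\"N00600\",\"N00643\",\"N00679\",\"N00789\",\"N01064\",\"N01065\",\"N01419\",\"N01422\",\"N01444\"]\n",
    "for network_id in no_pathway:  # avoid shadowing the built-in id()\n",
    "    network_info.at[network_id, 'Pathway'] = pd.NA"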
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "id": "194ccc3a-54c5-483b-90d7-b5bcda4bdfe4",
   "metadata": {},
   "outputs": [],
   "source": [
    "network_info = network_info.reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "id": "e3d1de46-c7c8-4615-b9ff-2d4d08e979f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Columns holding \"| \"-joined multi-line fields\n",
    "cols_to_clean = [\"Pathway\", \"Class\", \"Disease\", \"Gene\"]\n",
    "\n",
    "def extract_data(cell):\n",
    "    \"\"\"Turn 'ID description| ID description' text into an {ID: description} dict.\"\"\"\n",
    "    if pd.isna(cell):\n",
    "        return cell  # Leave missing values (e.g. absent PATHWAY) as they are\n",
    "    parsed = {}\n",
    "    for part in cell.split(\"|\"):\n",
    "        tokens = part.strip().split()\n",
    "        if len(tokens) >= 2:\n",
    "            # First token is the identifier, the rest is its description\n",
    "            parsed[tokens[0]] = ' '.join(tokens[1:])\n",
    "        elif len(tokens) == 1:\n",
    "            parsed[tokens[0]] = \"\"\n",
    "    return parsed\n",
    "\n",
    "# Apply the transformation to each column\n",
    "for col in cols_to_clean:\n",
    "    network_info[col] = network_info[col].apply(extract_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "id": "72a6992e-def7-4ada-abc6-080c31cec3fe",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Entry</th>\n",
       "      <th>Name</th>\n",
       "      <th>Definition</th>\n",
       "      <th>Expanded</th>\n",
       "      <th>Pathway</th>\n",
       "      <th>Class</th>\n",
       "      <th>Disease</th>\n",
       "      <th>Gene</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>N00002</td>\n",
       "      <td>BCR-ABL fusion kinase to RAS-ERK signaling pat...</td>\n",
       "      <td>BCR-ABL -&gt; GRB2 -&gt; SOS -&gt; RAS -&gt; RAF -&gt; MEK -&gt;...</td>\n",
       "      <td>(25v1,25v2) -&gt; 2885 -&gt; (6654,6655) -&gt; (3265,38...</td>\n",
       "      <td>{'hsa05220': 'Chronic myeloid leukemia'}</td>\n",
       "      <td>{'nt06276': 'Chronic myeloid leukemia', 'nt062...</td>\n",
       "      <td>{'H00004': 'Chronic myeloid leukemia'}</td>\n",
       "      <td>{'25': 'ABL1; ABL proto-oncogene 1, non-recept...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>N00003</td>\n",
       "      <td>Mutation-activated KIT to RAS-ERK signaling pa...</td>\n",
       "      <td>KIT* -&gt; GRB2 -&gt; SOS -&gt; RAS -&gt; RAF -&gt; MEK -&gt; ERK</td>\n",
       "      <td>3815v1 -&gt; 2885 -&gt; (6654,6655) -&gt; (3265,3845,48...</td>\n",
       "      <td>{'hsa05221': 'Acute myeloid leukemia'}</td>\n",
       "      <td>{'nt06275': 'Acute myeloid leukemia', 'nt06210...</td>\n",
       "      <td>{'H00003': 'Acute myeloid leukemia'}</td>\n",
       "      <td>{'3815': 'KIT; KIT proto-oncogene receptor tyr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>N00004</td>\n",
       "      <td>Duplication or mutation-activated FLT3 to RAS-...</td>\n",
       "      <td>FLT3* -&gt; GRB2 -&gt; SOS -&gt; RAS -&gt; RAF -&gt; MEK -&gt; ERK</td>\n",
       "      <td>(2322v2,2322v1) -&gt; 2885 -&gt; (6654,6655) -&gt; (326...</td>\n",
       "      <td>{'hsa05221': 'Acute myeloid leukemia'}</td>\n",
       "      <td>{'nt06275': 'Acute myeloid leukemia', 'nt06210...</td>\n",
       "      <td>{'H00003': 'Acute myeloid leukemia'}</td>\n",
       "      <td>{'2322': 'FLT3; fms related tyrosine kinase 3'...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>N00005</td>\n",
       "      <td>Mutation-activated MET to RAS-ERK signaling pa...</td>\n",
       "      <td>MET* -&gt; GRB2 -&gt; SOS -&gt; RAS -&gt; RAF -&gt; MEK -&gt; ER...</td>\n",
       "      <td>4233v1 -&gt; 2885 -&gt; (6654,6655) -&gt; (3265,3845,48...</td>\n",
       "      <td>{'hsa05225': 'Hepatocellular carcinoma', 'hsa0...</td>\n",
       "      <td>{'nt06263': 'Hepatocellular carcinoma', 'nt062...</td>\n",
       "      <td>{'H00048': 'Hepatocellular carcinoma', 'H00021...</td>\n",
       "      <td>{'4233': 'MET; MET proto-oncogene, receptor ty...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>N00007</td>\n",
       "      <td>EML4-ALK fusion kinase to RAS-ERK signaling pa...</td>\n",
       "      <td>EML4-ALK -&gt; RAS -&gt; RAF -&gt; MEK -&gt; ERK -&gt; CCND1</td>\n",
       "      <td>(238v1,238v2) -&gt; (3265,3845,4893) -&gt; (369,673,...</td>\n",
       "      <td>{'hsa05223': 'Non-small cell lung cancer'}</td>\n",
       "      <td>{'nt06266': 'Non-small cell lung cancer', 'nt0...</td>\n",
       "      <td>{'H00014': 'Non-small cell lung cancer'}</td>\n",
       "      <td>{'238': 'ALK; ALK receptor tyrosine kinase', '...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>168</th>\n",
       "      <td>N01422</td>\n",
       "      <td>HPRT1 deficiency in purine salvage pathway</td>\n",
       "      <td>(Hypoxanthine,Guanine) // HPRT1*</td>\n",
       "      <td>(C00262,C00242) // 3251v1</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>{'nt06027': 'Purine salvage pathway'}</td>\n",
       "      <td>{'H00194': 'Lesch-Nyhan syndrome'}</td>\n",
       "      <td>{'3251': 'HPRT1; hypoxanthine phosphoribosyltr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>169</th>\n",
       "      <td>N01444</td>\n",
       "      <td>NXN mutation to WNT5A-ROR signaling pathway</td>\n",
       "      <td>NXN* -| DVL</td>\n",
       "      <td>64359v1 -| (1855,1856,1857)</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>{'nt06505': 'WNT signaling'}</td>\n",
       "      <td>{'H00485': 'Robinow syndrome'}</td>\n",
       "      <td>{'64359': 'NXN; nucleoredoxin', '1855': 'DVL1;...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>170</th>\n",
       "      <td>N01809</td>\n",
       "      <td>Mutation-caused epigenetic silencing of MMACHC</td>\n",
       "      <td>PRDX1* =| MMACHC</td>\n",
       "      <td>5052v1 =| 25974</td>\n",
       "      <td>{'hsa04980': 'Cobalamin transport and metaboli...</td>\n",
       "      <td>{'nt06538': 'Cobalamin transport and metabolism'}</td>\n",
       "      <td>{'H02221': 'Methylmalonic aciduria and homocys...</td>\n",
       "      <td>{'5052': 'PRDX1; peroxiredoxin 1', '25974': 'M...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>171</th>\n",
       "      <td>N01873</td>\n",
       "      <td>VHL mutation to HIF-2 signaling pathway</td>\n",
       "      <td>(VHL*+RBX1+ELOC+ELOB+CUL2) // EPAS1 == ARNT =&gt;...</td>\n",
       "      <td>(7428v3+9978+6921+6923+8453) // 2034 == 405 =&gt;...</td>\n",
       "      <td>{'hsa05211': 'Renal cell carcinoma'}</td>\n",
       "      <td>{'nt06542': 'HIF signaling'}</td>\n",
       "      <td>{'H00021': 'Renal cell carcinoma', 'H00559': '...</td>\n",
       "      <td>{'7428': 'VHL; von Hippel-Lindau tumor suppres...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>172</th>\n",
       "      <td>N01877</td>\n",
       "      <td>ERBB4 mutation to GF-RTK-PI3K signaling pathway</td>\n",
       "      <td>NRG // ERBB4*</td>\n",
       "      <td>(3084,9542,10718,145957) // 2066v1</td>\n",
       "      <td>{'hsa04012': 'ErbB signaling pathway'}</td>\n",
       "      <td>{'nt06543': 'NRG-ERBB signaling'}</td>\n",
       "      <td>{'H00058': 'Amyotrophic lateral sclerosis (ALS)'}</td>\n",
       "      <td>{'3084': 'NRG1; neuregulin 1', '9542': 'NRG2; ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>173 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      Entry                                               Name  \\\n",
       "0    N00002  BCR-ABL fusion kinase to RAS-ERK signaling pat...   \n",
       "1    N00003  Mutation-activated KIT to RAS-ERK signaling pa...   \n",
       "2    N00004  Duplication or mutation-activated FLT3 to RAS-...   \n",
       "3    N00005  Mutation-activated MET to RAS-ERK signaling pa...   \n",
       "4    N00007  EML4-ALK fusion kinase to RAS-ERK signaling pa...   \n",
       "..      ...                                                ...   \n",
       "168  N01422         HPRT1 deficiency in purine salvage pathway   \n",
       "169  N01444        NXN mutation to WNT5A-ROR signaling pathway   \n",
       "170  N01809     Mutation-caused epigenetic silencing of MMACHC   \n",
       "171  N01873            VHL mutation to HIF-2 signaling pathway   \n",
       "172  N01877    ERBB4 mutation to GF-RTK-PI3K signaling pathway   \n",
       "\n",
       "                                            Definition  \\\n",
       "0    BCR-ABL -> GRB2 -> SOS -> RAS -> RAF -> MEK ->...   \n",
       "1      KIT* -> GRB2 -> SOS -> RAS -> RAF -> MEK -> ERK   \n",
       "2     FLT3* -> GRB2 -> SOS -> RAS -> RAF -> MEK -> ERK   \n",
       "3    MET* -> GRB2 -> SOS -> RAS -> RAF -> MEK -> ER...   \n",
       "4        EML4-ALK -> RAS -> RAF -> MEK -> ERK -> CCND1   \n",
       "..                                                 ...   \n",
       "168                   (Hypoxanthine,Guanine) // HPRT1*   \n",
       "169                                        NXN* -| DVL   \n",
       "170                                   PRDX1* =| MMACHC   \n",
       "171  (VHL*+RBX1+ELOC+ELOB+CUL2) // EPAS1 == ARNT =>...   \n",
       "172                                      NRG // ERBB4*   \n",
       "\n",
       "                                              Expanded  \\\n",
       "0    (25v1,25v2) -> 2885 -> (6654,6655) -> (3265,38...   \n",
       "1    3815v1 -> 2885 -> (6654,6655) -> (3265,3845,48...   \n",
       "2    (2322v2,2322v1) -> 2885 -> (6654,6655) -> (326...   \n",
       "3    4233v1 -> 2885 -> (6654,6655) -> (3265,3845,48...   \n",
       "4    (238v1,238v2) -> (3265,3845,4893) -> (369,673,...   \n",
       "..                                                 ...   \n",
       "168                          (C00262,C00242) // 3251v1   \n",
       "169                        64359v1 -| (1855,1856,1857)   \n",
       "170                                    5052v1 =| 25974   \n",
       "171  (7428v3+9978+6921+6923+8453) // 2034 == 405 =>...   \n",
       "172                 (3084,9542,10718,145957) // 2066v1   \n",
       "\n",
       "                                               Pathway  \\\n",
       "0             {'hsa05220': 'Chronic myeloid leukemia'}   \n",
       "1               {'hsa05221': 'Acute myeloid leukemia'}   \n",
       "2               {'hsa05221': 'Acute myeloid leukemia'}   \n",
       "3    {'hsa05225': 'Hepatocellular carcinoma', 'hsa0...   \n",
       "4           {'hsa05223': 'Non-small cell lung cancer'}   \n",
       "..                                                 ...   \n",
       "168                                               <NA>   \n",
       "169                                               <NA>   \n",
       "170  {'hsa04980': 'Cobalamin transport and metaboli...   \n",
       "171               {'hsa05211': 'Renal cell carcinoma'}   \n",
       "172             {'hsa04012': 'ErbB signaling pathway'}   \n",
       "\n",
       "                                                 Class  \\\n",
       "0    {'nt06276': 'Chronic myeloid leukemia', 'nt062...   \n",
       "1    {'nt06275': 'Acute myeloid leukemia', 'nt06210...   \n",
       "2    {'nt06275': 'Acute myeloid leukemia', 'nt06210...   \n",
       "3    {'nt06263': 'Hepatocellular carcinoma', 'nt062...   \n",
       "4    {'nt06266': 'Non-small cell lung cancer', 'nt0...   \n",
       "..                                                 ...   \n",
       "168              {'nt06027': 'Purine salvage pathway'}   \n",
       "169                       {'nt06505': 'WNT signaling'}   \n",
       "170  {'nt06538': 'Cobalamin transport and metabolism'}   \n",
       "171                       {'nt06542': 'HIF signaling'}   \n",
       "172                  {'nt06543': 'NRG-ERBB signaling'}   \n",
       "\n",
       "                                               Disease  \\\n",
       "0               {'H00004': 'Chronic myeloid leukemia'}   \n",
       "1                 {'H00003': 'Acute myeloid leukemia'}   \n",
       "2                 {'H00003': 'Acute myeloid leukemia'}   \n",
       "3    {'H00048': 'Hepatocellular carcinoma', 'H00021...   \n",
       "4             {'H00014': 'Non-small cell lung cancer'}   \n",
       "..                                                 ...   \n",
       "168                 {'H00194': 'Lesch-Nyhan syndrome'}   \n",
       "169                     {'H00485': 'Robinow syndrome'}   \n",
       "170  {'H02221': 'Methylmalonic aciduria and homocys...   \n",
       "171  {'H00021': 'Renal cell carcinoma', 'H00559': '...   \n",
       "172  {'H00058': 'Amyotrophic lateral sclerosis (ALS)'}   \n",
       "\n",
       "                                                  Gene  \n",
       "0    {'25': 'ABL1; ABL proto-oncogene 1, non-recept...  \n",
       "1    {'3815': 'KIT; KIT proto-oncogene receptor tyr...  \n",
       "2    {'2322': 'FLT3; fms related tyrosine kinase 3'...  \n",
       "3    {'4233': 'MET; MET proto-oncogene, receptor ty...  \n",
       "4    {'238': 'ALK; ALK receptor tyrosine kinase', '...  \n",
       "..                                                 ...  \n",
       "168  {'3251': 'HPRT1; hypoxanthine phosphoribosyltr...  \n",
       "169  {'64359': 'NXN; nucleoredoxin', '1855': 'DVL1;...  \n",
       "170  {'5052': 'PRDX1; peroxiredoxin 1', '25974': 'M...  \n",
       "171  {'7428': 'VHL; von Hippel-Lindau tumor suppres...  \n",
       "172  {'3084': 'NRG1; neuregulin 1', '9542': 'NRG2; ...  \n",
       "\n",
       "[173 rows x 8 columns]"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "network_info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "id": "4844c8c8-7efc-4b21-ba7a-bd6eef0a7cf3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the cleaned network table as a tab-separated file\n",
    "network_info.to_csv(\"network_variant_final_info.tsv\", sep='\\t', header=True, index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "id": "c432ed92-f45d-4893-8666-a71fa6256076",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['H00003', 'H00004', 'H00013', 'H00014', 'H00018', 'H00019', 'H00020', 'H00021', 'H00022', 'H00024', 'H00026', 'H00031', 'H00032', 'H00033', 'H00034', 'H00038', 'H00039', 'H00042', 'H00048', 'H00056', 'H00057', 'H00058', 'H00059', 'H00061', 'H00063', 'H00126', 'H00135', 'H00194', 'H00195', 'H00246', 'H00247', 'H00251', 'H00260', 'H00423', 'H00485', 'H00559', 'H01032', 'H01102', 'H01398', 'H01431', 'H01522', 'H01603', 'H02049', 'H02221']\n"
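     ]
    }
   ],
   "source": [
    "# Collect every KEGG disease ID (H number) referenced in the Disease column\n",
    "all_disease_keys = []\n",
    "\n",
    "for disease in network_info['Disease']:\n",
    "    # Skip entries that are not dicts (e.g. missing values)\n",
    "    if isinstance(disease, dict):\n",
    "        all_disease_keys.extend(disease.keys())\n",
    "\n",
    "# De-duplicate and sort so the order is stable between runs\n",
    "unique_disease_keys = sorted(set(all_disease_keys))\n",
    "print(unique_disease_keys)"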
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "9b0a7a42-fef6-4300-9272-3973be631880",
   "metadata": {},
   "outputs": [],
   "source": [
    "import subprocess\n",
    "\n",
    "# Map each KEGG disease ID to the DESCRIPTION line(s) of its flat-file entry\n",
    "disease_dict = {}\n",
    "\n",
    "for disease in unique_disease_keys:\n",
    "    try:\n",
    "        # Call kegg_pull directly (no shell) and filter its output in Python\n",
    "        result = subprocess.run(\n",
    "            [\"kegg_pull\", \"rest\", \"get\", disease],\n",
    "            capture_output=True,\n",
    "            text=True\n",
    "        )\n",
    "        # Keep only the lines that `grep DESCRIPTION` would have matched\n",
    "        matches = [line for line in result.stdout.splitlines() if \"DESCRIPTION\" in line]\n",
    "        # Store None when the entry has no DESCRIPTION field\n",
    "        disease_dict[disease] = \"\\n\".join(matches).strip() or None\n",
    "    except Exception as e:\n",
    "        # Raised, for example, when kegg_pull is not installed or not on PATH\n",
    "        disease_dict[disease] = f\"Error: {str(e)}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "id": "7ce66864-aa4d-47f3-9843-b2a96d2e188b",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'H00003': 'DESCRIPTION Acute myeloid leukemia (AML) is a disease that is characterized by uncontrolled proliferation of clonal neoplastic cells and accumulation in the bone marrow of blasts with an impaired differentiation program. AML accounts for approximately 80% of all adult leukemias and remains the most common cause of leukemia death. Two major types of genetic events have been described that are crucial for leukemic transformation. A proposed necessary first event is disordered cell growth and upregulation of cell survival genes. The most common of these activating events were observed in the RTK Flt3, in N-Ras and K-Ras, in Kit, and sporadically in other RTKs. Alterations in myeloid transcription factors governing hematopoietic differentiation provide second necessary event for leukemogenesis. Transcription factor fusion proteins such as PML-RARalpha (in Acute promyelocytic leukemia, a subtype of AML), AML-ETO or PLZF-RARalpha block myeloid cell differentiation by repressing target genes. In other cases, the transcription factors themselves are mutated.',\n",
       " 'H00004': 'DESCRIPTION Chronic myeloid leukemia (CML) is a clonal myeloproliferative disorder of a pluripotent stem cell. The natural history of CML has a triphasic clinical course comprising of an initial chronic phase (CP), which is characterized by expansion of functionally normal myeloid cells, followed by an accelerated phase (AP) and finally a more aggressive blast phase (BP), with loss of terminal differentiation capacity. On the cellular level, CML is associated with a specific chromosome abnormality, the t(9; 22) reciprocal translocation that forms the Philadelphia (Ph) chromosome. The Ph chromosome is the result of a molecular rearrangement between the c-ABL proto-oncogene on chromosome 9 and the BCR (breakpoint cluster region) gene on chromosome 22. The BCR/ABL fusion gene encodes p210 BCR/ABL, an oncoprotein, which, unlike the normal p145 c-Abl, has constitutive tyrosine kinase activity and is predominantly localized in the cytoplasm. While fusion of c-ABL and BCR is believed to be the primary cause of the chronic phase of CML, progression to blast crisis requires other molecular changes. Common secondary abnormalities include mutations in TP53, RB, and p16/INK4A, or overexpression of genes such as EVI1. Additional chromosome translocations are also observed,such as t(3;21)(q26;q22), which generates AML1-EVI1.',\n",
       " 'H00013': 'DESCRIPTION Lung cancer is a leading cause of cancer death among men and women in industrialized countries. Small cell lung carcinoma (SCLC) is a highly aggressive neoplasm, which accounts for approximately 25% of all lung cancer cases. Molecular mechanisms altered in SCLC include induced expression of oncogene, MYC, and loss of tumorsuppressor genes, such as p53, PTEN, RB, and FHIT. The overexpression of MYC proteins in SCLC is largely a result of gene amplification. Such overexpression leads to more rapid proliferation and loss of terminal differentiation. Mutation or deletion of p53 or PTEN can lead to more rapid proliferation and reduced apoptosis. The retinoblastoma gene RB1 encodes a nuclear phosphoprotein that helps to regulate cell-cycle progression. The fragile histidine triad gene FHIT encodes the enzyme diadenosine triphosphate hydrolase, which is thought to have an indirect role in proapoptosis and cell-cycle control.',\n",
       " 'H00014': 'DESCRIPTION Lung cancer is a leading cause of cancer death among men and women in industrialized countries. Non-small-cell lung cancer (NSCLC) accounts for approximately 85% of lung cancer and represents a heterogeneous group of cancers, consisting mainly of squamous cell (SCC), adeno (AC) and large-cell carcinoma. Molecular mechanisms altered in NSCLC include activation of oncogenes, such as K-RAS, EGFR and EML4-ALK, and inactivation of tumorsuppressor genes, such as p53, p16INK4a, RAR-beta, and RASSF1. Point mutations within the K-RAS gene inactivate GTPase activity and the p21-RAS protein continuously transmits growth signals to the nucleus. Mutations or overexpression of EGFR leads to a proliferative advantage. EML4-ALK fusion leads to constitutive ALK activation, which causes cell proliferation, invasion, and inhibition of apoptosis. Inactivating mutation of p53 can lead to more rapid proliferation and reduced apoptosis. The protein encoded by the p16INK4a inhibits formation of CDK-cyclin-D complexes by competitive binding of CDK4 and CDK6. Loss of p16INK4a expression is a common feature of NSCLC. RAR-beta is a nuclear receptor that bears vitamin-A-dependent transcriptional activity. RASSF1A is able to form heterodimers with Nore-1, an RAS effector. Therefore loss of RASSF1A might shift the balance of RAS activity towards a growth-promoting effect.',\n",
       " 'H00018': \"DESCRIPTION Gastric cancer (GC) is one of the world's most common cancers. According to Lauren's histological classification gastric cancer is divided into two distinct histological groups - the intestinal and diffuse types. Several genetic changes have been identified in intestinal-type GC. The intestinal metaplasia is characterized by mutations in p53 gene, reduced expression of retinoic acid receptor beta (RAR-beta) and hTERT expression. Gastric adenomas furthermore display mutations in the APC gene, reduced p27 expression and cyclin E amplification. In addition, amplification and overexpression of c-ErbB2, reduced TGF-beta receptor type I (TGFBRI) expression and complete loss of p27 expression are commonly observed in more advanced GC. The main molecular changes observed in diffuse-type GCs include loss of E-cadherin function by mutations in CDH1and amplification of MET and FGFR2F.\",\n",
       " 'H00019': \"DESCRIPTION Infiltrating ductal adenocarcinoma is the most common malignancy of the pancreas. When most investigators use the term 'pancreatic cancer' they are referring to pancreatic ductal adenocarcinoma (PDA). Normal duct epithelium progresses to infiltrating cancer through a series of histologically defined precursors. The overexpression of HER-2/neu and activating point mutations in the K-ras gene occur early, inactivation of the p16 gene at an intermediate stage, and the inactivation of p53, SMAD4, and BRCA2 occur relatively late. Activated K-ras engages multiple effector pathways. Although EGF receptors are conventionally regarded as upstream activators of RAS proteins, they can also act as RAS signal transducers via RAS-induced autocrine activation of the EGFR family ligands. Moreover, PDA shows extensive genomic instability and aneuploidy. Telomere attrition and mutations in p53 and BRCA2 are likely to contribute to these phenotypes. Inactivation of the SMAD4 tumour suppressor gene leads to loss of the inhibitory influence of the transforming growth factor-beta signalling pathway.\",\n",
       " 'H00020': 'DESCRIPTION Colorectal cancer (CRC) is the second largest cause of cancer-related deaths in Western countries. CRC arises from the colorectal epithelium as a result of the accumulation of genetic alterations in defined oncogenes and tumour suppressor genes (TSG). Two major mechanisms of genomic instability have been identified in sporadic CRC progression. The first, known as chromosomal instability (CIN), results from a series of genetic changes that involve the activation of oncogenes such as K-ras and inactivation of TSG such as p53, DCC/Smad4, and APC. The second, known as microsatellite instability (MSI), results from inactivation of the DNA mismatch repair genes MLH1 and/or MSH2 by hypermethylation of their promoter, and secondary mutation of genes with coding microsatellites, such as transforming growth factor receptor II (TGF-RII) and BAX. Hereditary syndromes have germline mutations in specific genes (mutation in the tumour suppressor gene APC on chromosome 5q in FAP, mutated DNA mismatch repair genes in HNPCC).',\n",
       " 'H00021': 'DESCRIPTION Renal cell cancer (RCC) accounts for ~3% of human malignancies and its incidence appears to be rising. Although most cases of RCC seem to occur sporadically, an inherited predisposition to renal cancer accounts for 1-4% of cases. RCC is not a single disease, it has several morphological subtypes. Conventional RCC (clear cell RCC) accounts for ~80% of cases, followed by papillary RCC (10-15%), chromophobe RCC (5%), and collecting duct RCC (<1%). Genes potentially involved in sporadic neoplasms of each particular type are VHL, MET, BHD, and FH respectively. In the absence of VHL, hypoxia-inducible factor alpha (HIF-alpha) accumulates, leading to production of several growth factors, including vascular endothelial growth factor and platelet-derived growth factor. Activated MET mediates a number of biological effects including motility, invasion of extracellular matrix, cellular transformation, prevention of apoptosis and metastasis formation. Loss of functional FH leads to accumulation of fumarate in the cell, triggering inhibition of HPH and preventing targeted pVHL-mediated degradation of HIF-alpha. BHD mutations cause the Birt-Hogg-Dube syndrome and its associated chromophobe, hybrid oncocytic, and conventional (clear cell) RCC.',\n",
       " 'H00022': 'DESCRIPTION The urothelium covers the luminal surface of almost the entire urinary tract, extending from the renal pelvis, through the ureter and bladder, to the proximal urethra. The majority of urothelial carcinoma are bladder carcinomas, and urothelial carcinomas of the renal pelvis and ureter account for only approximately 7% of the total. Urothelial tumours arise and evolve through divergent phenotypic pathways. Some tumours progress from urothelial hyperplasia to low-grade non-invasive superficial papillary tumours. More aggressive variants arise either from flat, high-grade carcinoma in situ (CIS) and progress to invasive tumours, or they arise de novo as invasive tumours. Low-grade papillary tumors frequently show a constitutive activation of the receptor tyrosine kinase-Ras pathway, exhibiting activating mutations in the HRAS and fibroblast growth factor receptor 3 (FGFR3) genes. In contrast, CIS and invasive tumors frequently show alterations in the TP53 and RB genes and pathways. Invasion and metastases are promoted by several factors that alter the tumour microenvironment, including the aberrant expression of E-cadherins (E-cad), matrix metalloproteinases (MMPs), angiogenic factors such as vascular endothelial growth factor (VEGF).',\n",
       " 'H00024': 'DESCRIPTION Prostate cancer constitutes a major health problem in Western countries. It is the most frequently diagnosed cancer among men and the second leading cause of male cancer deaths. The identification of key molecular alterations in prostate-cancer cells implicates carcinogen defenses (GSTP1), growth-factor-signaling pathways (NKX3.1, PTEN, and p27), and androgens (AR) as critical determinants of the phenotype of prostate-cancer cells. Glutathione S-transferases (GSTP1) are detoxifying enzymes. Cells of prostatic intraepithelial neoplasia, devoid of GSTP1, undergo genomic damage mediated by carcinogens. NKX3.1, PTEN, and p27 regulate the growth and survival of prostate cells in the normal prostate. Inadequate levels of PTEN and NKX3.1 lead to a reduction in p27 levels and to increased proliferation and decreased apoptosis. Androgen receptor (AR) is a transcription factor that is normally activated by its androgen ligand. During androgen withdrawal therapy, the AR signal transduction pathway also could be activated by amplification of the AR gene, by AR gene mutations, or by altered activity of AR coactivators. Through these mechanisms, tumor cells lead to the emergence of androgen-independent prostate cancer.',\n",
       " 'H00026': 'DESCRIPTION Endometrial cancer (EC) is the most common gynaecological malignancy and the fourth most common malignancy in women in the developed world after breast, colorectal and lung cancer. Two types of endometrial carcinoma are distinguished with respect to biology and clinical course. Type-I carcinoma is related to hyperestrogenism by association with endometrial hyperplasia, frequent expression of estrogen and progesterone receptors and younger age, whereas type-II carcinoma is unrelated to estrogen, associated with atrophic endometrium, frequent lack of estrogen and progesterone receptors and older age. The morphologic differences in these cancers are mirrored in their molecular genetic profile with type I showing defects in DNA-mismatch repair and mutations in PTEN, K-ras, and beta-catenin, and type II showing aneuploidy, p53 mutations, and her2/neu amplification.',\n",
       " 'H00031': 'DESCRIPTION Breast cancer is the leading cause of cancer death among women worldwide. The vast majority of breast cancers are carcinomas that originate from cells lining the milk-forming ducts of the mammary gland. The molecular subtypes of breast cancer, which are based on the presence or absence of hormone receptors (estrogen and progesterone subtypes) and human epidermal growth factor receptor-2 (HER2), include: hormone receptor positive and HER2 negative (luminal A subtype), hormone receptor positive and HER2 positive (luminal B subtype), hormone receptor negative and HER2 positive (HER2 positive), and hormone receptor negative and HER2 negative (basal-like or triple-negative breast cancers (TNBCs)). Hormone receptor positive breast cancers are largely driven by the estrogen/ER pathway. In HER2 positive breast tumours, HER2 activates the PI3K/AKT and the RAS/RAF/MAPK pathways, and stimulate cell growth, survival and differentiation. In patients suffering from TNBC, the deregulation of various signalling pathways (Notch, Wnt/beta-catenin, and EGFR) have been confirmed.',\n",
       " 'H00032': 'DESCRIPTION Thyroid cancer is the most common endocrine malignancy and accounts for the majority of endocrine cancer- related deaths each year. More than 95% of thyroid carcinomas are derived from follicular cells. Their behavior varies from the indolent growing, well-differentiated papillary and follicular carcinomas (PTC and FTC, respectively) to the extremely aggressive undifferentiated carcinoma (UC). Somatic rearrangements of RET and TRK are almost exclusively found in PTC and may be found in early stages. The most distinctive molecular features of FTC are the prominence of aneuploidy and the high prevalence of RAS mutations and PAX8-PPAR{gamma} rearrangements. p53 seems to play a crucial role in the dedifferentiation process of thyroid carcinoma.',\n",
       " 'H00033': 'DESCRIPTION Adrenocortical carcinoma (ACC) is a rare endocrine malignancy defined by a heterogeneous clinical presentation, dismal prognosis, and lack of effective therapeutic regimens. The incidence of ACC ranges from 0.5 to 2 cases per million people per year, accounting for 0.02% of all reported cancers. Unfortunately, most patients present with metastatic disease which reduces the 5 year survival rate to less than 10%. Oncogenes and tumor-suppressor genes involved in adrenal carcinomas include mutations in the p53 tumor-suppressor gene and rearrangements of the chromosomal locus 11p15.5 associated with IGF II hyperexpression. Deletions of the ACTH receptor gene have recently been found in undifferentiated adenomas and in aggressive ACCs.',\n",
       " 'H00034': 'DESCRIPTION Carcinoid tumors are relatively uncommon neoplasms that nonetheless comprise up to 85% of neuroendocrine gastrointestinal neoplasms. They most frequently occur in the midgut and develop from neuroendocrine cells that are normally and diffusely present in this location. Most carcinoids are sporadic but epidemiological studies report a familial risk. Moreover, carcinoids can occur within the multiple endocrine neoplasia (MEN) syndrome, a rare familiar tumor syndrome in which mutations in the MEN1 gene are manifested. Recently, it has been shown that a majority (78%) of sporadic carcinoids display loss of heterozygosity for markers around the MEN 1 region, thus suggesting involvement of this gene in the pathogenesis of both familial and sporadic carcinoids.',\n",
       " 'H00038': 'DESCRIPTION Melanoma is a form of skin cancer that has a poor prognosis and which is on the rise in Western populations. Melanoma arises from the malignant transformation of pigment-producing cells, melanocytes. The only known environmental risk factor is exposure to ultraviolet (UV) light and in people with fair skin the risk is greatly increased. Melanoma pathogenesis is also driven by genetic factors. Oncogenic NRAS mutations activate both effector pathways Raf-MEK-ERK and PI3K-Akt. The Raf-MEK-ERK pathway may also be activated via mutations in the BRAF gene. The PI3K-Akt pathway may be activated through loss or mutation of the inhibitory tumor suppressor gene PTEN. These mutations arise early during melanoma pathogenesis and are preserved throughout tumor progression. Melanoma development has been shown to be strongly associated with inactivation of the p16INK4a/cyclin dependent kinases 4 and 6/retinoblastoma protein (p16INK4a/CDK4,6/pRb) and p14ARF/human double minute 2/p53 (p14ARF/HMD2/p53) tumor suppressor pathways. MITF and TP53 are implicated in further melanoma progression.',\n",
       " 'H00039': 'DESCRIPTION Cancer of the skin is the most common cancer in Caucasians and basal cell carcinomas (BCC) account for 90% of all skin cancers. The vast majority of BCC cases are sporadic, though there is a rare familial syndrome basal cell nevus syndrome (BCNS, or Gorlin syndrome) that predisposes to development of BCC. In addition, there is strong epidemiological and genetic evidence that demonstrates UV exposure as a risk factor of prime importance. The development of basal cell carcinoma is associated with constitutive activation of sonic hedgehog signaling. The mutations in SMOH, PTCH1, and SHH in BCCs result in continuous activation of target genes. At a cellular level, sonic hedgehog signaling promotes cell proliferation. Mutations in TP53 are also found with high frequency (>50%) in sporadic BCC.',\n",
       " 'H00042': 'DESCRIPTION Gliomas are the most common of the primary brain tumors and account for more than 40% of all central nervous system neoplasms. Gliomas include tumours that are composed predominantly of astrocytes (astrocytomas), oligodendrocytes (oligodendrogliomas), mixtures of various glial cells (for example,oligoastrocytomas) and ependymal cells (ependymomas). The most malignant form of infiltrating astrocytoma - glioblastoma multiforme (GBM) - is one of the most aggressive human cancers. GBM may develop de novo (primary glioblastoma) or by progression from low-grade or anaplastic astrocytoma (secondary glioblastoma). Primary glioblastomas develop in older patients and typically show genetic alterations (EGFR amplification, p16/INK4a deletion, and PTEN mutations) at frequencies of 24-34%. Secondary glioblastomas develop in younger patients and frequently show overexpression of PDGF and CDK4 as well as p53 mutations (65%) and loss of Rb playing major roles in such transformations. Loss of PTEN has been implicated in both pathways, although it is much more common in the pathogenesis of primary GBM.',\n",
       " 'H00048': 'DESCRIPTION Hepatocellular carcinoma (HCC) is a major type of primary liver cancer and one of the rare human neoplasms etiologically linked to viral factors. It has been shown that, after HBV/HCV infection and alcohol or aflatoxin B1 exposure, genetic and epigenetic changes occur. The recurrent mutated genes were found to be highly enriched in multiple key driver signaling processes, including telomere maintenance, TP53, cell cycle regulation, the Wnt/beta-catenin pathway (CTNNB1 and AXIN1), the phosphatidylinositol-3 kinase (PI3K)/AKT/mammalian target of rapamycin (mTOR) pathway. Recent studies using whole-exome sequencing have revealed recurrent mutations in new driver genes involved in the chromatin remodelling (ARID1A and ARID2) and the oxidative stress (NFE2L2) pathways.',\n",
       " 'H00056': 'DESCRIPTION Alzheimer disease (AD) is a chronic disorder that slowly destroys neurons and causes serious cognitive disability. AD is associated with senile plaques and neurofibrillary tangles (NFTs). Amyloid-beta (Abeta), a major component of senile plaques, has various pathological effects on cell and organelle function. To date genetic studies have revealed four genes that may be linked to autosomal dominant or familial early onset AD (FAD). These four genes include: amyloid precursor protein (APP), presenilin 1 (PS1), presenilin 2 (PS2), and apolipoprotein E (ApoE). All mutations associated with APP and PS proteins can lead to an increase in the production of Abeta peptides, specifically the more amyloidogenic form, Abeta42. It was proposed that Abeta forms Ca2+ permeable pores and binds to and modulates multiple synaptic proteins, including NMDAR, mGluR5, and VGCC, leading to the overfilling of neurons with calcium ions. Consequently, cellular Ca2+ disruptions will lead to neuronal apoptosis, autophagy deficits, mitochondrial abnormality, defective neurotransmission, impaired synaptic plasticity, and neurodegeneration in AD. FAD-linked PS1 mutation downregulates the unfolded protein response and leads to vulnerability to ER stress.',\n",
       " 'H00057': 'DESCRIPTION Parkinson disease (PD) is a progressive neurodegenerative movement disorder that results primarily from the death of dopaminergic (DA) neurons in the substantia nigra pars compacta (SNc). Both environmental factors and mutations in familial PD-linked genes such as SNCA, Parkin, DJ-1, PINK1 and LRRK2 are associated with PD pathogenesis. These pathogenic mutations and environmental factors are known to cause disease due to oxidative stress, intracellular Ca2+ homeostasis impairment, mitochondrial dysfunctions and altered protein handling compromising key roles of DA neuronal function and survival. The demise of DA neurons located in the SNc leads to a drop in the dopaminergic input to the striatum, which is hypothesized to impede movement by inducing hypo and hyper activity in striatal spiny projection neurons (SPNs) of the direct (dSPNs) and indirect (iSPNs) pathways in the basal ganglia, respectively.',\n",
       " 'H00058': 'DESCRIPTION Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disorder characterized by a progressive degeneration of motor neurons in the brain and spinal cord. In 90% of patients, ALS is sporadic, with no clear genetic linkage. On the other hand, the remaining 10% of cases show familial inheritance, with mutations in SOD1, TDP43(TARDBP), FUS, or C9orf72 genes being the most frequent causes. In spite of such difference, familial ALS and sporadic ALS have similarities in their pathological features. Proposed disease mechanisms contributing to motor neuron degeneration in ALS are: impaired proteostasis, aberrant RNA processing, mitochondrial disfunction and oxidative stress, microglia activation, and axonal dysfunction.',\n",
       " 'H00059': 'DESCRIPTION Huntington disease (HD) is an autosomal-dominant neurodegenerative disorder that primarily affects medium spiny striatal neurons (MSN). The symptoms are choreiform, involuntary movements, personality changes and dementia. HD is caused by a CAG repeat expansion in the IT15 gene, which results in a long stretch of polyglutamine (polyQ) close to the amino-terminus of the HD protein huntingtin (Htt). Mutant Htt (mHtt) has effects both in the cytoplasm and in the nucleus. Full-length Htt is cleaved by proteases in the cytoplasm, leading to the formation of cytoplasmic and neuritic aggregates. mHtt also alters vesicular transport and recycling, causes cytosolic and mitochondrial Ca2+  overload, triggers endoplasmic reticulum stress through proteasomal dysfunction, and impairs autophagy function, increasing neuronal death susceptibility. N-terminal fragments containing the polyQ stretch translocate to the nucleus where they impair transcription and induce neuronal death.',\n",
       " 'H00061': 'DESCRIPTION Prion diseases, also termed transmissible spongiform encephalopathies (TSEs), are a group of fatal neurodegenerative diseases that affect humans and a number of other animal species. The etiology of these diseases is thought to be associated with the conversion of a normal protein, PrPC, into an infectious, pathogenic form, PrPSc. The conversion is induced by prion infections (for example, variant Creutzfeldt-Jakob disease (vCJD), iatrogenic CJD, Kuru), mutations (familial CJD, Gerstmann-Straussler-Scheinker syndrome, fatal familial insomnia (FFI)) or unknown factors (sporadic CJD (sCJD)), and is thought to occur after PrPC has reached the plasma membrane or is re-internalized for degradation. The PrPSc form shows greater protease resistance than PrPC and accumulates in affected individuals, often in the form of extracellular plaques. Pathways that may lead to neuronal death comprise oxidative stress, regulated activation of complement, ubiquitin-proteasome and endosomal-lysosomal systems, synaptic alterations and dendritic atrophy, corticosteroid response, and endoplasmic reticulum stress. In addition, the conformational transition could lead to the lost of a beneficial activity of the natively folded protein, PrPC.',\n",
       " 'H00063': 'DESCRIPTION The autosomal dominant spinocerebellar ataxias (SCAs) are a group of progressive neurodegenerative diseases characterised by loss of balance and motor coordination due to the primary dysfunction of the cerebellum. Compelling evidence points to major aetiological roles for transcriptional dysregulation, protein aggregation and clearance, autophagy, the ubiquitin-proteasome system, alterations of calcium homeostasis, mitochondria defects, toxic RNA gain-of-function mechanisms and eventual cell death with apoptotic features of neurons during SCA disease progression.',\n",
       " 'H00126': 'DESCRIPTION Gaucher disease is an autosomal recessive lysosomal storage disorder caused by deficient beta-glucocerebrosidase (glucosylceramidase) activity or saposin C which is an activator of beta-glucocerebrosidase in sphingolipid metabolism. The enzymatic defects lead to the accumulation of glucosylceramide (GC) in lysosomes of affected cells. Despite the fact that Gaucher Disease consists of a phenotype, with varying degrees of severity, it has been sub-divided in three subtypes according to the presence or absence of neurological involvement. The sub-types are Type 1, 2 and 3.',\n",
       " 'H00135': 'DESCRIPTION Krabbe disease is an autosomal recessive disorder caused by deficient activity of galactosylceramidase.',\n",
       " 'H00194': 'DESCRIPTION Deficiency of hypoxanthine-guanine phosphoribosyltransferase activity is an inborn error of purine metabolism characterized by hyperuricemia with hyperuricosuria and a continuum spectrum of neurological manifestations.',\n",
       " 'H00195': 'DESCRIPTION Adenine phosphoribosyltransferase deficiency (APRTD) is an autosomal recessive disorder of purine metabolism and causes urolithiasis due to accumulation of the insoluble purine 2,8-dihydroxyadenine.',\n",
       " 'H00246': 'DESCRIPTION Familial hyperparathyroidism (HRPT) is characterized by parathyroid adenoma and hyperplasia with hypersecretion of parathyroid hormone and hypercalcaemia. It is caused by mutation in the HRPT2 (CDC73 or Parafibromin) gene that also causes the hyperparathyroidism-jaw tumor syndrome. Sporadic cases are also known to occur with somatic mutations within the MEN1 gene.',\n",
       " 'H00247': \"DESCRIPTION Multiple endocrine neoplasias (MEN) are autosomal dominant syndrome which is characterized by the occurrence of tumors involving two or more endocrine glands. Four major forms of MEN are recognized, namely MEN1, MEN2A, MEN2B and MEN4. MEN1, which is also referred as Wermer's syndrome, is characterized by parathyroid adenoma, gastrinoma, and pituitary adenoma. Gastrinomas are the most common type, leading to the Zollinger-Ellison Syndrome (see H01522). MEN2 is characterized by medullary thyroid cancer (MTC) and includes three subtypes: MEN2A (Sipple's syndrome), MEN2B (MEN3) and familial MTC. Patients with MEN2A develop MTC in association with phaeochromocytoma and parathyroid tumors. Patients with MEN2B develop MTC in association with marfanoid habitus, mucosal neuromas, medullated corneal fibers and intestinal autonomic ganglion dysfunction, leading to megacolon. MEN4, also referred to as MENX, appears to have signs and symptoms similar to those of type 1. However MEN4 patients have mutations in other genes. The mutations in their responsible genes are found in Each MEN syndrome.\",\n",
       " 'H00251': 'DESCRIPTION Thyroid dyshormonogenesis is a genetically heterogeneous group of inherited disorders in the enzymatic cascade of thyroid hormone synthesis that result in congenital hypothyroidism due to genetic defects in the synthesis of thyroid hormones.',\n",
       " 'H00260': \"DESCRIPTION Primary pigmented micronodular adrenocortical disease (PPNAD) is a form of ACTH-independent adrenal hyperplasia resulting in endogenous Cushing's syndrome.\",\n",
       " 'H00423': 'DESCRIPTION The sphingolipidoses are a group of monogenic inherited diseases caused by defects in the system of lysosomal sphingolipid degradation, with subsequent accumulation of non-degradable storage material in one or more organs.',\n",
       " 'H00485': 'DESCRIPTION Robinow syndrome (RS) is a rare genetically heterogeneous condition characterized by hypertelorism, nasal features (large nasal bridge, short upturned nose, and anteverted nares), midface hypoplasia, mesomelic limb shortening, brachydactyly, clinodactyly, micropenis, and short stature. Both autosomal recessive and autosomal dominant inheritance have been described. The phenotypic presentation in both types of RS overlaps; however, subtle variances in the severity of craniofacial, musculoskeletal, cardiovascular, and urogenital characteristics may be present. In general, autosomal recessive RS (RRS) patients have more severe dysmorphology than autosomal dominant RS (DRS), especially in the musculoskeletal system.',\n",
       " 'H00559': 'DESCRIPTION von Hippel-Lindau syndrome is an autosomal dominant disorder associated with tumors in the central nervous system and other organs. The most frequent tumors are cerebellar and retinal haemangioblastomas, pancreatic neuroendocrine tumors, renal cell carcinoma, phaeochromocytoma in the adrenal gland, epididymal cystadenoma, and endolymphatic sac tumors. Germline inactivation of VHL tumor suppressor protein leads to the upregulation of HIF and promotes to carcinogenesis.',\n",
       " 'H01032': 'DESCRIPTION N-acetylglutamate synthase (NAGS) deficiency is a rare inborn error of metabolism affecting ammonia detoxification in the urea cycle. The N-acetylglutamate is the absolutely required allosteric activator of the first urea cycle enzyme carbamoylphosphate synthetase 1 (CPS1). In defects of NAGS, the urea cycle function can be severely affected resulting in fatal hyperammonemia in neonatal patients or at any later stage in life. Clinical features of NAGS deficiency include poor feeding, vomiting, altered level of consciousness, seizures, and coma.',\n",
       " 'H01102': 'DESCRIPTION Pituitary adenomas are an important and frequently occurring form of intracranial tumor. They are usually benign but can give rise to severe clinical syndromes due to hormonal excess, or to visual/cranial disturbances due to mass effect. The tumor can be clinically nonfunctioning or hormone secreting. Among the latter, prolactin (PRL) and growth hormone (GH)-secreting adenomas are the most common. The majority of pituitary adenomas arise sporadically, although a subset occurs as component tumors of well-characterized familial cancer syndromes, such as multiple endocrine neoplasia (MEN) [DS:H00247], and Carney complex (CNC) [DS:H01820].',\n",
       " 'H01398': 'DESCRIPTION Hyperammonemia is a metabolic condition characterized by elevated levels of ammonia in the blood, and may result in irreversible brain damage if not treated early and thoroughly. Hyperammonemia can be classified into primary or secondary hyperammonemia depending on the underlying pathophysiology. Detoxification of ammonia is mainly accomplished by the urea cycle in periportal hepatocytes. If the urea cycle is directly affected by a defect of any of the involved enzymes or transporters, this results in primary hyperammonemia.',\n",
       " 'H01431': \"DESCRIPTION Cushing syndrome (CS) is a rare disorder resulting from prolonged exposure to excess glucocorticoids via exogenous and endogenous sources. The typical clinical features of CS are related to hypercortisolism and include accumulation of central fat, moon facies, neuromuscular weakness, osteoporosis or bone fractures, metabolic complications, and mood changes. Traditionally, endogenous CS is classified as adrenocorticotropic hormone (ACTH)-dependent (about 80%) or ACTH- independent (about 20%). Among ACTH-dependent forms, pituitary corticotroph adenoma (Cushing's disease) is most common. Most pituitary tumors are sporadic, resulting from monoclonal expansion of a single mutated cell. Recently recurrent activating somatic driver mutations in the ubiquitin-specific protease 8 gene (USP8) were identified in almost half of corticotroph adenoma. Germline mutations in MEN1 (encoding menin), AIP (encoding aryl-hydrocarbon receptor-interacting protein), PRKAR1A (encoding cAMP-dependent protein kinase type I alpha regulatory subunit) and CDKN1B (encoding cyclin-dependent kinase inhibitor 1B; also known as p27 Kip1) have been identified in familial forms of pituitary adenomas. However, the frequency of familial pituitary adenomas is less than 5% in patients with pituitary adenomas. Among ACTH-independent CS, adrenal adenoma is most common. Rare adrenal causes of CS include primary bilateral macronodular adrenal hyperplasia (BMAH) or primary pigmented nodular adrenocortical disease (PPNAD).\",\n",
       " 'H01522': 'DESCRIPTION Zollinger-Ellison syndrome (ZES) is a rare endocrinopathy caused by tumors of the pancreas and duodenum. These tumors, called gastrinomas, release gastrin to produce large amounts of acid that result in severe gastroesophageal peptic ulcer disease and diarrhea. Most ZES cases are sporadic, but about over 20 percent are caused by an inherited genetic disorder called multiple endocrine neoplasia type 1 (MEN1) [DS:H00247]. The clinical presentation is not specific for this disease and there is overlap of symptoms similar to those of a peptic ulcer. The most common symptoms include abdominal pain and diarrhea, sometimes accompanied by heartburn, nausea, and weight loss. Peptic ulceration complicated by bleeding is present in 25% of patients, and is more frequently in patients with sporadic ZES than in those with MEN1. In addition, the gastrinomas may be cancerous. The cancer can be spread to other parts of the body, most commonly to regional lymph nodes and the liver. The treatment of the ZES includes surgical removal and medical management of gastric acid hypersecretion for the prevention of malignant transformation and the genesis of complications.',\n",
       " 'H01603': 'DESCRIPTION Primary aldosteronism is a clinical syndrome characterized by excess secretion of aldosterone from the adrenal gland. It is manifested by hypertension and hyporeninemia. In the past, hypokalemia was thought to be a mandatory finding in primary aldosteronism. However, later studies confirmed that most patients with primary aldosteronism are normokalemic. The prevalence of primary aldosteronism among nonselected hypertensive persons is between 5% and 13%, and it is now recognized to be the most common form of secondary hypertension. There are the seven subtypes of primary aldosteronism. Aldosterone-producing adenoma (APA) and bilateral idiopathic hyperaldosteronism (IHA) are the most common subtypes of primary aldosteronism. Unilateral adrenal hyperplasia, aldosterone-producing adrenocortical carcinoma, ectopic aldosterone-producing adenoma, and familial hyperaldosteronism (type I and typeII) are unusual subtypes. Somatic mutations in KCNJ5, ATP1A1, ATP2B3, and CACNA1D have been described in APAs. Usually, adenomas are managed surgically and bilateral hyperplasia, medically.',\n",
       " 'H02049': \"DESCRIPTION Bilateral macronodular adrenal hyperplasia (BMAH) is an adrenal disorder characterized by bilateral benign adrenocortical nodules associated with variable levels of cortisol excess. BMAH is an adrenal cause of Cushing's syndrome (CS). An increased activity of the cAMP/PKA pathway is found in the various forms of BMAH. Actors of the cAMP/PKA signaling pathway or genes causing a hereditary familial tumor syndrome including adenomatous polyposis coli gene (APC), menin (MEN1) and fumarate hydratase (FH) can favor or be responsible for the development of BMAH. Recently, a new gene, ARMC5, was identified as a frequent cause of sporadic or familial BMAH.\",\n",
       " 'H02221': 'DESCRIPTION Methylmalonic aciduria and homocystinuria (MAHC) is caused by defects of intracellular cobalamin (vitamin B12) metabolism. Derivatives of cobalamin are essential cofactors for enzymes required in intermediary metabolism, and its defects lead to the accumulation of methylmalonic acid and/or homocysteine in blood and urine. Affected persons present with multisystem clinical abnormalities, including developmental, hematologic, neurologic, and metabolic findings.'}"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "disease_dict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "id": "97d091ec-097c-4028-bcaa-5c3e01ff0d01",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Columns whose disease IDs should be mapped to their full descriptions\n",
    "cols_to_edit = [\"Disease\"]\n",
    "\n",
    "def put_disease_data(cell):\n",
    "    \"\"\"Replace each disease ID's value with its description from disease_dict.\"\"\"\n",
    "    if pd.isna(cell):\n",
    "        return cell  # Leave missing values as they are\n",
    "    # Strict lookup: raises KeyError if an ID is missing from disease_dict\n",
    "    return {key: disease_dict[key] for key in cell}\n",
    "\n",
    "# Apply the transformation to each column\n",
    "for col in cols_to_edit:\n",
    "    network_info[col] = network_info[col].apply(put_disease_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "id": "05257651-5f54-4d05-aa23-b04c1a3f85f7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Entry</th>\n",
       "      <th>Name</th>\n",
       "      <th>Definition</th>\n",
       "      <th>Expanded</th>\n",
       "      <th>Pathway</th>\n",
       "      <th>Class</th>\n",
       "      <th>Disease</th>\n",
       "      <th>Gene</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>N00002</td>\n",
       "      <td>BCR-ABL fusion kinase to RAS-ERK signaling pat...</td>\n",
       "      <td>BCR-ABL -&gt; GRB2 -&gt; SOS -&gt; RAS -&gt; RAF -&gt; MEK -&gt;...</td>\n",
       "      <td>(25v1,25v2) -&gt; 2885 -&gt; (6654,6655) -&gt; (3265,38...</td>\n",
       "      <td>{'hsa05220': 'Chronic myeloid leukemia'}</td>\n",
       "      <td>{'nt06276': 'Chronic myeloid leukemia', 'nt062...</td>\n",
       "      <td>{'H00004': 'DESCRIPTION Chronic myeloid leukem...</td>\n",
       "      <td>{'25': 'ABL1; ABL proto-oncogene 1, non-recept...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>N00003</td>\n",
       "      <td>Mutation-activated KIT to RAS-ERK signaling pa...</td>\n",
       "      <td>KIT* -&gt; GRB2 -&gt; SOS -&gt; RAS -&gt; RAF -&gt; MEK -&gt; ERK</td>\n",
       "      <td>3815v1 -&gt; 2885 -&gt; (6654,6655) -&gt; (3265,3845,48...</td>\n",
       "      <td>{'hsa05221': 'Acute myeloid leukemia'}</td>\n",
       "      <td>{'nt06275': 'Acute myeloid leukemia', 'nt06210...</td>\n",
       "      <td>{'H00003': 'DESCRIPTION Acute myeloid leukemia...</td>\n",
       "      <td>{'3815': 'KIT; KIT proto-oncogene receptor tyr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>N00004</td>\n",
       "      <td>Duplication or mutation-activated FLT3 to RAS-...</td>\n",
       "      <td>FLT3* -&gt; GRB2 -&gt; SOS -&gt; RAS -&gt; RAF -&gt; MEK -&gt; ERK</td>\n",
       "      <td>(2322v2,2322v1) -&gt; 2885 -&gt; (6654,6655) -&gt; (326...</td>\n",
       "      <td>{'hsa05221': 'Acute myeloid leukemia'}</td>\n",
       "      <td>{'nt06275': 'Acute myeloid leukemia', 'nt06210...</td>\n",
       "      <td>{'H00003': 'DESCRIPTION Acute myeloid leukemia...</td>\n",
       "      <td>{'2322': 'FLT3; fms related tyrosine kinase 3'...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>N00005</td>\n",
       "      <td>Mutation-activated MET to RAS-ERK signaling pa...</td>\n",
       "      <td>MET* -&gt; GRB2 -&gt; SOS -&gt; RAS -&gt; RAF -&gt; MEK -&gt; ER...</td>\n",
       "      <td>4233v1 -&gt; 2885 -&gt; (6654,6655) -&gt; (3265,3845,48...</td>\n",
       "      <td>{'hsa05225': 'Hepatocellular carcinoma', 'hsa0...</td>\n",
       "      <td>{'nt06263': 'Hepatocellular carcinoma', 'nt062...</td>\n",
       "      <td>{'H00048': 'DESCRIPTION Hepatocellular carcino...</td>\n",
       "      <td>{'4233': 'MET; MET proto-oncogene, receptor ty...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>N00007</td>\n",
       "      <td>EML4-ALK fusion kinase to RAS-ERK signaling pa...</td>\n",
       "      <td>EML4-ALK -&gt; RAS -&gt; RAF -&gt; MEK -&gt; ERK -&gt; CCND1</td>\n",
       "      <td>(238v1,238v2) -&gt; (3265,3845,4893) -&gt; (369,673,...</td>\n",
       "      <td>{'hsa05223': 'Non-small cell lung cancer'}</td>\n",
       "      <td>{'nt06266': 'Non-small cell lung cancer', 'nt0...</td>\n",
       "      <td>{'H00014': 'DESCRIPTION Lung cancer is a leadi...</td>\n",
       "      <td>{'238': 'ALK; ALK receptor tyrosine kinase', '...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>168</th>\n",
       "      <td>N01422</td>\n",
       "      <td>HPRT1 deficiency in purine salvage pathway</td>\n",
       "      <td>(Hypoxanthine,Guanine) // HPRT1*</td>\n",
       "      <td>(C00262,C00242) // 3251v1</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>{'nt06027': 'Purine salvage pathway'}</td>\n",
       "      <td>{'H00194': 'DESCRIPTION Deficiency of hypoxant...</td>\n",
       "      <td>{'3251': 'HPRT1; hypoxanthine phosphoribosyltr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>169</th>\n",
       "      <td>N01444</td>\n",
       "      <td>NXN mutation to WNT5A-ROR signaling pathway</td>\n",
       "      <td>NXN* -| DVL</td>\n",
       "      <td>64359v1 -| (1855,1856,1857)</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>{'nt06505': 'WNT signaling'}</td>\n",
       "      <td>{'H00485': 'DESCRIPTION Robinow syndrome (RS) ...</td>\n",
       "      <td>{'64359': 'NXN; nucleoredoxin', '1855': 'DVL1;...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>170</th>\n",
       "      <td>N01809</td>\n",
       "      <td>Mutation-caused epigenetic silencing of MMACHC</td>\n",
       "      <td>PRDX1* =| MMACHC</td>\n",
       "      <td>5052v1 =| 25974</td>\n",
       "      <td>{'hsa04980': 'Cobalamin transport and metaboli...</td>\n",
       "      <td>{'nt06538': 'Cobalamin transport and metabolism'}</td>\n",
       "      <td>{'H02221': 'DESCRIPTION Methylmalonic aciduria...</td>\n",
       "      <td>{'5052': 'PRDX1; peroxiredoxin 1', '25974': 'M...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>171</th>\n",
       "      <td>N01873</td>\n",
       "      <td>VHL mutation to HIF-2 signaling pathway</td>\n",
       "      <td>(VHL*+RBX1+ELOC+ELOB+CUL2) // EPAS1 == ARNT =&gt;...</td>\n",
       "      <td>(7428v3+9978+6921+6923+8453) // 2034 == 405 =&gt;...</td>\n",
       "      <td>{'hsa05211': 'Renal cell carcinoma'}</td>\n",
       "      <td>{'nt06542': 'HIF signaling'}</td>\n",
       "      <td>{'H00021': 'DESCRIPTION Renal cell cancer (RCC...</td>\n",
       "      <td>{'7428': 'VHL; von Hippel-Lindau tumor suppres...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>172</th>\n",
       "      <td>N01877</td>\n",
       "      <td>ERBB4 mutation to GF-RTK-PI3K signaling pathway</td>\n",
       "      <td>NRG // ERBB4*</td>\n",
       "      <td>(3084,9542,10718,145957) // 2066v1</td>\n",
       "      <td>{'hsa04012': 'ErbB signaling pathway'}</td>\n",
       "      <td>{'nt06543': 'NRG-ERBB signaling'}</td>\n",
       "      <td>{'H00058': 'DESCRIPTION Amyotrophic lateral sc...</td>\n",
       "      <td>{'3084': 'NRG1; neuregulin 1', '9542': 'NRG2; ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>173 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      Entry                                               Name  \\\n",
       "0    N00002  BCR-ABL fusion kinase to RAS-ERK signaling pat...   \n",
       "1    N00003  Mutation-activated KIT to RAS-ERK signaling pa...   \n",
       "2    N00004  Duplication or mutation-activated FLT3 to RAS-...   \n",
       "3    N00005  Mutation-activated MET to RAS-ERK signaling pa...   \n",
       "4    N00007  EML4-ALK fusion kinase to RAS-ERK signaling pa...   \n",
       "..      ...                                                ...   \n",
       "168  N01422         HPRT1 deficiency in purine salvage pathway   \n",
       "169  N01444        NXN mutation to WNT5A-ROR signaling pathway   \n",
       "170  N01809     Mutation-caused epigenetic silencing of MMACHC   \n",
       "171  N01873            VHL mutation to HIF-2 signaling pathway   \n",
       "172  N01877    ERBB4 mutation to GF-RTK-PI3K signaling pathway   \n",
       "\n",
       "                                            Definition  \\\n",
       "0    BCR-ABL -> GRB2 -> SOS -> RAS -> RAF -> MEK ->...   \n",
       "1      KIT* -> GRB2 -> SOS -> RAS -> RAF -> MEK -> ERK   \n",
       "2     FLT3* -> GRB2 -> SOS -> RAS -> RAF -> MEK -> ERK   \n",
       "3    MET* -> GRB2 -> SOS -> RAS -> RAF -> MEK -> ER...   \n",
       "4        EML4-ALK -> RAS -> RAF -> MEK -> ERK -> CCND1   \n",
       "..                                                 ...   \n",
       "168                   (Hypoxanthine,Guanine) // HPRT1*   \n",
       "169                                        NXN* -| DVL   \n",
       "170                                   PRDX1* =| MMACHC   \n",
       "171  (VHL*+RBX1+ELOC+ELOB+CUL2) // EPAS1 == ARNT =>...   \n",
       "172                                      NRG // ERBB4*   \n",
       "\n",
       "                                              Expanded  \\\n",
       "0    (25v1,25v2) -> 2885 -> (6654,6655) -> (3265,38...   \n",
       "1    3815v1 -> 2885 -> (6654,6655) -> (3265,3845,48...   \n",
       "2    (2322v2,2322v1) -> 2885 -> (6654,6655) -> (326...   \n",
       "3    4233v1 -> 2885 -> (6654,6655) -> (3265,3845,48...   \n",
       "4    (238v1,238v2) -> (3265,3845,4893) -> (369,673,...   \n",
       "..                                                 ...   \n",
       "168                          (C00262,C00242) // 3251v1   \n",
       "169                        64359v1 -| (1855,1856,1857)   \n",
       "170                                    5052v1 =| 25974   \n",
       "171  (7428v3+9978+6921+6923+8453) // 2034 == 405 =>...   \n",
       "172                 (3084,9542,10718,145957) // 2066v1   \n",
       "\n",
       "                                               Pathway  \\\n",
       "0             {'hsa05220': 'Chronic myeloid leukemia'}   \n",
       "1               {'hsa05221': 'Acute myeloid leukemia'}   \n",
       "2               {'hsa05221': 'Acute myeloid leukemia'}   \n",
       "3    {'hsa05225': 'Hepatocellular carcinoma', 'hsa0...   \n",
       "4           {'hsa05223': 'Non-small cell lung cancer'}   \n",
       "..                                                 ...   \n",
       "168                                               <NA>   \n",
       "169                                               <NA>   \n",
       "170  {'hsa04980': 'Cobalamin transport and metaboli...   \n",
       "171               {'hsa05211': 'Renal cell carcinoma'}   \n",
       "172             {'hsa04012': 'ErbB signaling pathway'}   \n",
       "\n",
       "                                                 Class  \\\n",
       "0    {'nt06276': 'Chronic myeloid leukemia', 'nt062...   \n",
       "1    {'nt06275': 'Acute myeloid leukemia', 'nt06210...   \n",
       "2    {'nt06275': 'Acute myeloid leukemia', 'nt06210...   \n",
       "3    {'nt06263': 'Hepatocellular carcinoma', 'nt062...   \n",
       "4    {'nt06266': 'Non-small cell lung cancer', 'nt0...   \n",
       "..                                                 ...   \n",
       "168              {'nt06027': 'Purine salvage pathway'}   \n",
       "169                       {'nt06505': 'WNT signaling'}   \n",
       "170  {'nt06538': 'Cobalamin transport and metabolism'}   \n",
       "171                       {'nt06542': 'HIF signaling'}   \n",
       "172                  {'nt06543': 'NRG-ERBB signaling'}   \n",
       "\n",
       "                                               Disease  \\\n",
       "0    {'H00004': 'DESCRIPTION Chronic myeloid leukem...   \n",
       "1    {'H00003': 'DESCRIPTION Acute myeloid leukemia...   \n",
       "2    {'H00003': 'DESCRIPTION Acute myeloid leukemia...   \n",
       "3    {'H00048': 'DESCRIPTION Hepatocellular carcino...   \n",
       "4    {'H00014': 'DESCRIPTION Lung cancer is a leadi...   \n",
       "..                                                 ...   \n",
       "168  {'H00194': 'DESCRIPTION Deficiency of hypoxant...   \n",
       "169  {'H00485': 'DESCRIPTION Robinow syndrome (RS) ...   \n",
       "170  {'H02221': 'DESCRIPTION Methylmalonic aciduria...   \n",
       "171  {'H00021': 'DESCRIPTION Renal cell cancer (RCC...   \n",
       "172  {'H00058': 'DESCRIPTION Amyotrophic lateral sc...   \n",
       "\n",
       "                                                  Gene  \n",
       "0    {'25': 'ABL1; ABL proto-oncogene 1, non-recept...  \n",
       "1    {'3815': 'KIT; KIT proto-oncogene receptor tyr...  \n",
       "2    {'2322': 'FLT3; fms related tyrosine kinase 3'...  \n",
       "3    {'4233': 'MET; MET proto-oncogene, receptor ty...  \n",
       "4    {'238': 'ALK; ALK receptor tyrosine kinase', '...  \n",
       "..                                                 ...  \n",
       "168  {'3251': 'HPRT1; hypoxanthine phosphoribosyltr...  \n",
       "169  {'64359': 'NXN; nucleoredoxin', '1855': 'DVL1;...  \n",
       "170  {'5052': 'PRDX1; peroxiredoxin 1', '25974': 'M...  \n",
       "171  {'7428': 'VHL; von Hippel-Lindau tumor suppres...  \n",
       "172  {'3084': 'NRG1; neuregulin 1', '9542': 'NRG2; ...  \n",
       "\n",
       "[173 rows x 8 columns]"
      ]
     },
     "execution_count": 101,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "network_info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "id": "0cad7b6f-d863-49f9-b0a2-644da8beb947",
   "metadata": {},
   "outputs": [],
   "source": [
    "network_info.to_csv(\"network_variant_final_info.tsv\",sep='\\t', header=True, index=False)"
   ]
  },
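  {
   "cell_type": "code",
   "execution_count": 105,
   "id": "3a556f82-3468-44eb-be31-5e9bedf59c70",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Strip every literal 'DESCRIPTION ' token (it prefixes each Disease description).\n",
    "# -i.bak works with both GNU and BSD/macOS sed and leaves a .bak backup; the bare -i '' form is BSD-only.\n",
    "!sed -i.bak 's/DESCRIPTION //g' network_variant_final_info.tsv"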
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a34eb400-5a7d-41c2-b2be-5bb9a3febf57",
   "metadata": {},
   "source": [
    "# Final Merge of Variant Data with Network Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "id": "83d484dd-69d7-4e50-9454-9369223f1dd2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the per-variant table and the cleaned per-network annotation table\n",
    "variant_data = pd.read_csv(\"variant_data_together_wo_nt.tsv\", sep='\\t')\n",
    "network_info = pd.read_csv(\"network_variant_final_info.tsv\", sep='\\t')\n",
    "# Rename so the join key is 'Network' in both tables and the network fields are unambiguous after merging\n",
    "network_info = network_info.rename(columns={\"Entry\": \"Network\", \"Definition\": \"Network Definition\", \"Expanded\": \"Network Expanded\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "id": "63f214f1-e32a-4275-a037-554fd89409aa",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Network</th>\n",
       "      <th>ENTRY</th>\n",
       "      <th>Source</th>\n",
       "      <th>ID</th>\n",
       "      <th>TranscriptID</th>\n",
       "      <th>NucChange</th>\n",
       "      <th>Chr</th>\n",
       "      <th>Start</th>\n",
       "      <th>End</th>\n",
       "      <th>RefAllele</th>\n",
       "      <th>AltAllele</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>N00073</td>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16929</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>N00073</td>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs104894340</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>N00073</td>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs104894340</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>N00073</td>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16928</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>N00073</td>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs11547328</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1506</th>\n",
       "      <td>N00244</td>\n",
       "      <td>9817v1</td>\n",
       "      <td>COSM</td>\n",
       "      <td>6196635</td>\n",
       "      <td>ENST00000393623.6</td>\n",
       "      <td>c.706G&gt;T</td>\n",
       "      <td>19</td>\n",
       "      <td>10492196</td>\n",
       "      <td>10492196</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1507</th>\n",
       "      <td>N00244</td>\n",
       "      <td>9817v1</td>\n",
       "      <td>COSM</td>\n",
       "      <td>6196637</td>\n",
       "      <td>ENST00000393623.6</td>\n",
       "      <td>c.548A&gt;G</td>\n",
       "      <td>19</td>\n",
       "      <td>10499486</td>\n",
       "      <td>10499486</td>\n",
       "      <td>T</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1508</th>\n",
       "      <td>N00258</td>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "      <td>ENST00000621016.4</td>\n",
       "      <td>c.662A&gt;G</td>\n",
       "      <td>16</td>\n",
       "      <td>68808823</td>\n",
       "      <td>68808823</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1509</th>\n",
       "      <td>N00258</td>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766211</td>\n",
       "      <td>ENST00000621016.4</td>\n",
       "      <td>c.755T&gt;G</td>\n",
       "      <td>16</td>\n",
       "      <td>68810264</td>\n",
       "      <td>68810264</td>\n",
       "      <td>T</td>\n",
       "      <td>G</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1510</th>\n",
       "      <td>N00258</td>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1379150</td>\n",
       "      <td>ENST00000621016.4</td>\n",
       "      <td>c.769G&gt;A</td>\n",
       "      <td>16</td>\n",
       "      <td>68810278</td>\n",
       "      <td>68810278</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1511 rows × 11 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     Network   ENTRY   Source           ID       TranscriptID NucChange  Chr  \\\n",
       "0     N00073  1019v2  ClinVar        16929       NC_000012.12       NaN   12   \n",
       "1     N00073  1019v2    dbSNP  rs104894340       NC_000012.12       NaN   12   \n",
       "2     N00073  1019v2    dbSNP  rs104894340       NC_000012.12       NaN   12   \n",
       "3     N00073  1019v2  ClinVar        16928       NC_000012.12       NaN   12   \n",
       "4     N00073  1019v2    dbSNP   rs11547328       NC_000012.12       NaN   12   \n",
       "...      ...     ...      ...          ...                ...       ...  ...   \n",
       "1506  N00244  9817v1     COSM      6196635  ENST00000393623.6  c.706G>T   19   \n",
       "1507  N00244  9817v1     COSM      6196637  ENST00000393623.6  c.548A>G   19   \n",
       "1508  N00258   999v2     COSM      4766271  ENST00000621016.4  c.662A>G   16   \n",
       "1509  N00258   999v2     COSM      4766211  ENST00000621016.4  c.755T>G   16   \n",
       "1510  N00258   999v2     COSM      1379150  ENST00000621016.4  c.769G>A   16   \n",
       "\n",
       "         Start       End RefAllele AltAllele  \n",
       "0     57751646  57751646         C         T  \n",
       "1     57751646  57751646         C         A  \n",
       "2     57751646  57751646         C         G  \n",
       "3     57751647  57751647         G         A  \n",
       "4     57751647  57751647         G         C  \n",
       "...        ...       ...       ...       ...  \n",
       "1506  10492196  10492196         C         A  \n",
       "1507  10499486  10499486         T         C  \n",
       "1508  68808823  68808823         A         G  \n",
       "1509  68810264  68810264         T         G  \n",
       "1510  68810278  68810278         G         A  \n",
       "\n",
       "[1511 rows x 11 columns]"
      ]
     },
     "execution_count": 118,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variant_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "id": "a681f1fb-b921-4ec3-b9cb-43df32fe9ef8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Network</th>\n",
       "      <th>Name</th>\n",
       "      <th>Network Definition</th>\n",
       "      <th>Network Expanded</th>\n",
       "      <th>Pathway</th>\n",
       "      <th>Class</th>\n",
       "      <th>Disease</th>\n",
       "      <th>Gene</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>N00002</td>\n",
       "      <td>BCR-ABL fusion kinase to RAS-ERK signaling pat...</td>\n",
       "      <td>BCR-ABL -&gt; GRB2 -&gt; SOS -&gt; RAS -&gt; RAF -&gt; MEK -&gt;...</td>\n",
       "      <td>(25v1,25v2) -&gt; 2885 -&gt; (6654,6655) -&gt; (3265,38...</td>\n",
       "      <td>{'hsa05220': 'Chronic myeloid leukemia'}</td>\n",
       "      <td>{'nt06276': 'Chronic myeloid leukemia', 'nt062...</td>\n",
       "      <td>{'H00004': 'Chronic myeloid leukemia (CML) is ...</td>\n",
       "      <td>{'25': 'ABL1; ABL proto-oncogene 1, non-recept...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>N00003</td>\n",
       "      <td>Mutation-activated KIT to RAS-ERK signaling pa...</td>\n",
       "      <td>KIT* -&gt; GRB2 -&gt; SOS -&gt; RAS -&gt; RAF -&gt; MEK -&gt; ERK</td>\n",
       "      <td>3815v1 -&gt; 2885 -&gt; (6654,6655) -&gt; (3265,3845,48...</td>\n",
       "      <td>{'hsa05221': 'Acute myeloid leukemia'}</td>\n",
       "      <td>{'nt06275': 'Acute myeloid leukemia', 'nt06210...</td>\n",
       "      <td>{'H00003': 'Acute myeloid leukemia (AML) is a ...</td>\n",
       "      <td>{'3815': 'KIT; KIT proto-oncogene receptor tyr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>N00004</td>\n",
       "      <td>Duplication or mutation-activated FLT3 to RAS-...</td>\n",
       "      <td>FLT3* -&gt; GRB2 -&gt; SOS -&gt; RAS -&gt; RAF -&gt; MEK -&gt; ERK</td>\n",
       "      <td>(2322v2,2322v1) -&gt; 2885 -&gt; (6654,6655) -&gt; (326...</td>\n",
       "      <td>{'hsa05221': 'Acute myeloid leukemia'}</td>\n",
       "      <td>{'nt06275': 'Acute myeloid leukemia', 'nt06210...</td>\n",
       "      <td>{'H00003': 'Acute myeloid leukemia (AML) is a ...</td>\n",
       "      <td>{'2322': 'FLT3; fms related tyrosine kinase 3'...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>N00005</td>\n",
       "      <td>Mutation-activated MET to RAS-ERK signaling pa...</td>\n",
       "      <td>MET* -&gt; GRB2 -&gt; SOS -&gt; RAS -&gt; RAF -&gt; MEK -&gt; ER...</td>\n",
       "      <td>4233v1 -&gt; 2885 -&gt; (6654,6655) -&gt; (3265,3845,48...</td>\n",
       "      <td>{'hsa05225': 'Hepatocellular carcinoma', 'hsa0...</td>\n",
       "      <td>{'nt06263': 'Hepatocellular carcinoma', 'nt062...</td>\n",
       "      <td>{'H00048': 'Hepatocellular carcinoma (HCC) is ...</td>\n",
       "      <td>{'4233': 'MET; MET proto-oncogene, receptor ty...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>N00007</td>\n",
       "      <td>EML4-ALK fusion kinase to RAS-ERK signaling pa...</td>\n",
       "      <td>EML4-ALK -&gt; RAS -&gt; RAF -&gt; MEK -&gt; ERK -&gt; CCND1</td>\n",
       "      <td>(238v1,238v2) -&gt; (3265,3845,4893) -&gt; (369,673,...</td>\n",
       "      <td>{'hsa05223': 'Non-small cell lung cancer'}</td>\n",
       "      <td>{'nt06266': 'Non-small cell lung cancer', 'nt0...</td>\n",
       "      <td>{'H00014': 'Lung cancer is a leading cause of ...</td>\n",
       "      <td>{'238': 'ALK; ALK receptor tyrosine kinase', '...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>168</th>\n",
       "      <td>N01422</td>\n",
       "      <td>HPRT1 deficiency in purine salvage pathway</td>\n",
       "      <td>(Hypoxanthine,Guanine) // HPRT1*</td>\n",
       "      <td>(C00262,C00242) // 3251v1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{'nt06027': 'Purine salvage pathway'}</td>\n",
       "      <td>{'H00194': 'Deficiency of hypoxanthine-guanine...</td>\n",
       "      <td>{'3251': 'HPRT1; hypoxanthine phosphoribosyltr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>169</th>\n",
       "      <td>N01444</td>\n",
       "      <td>NXN mutation to WNT5A-ROR signaling pathway</td>\n",
       "      <td>NXN* -| DVL</td>\n",
       "      <td>64359v1 -| (1855,1856,1857)</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{'nt06505': 'WNT signaling'}</td>\n",
       "      <td>{'H00485': 'Robinow syndrome (RS) is a rare ge...</td>\n",
       "      <td>{'64359': 'NXN; nucleoredoxin', '1855': 'DVL1;...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>170</th>\n",
       "      <td>N01809</td>\n",
       "      <td>Mutation-caused epigenetic silencing of MMACHC</td>\n",
       "      <td>PRDX1* =| MMACHC</td>\n",
       "      <td>5052v1 =| 25974</td>\n",
       "      <td>{'hsa04980': 'Cobalamin transport and metaboli...</td>\n",
       "      <td>{'nt06538': 'Cobalamin transport and metabolism'}</td>\n",
       "      <td>{'H02221': 'Methylmalonic aciduria and homocys...</td>\n",
       "      <td>{'5052': 'PRDX1; peroxiredoxin 1', '25974': 'M...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>171</th>\n",
       "      <td>N01873</td>\n",
       "      <td>VHL mutation to HIF-2 signaling pathway</td>\n",
       "      <td>(VHL*+RBX1+ELOC+ELOB+CUL2) // EPAS1 == ARNT =&gt;...</td>\n",
       "      <td>(7428v3+9978+6921+6923+8453) // 2034 == 405 =&gt;...</td>\n",
       "      <td>{'hsa05211': 'Renal cell carcinoma'}</td>\n",
       "      <td>{'nt06542': 'HIF signaling'}</td>\n",
       "      <td>{'H00021': 'Renal cell cancer (RCC) accounts f...</td>\n",
       "      <td>{'7428': 'VHL; von Hippel-Lindau tumor suppres...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>172</th>\n",
       "      <td>N01877</td>\n",
       "      <td>ERBB4 mutation to GF-RTK-PI3K signaling pathway</td>\n",
       "      <td>NRG // ERBB4*</td>\n",
       "      <td>(3084,9542,10718,145957) // 2066v1</td>\n",
       "      <td>{'hsa04012': 'ErbB signaling pathway'}</td>\n",
       "      <td>{'nt06543': 'NRG-ERBB signaling'}</td>\n",
       "      <td>{'H00058': 'Amyotrophic lateral sclerosis (ALS...</td>\n",
       "      <td>{'3084': 'NRG1; neuregulin 1', '9542': 'NRG2; ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>173 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    Network                                               Name  \\\n",
       "0    N00002  BCR-ABL fusion kinase to RAS-ERK signaling pat...   \n",
       "1    N00003  Mutation-activated KIT to RAS-ERK signaling pa...   \n",
       "2    N00004  Duplication or mutation-activated FLT3 to RAS-...   \n",
       "3    N00005  Mutation-activated MET to RAS-ERK signaling pa...   \n",
       "4    N00007  EML4-ALK fusion kinase to RAS-ERK signaling pa...   \n",
       "..      ...                                                ...   \n",
       "168  N01422         HPRT1 deficiency in purine salvage pathway   \n",
       "169  N01444        NXN mutation to WNT5A-ROR signaling pathway   \n",
       "170  N01809     Mutation-caused epigenetic silencing of MMACHC   \n",
       "171  N01873            VHL mutation to HIF-2 signaling pathway   \n",
       "172  N01877    ERBB4 mutation to GF-RTK-PI3K signaling pathway   \n",
       "\n",
       "                                    Network Definition  \\\n",
       "0    BCR-ABL -> GRB2 -> SOS -> RAS -> RAF -> MEK ->...   \n",
       "1      KIT* -> GRB2 -> SOS -> RAS -> RAF -> MEK -> ERK   \n",
       "2     FLT3* -> GRB2 -> SOS -> RAS -> RAF -> MEK -> ERK   \n",
       "3    MET* -> GRB2 -> SOS -> RAS -> RAF -> MEK -> ER...   \n",
       "4        EML4-ALK -> RAS -> RAF -> MEK -> ERK -> CCND1   \n",
       "..                                                 ...   \n",
       "168                   (Hypoxanthine,Guanine) // HPRT1*   \n",
       "169                                        NXN* -| DVL   \n",
       "170                                   PRDX1* =| MMACHC   \n",
       "171  (VHL*+RBX1+ELOC+ELOB+CUL2) // EPAS1 == ARNT =>...   \n",
       "172                                      NRG // ERBB4*   \n",
       "\n",
       "                                      Network Expanded  \\\n",
       "0    (25v1,25v2) -> 2885 -> (6654,6655) -> (3265,38...   \n",
       "1    3815v1 -> 2885 -> (6654,6655) -> (3265,3845,48...   \n",
       "2    (2322v2,2322v1) -> 2885 -> (6654,6655) -> (326...   \n",
       "3    4233v1 -> 2885 -> (6654,6655) -> (3265,3845,48...   \n",
       "4    (238v1,238v2) -> (3265,3845,4893) -> (369,673,...   \n",
       "..                                                 ...   \n",
       "168                          (C00262,C00242) // 3251v1   \n",
       "169                        64359v1 -| (1855,1856,1857)   \n",
       "170                                    5052v1 =| 25974   \n",
       "171  (7428v3+9978+6921+6923+8453) // 2034 == 405 =>...   \n",
       "172                 (3084,9542,10718,145957) // 2066v1   \n",
       "\n",
       "                                               Pathway  \\\n",
       "0             {'hsa05220': 'Chronic myeloid leukemia'}   \n",
       "1               {'hsa05221': 'Acute myeloid leukemia'}   \n",
       "2               {'hsa05221': 'Acute myeloid leukemia'}   \n",
       "3    {'hsa05225': 'Hepatocellular carcinoma', 'hsa0...   \n",
       "4           {'hsa05223': 'Non-small cell lung cancer'}   \n",
       "..                                                 ...   \n",
       "168                                                NaN   \n",
       "169                                                NaN   \n",
       "170  {'hsa04980': 'Cobalamin transport and metaboli...   \n",
       "171               {'hsa05211': 'Renal cell carcinoma'}   \n",
       "172             {'hsa04012': 'ErbB signaling pathway'}   \n",
       "\n",
       "                                                 Class  \\\n",
       "0    {'nt06276': 'Chronic myeloid leukemia', 'nt062...   \n",
       "1    {'nt06275': 'Acute myeloid leukemia', 'nt06210...   \n",
       "2    {'nt06275': 'Acute myeloid leukemia', 'nt06210...   \n",
       "3    {'nt06263': 'Hepatocellular carcinoma', 'nt062...   \n",
       "4    {'nt06266': 'Non-small cell lung cancer', 'nt0...   \n",
       "..                                                 ...   \n",
       "168              {'nt06027': 'Purine salvage pathway'}   \n",
       "169                       {'nt06505': 'WNT signaling'}   \n",
       "170  {'nt06538': 'Cobalamin transport and metabolism'}   \n",
       "171                       {'nt06542': 'HIF signaling'}   \n",
       "172                  {'nt06543': 'NRG-ERBB signaling'}   \n",
       "\n",
       "                                               Disease  \\\n",
       "0    {'H00004': 'Chronic myeloid leukemia (CML) is ...   \n",
       "1    {'H00003': 'Acute myeloid leukemia (AML) is a ...   \n",
       "2    {'H00003': 'Acute myeloid leukemia (AML) is a ...   \n",
       "3    {'H00048': 'Hepatocellular carcinoma (HCC) is ...   \n",
       "4    {'H00014': 'Lung cancer is a leading cause of ...   \n",
       "..                                                 ...   \n",
       "168  {'H00194': 'Deficiency of hypoxanthine-guanine...   \n",
       "169  {'H00485': 'Robinow syndrome (RS) is a rare ge...   \n",
       "170  {'H02221': 'Methylmalonic aciduria and homocys...   \n",
       "171  {'H00021': 'Renal cell cancer (RCC) accounts f...   \n",
       "172  {'H00058': 'Amyotrophic lateral sclerosis (ALS...   \n",
       "\n",
       "                                                  Gene  \n",
       "0    {'25': 'ABL1; ABL proto-oncogene 1, non-recept...  \n",
       "1    {'3815': 'KIT; KIT proto-oncogene receptor tyr...  \n",
       "2    {'2322': 'FLT3; fms related tyrosine kinase 3'...  \n",
       "3    {'4233': 'MET; MET proto-oncogene, receptor ty...  \n",
       "4    {'238': 'ALK; ALK receptor tyrosine kinase', '...  \n",
       "..                                                 ...  \n",
       "168  {'3251': 'HPRT1; hypoxanthine phosphoribosyltr...  \n",
       "169  {'64359': 'NXN; nucleoredoxin', '1855': 'DVL1;...  \n",
       "170  {'5052': 'PRDX1; peroxiredoxin 1', '25974': 'M...  \n",
       "171  {'7428': 'VHL; von Hippel-Lindau tumor suppres...  \n",
       "172  {'3084': 'NRG1; neuregulin 1', '9542': 'NRG2; ...  \n",
       "\n",
       "[173 rows x 8 columns]"
      ]
     },
     "execution_count": 119,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "network_info"
   ]
  },
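  {
   "cell_type": "code",
   "execution_count": 125,
   "id": "ff9b9542-754c-414a-82c9-4eb8409b19b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inner join (the pandas default, made explicit): each variant row gains the annotations of its network\n",
    "final_data = variant_data.merge(network_info, on='Network', how='inner')"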
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 126,
   "id": "5f1e15c8-49f3-4be8-9f4e-e3f73e30f01c",
   "metadata": {},
   "outputs": [],
   "source": [
    "final_data.to_csv(\"final_network_with_variant.tsv\",sep='\\t',header=True, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99191ac9-875f-4e5c-89d6-6382b29a9564",
   "metadata": {},
   "source": [
    "# Extracting Human Chromosomes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb75a991-a908-48f1-8615-3099ab06ac66",
   "metadata": {},
   "source": [
    "The human reference genome (GRCh38, NCBI assembly GCF_000001405.26) was downloaded from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "048f0501-7a19-4a8f-8a21-5e975f26135b",
   "metadata": {},
   "source": [
    "RefSeq accessions of the 21 chromosomes for which we have variants. These are the patterns that `seqkit grep` reads from `chromosomes.txt`:\n",
    "\n",
    "- NC_000001.11\n",
    "- NC_000002.12\n",
    "- NC_000003.12\n",
    "- NC_000004.12\n",
    "- NC_000005.10\n",
    "- NC_000006.12\n",
    "- NC_000007.14\n",
    "- NC_000009.12\n",
    "- NC_000010.11\n",
    "- NC_000011.10\n",
    "- NC_000012.12\n",
    "- NC_000013.11\n",
    "- NC_000014.9\n",
    "- NC_000015.10\n",
    "- NC_000016.10\n",
    "- NC_000017.11\n",
    "- NC_000018.10\n",
    "- NC_000019.10\n",
    "- NC_000020.11\n",
    "- NC_000021.9\n",
    "- NC_000023.11\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "3d870ed0-b14d-42a8-8d55-b58dc49367f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc0c7ced-c649-4695-bc11-9a7bfb87e128",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[INFO] 21 patterns loaded from file\n"
     ]
    }
   ],
   "source": [
    "!seqkit grep -r -n -f chromosomes.txt /ncbi_dataset/data/GCF_000001405.26/GCF_000001405.26_GRCh38_genomic.fna -o chromosomes.fasta"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ef62d80c-f572-4f4e-9443-e3653a178327",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "file               format  type  num_seqs        sum_len     min_len        avg_len      max_len\n",
      "chromosomes.fasta  FASTA   DNA         21  2,835,085,313  46,709,983  135,004,062.5  248,956,422\n"
     ]
    }
   ],
   "source": [
    "!seqkit stats chromosomes.fasta"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2452900e-93d3-4707-b2a3-0b94224de2cc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.7G\tchromosomes.fasta\n"
     ]
    }
   ],
   "source": [
    "!du -h chromosomes.fasta"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "460d091d-6f98-4c9e-95a3-137601780652",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NC_000001.11 Homo sapiens chromosome 1, GRCh38 Primary Assembly\n",
      "NC_000002.12 Homo sapiens chromosome 2, GRCh38 Primary Assembly\n",
      "NC_000003.12 Homo sapiens chromosome 3, GRCh38 Primary Assembly\n",
      "NC_000004.12 Homo sapiens chromosome 4, GRCh38 Primary Assembly\n",
      "NC_000005.10 Homo sapiens chromosome 5, GRCh38 Primary Assembly\n",
      "NC_000006.12 Homo sapiens chromosome 6, GRCh38 Primary Assembly\n",
      "NC_000007.14 Homo sapiens chromosome 7, GRCh38 Primary Assembly\n",
      "NC_000009.12 Homo sapiens chromosome 9, GRCh38 Primary Assembly\n",
      "NC_000010.11 Homo sapiens chromosome 10, GRCh38 Primary Assembly\n",
      "NC_000011.10 Homo sapiens chromosome 11, GRCh38 Primary Assembly\n",
      "NC_000012.12 Homo sapiens chromosome 12, GRCh38 Primary Assembly\n",
      "NC_000013.11 Homo sapiens chromosome 13, GRCh38 Primary Assembly\n",
      "NC_000014.9 Homo sapiens chromosome 14, GRCh38 Primary Assembly\n",
      "NC_000015.10 Homo sapiens chromosome 15, GRCh38 Primary Assembly\n",
      "NC_000016.10 Homo sapiens chromosome 16, GRCh38 Primary Assembly\n",
      "NC_000017.11 Homo sapiens chromosome 17, GRCh38 Primary Assembly\n",
      "NC_000018.10 Homo sapiens chromosome 18, GRCh38 Primary Assembly\n",
      "NC_000019.10 Homo sapiens chromosome 19, GRCh38 Primary Assembly\n",
      "NC_000020.11 Homo sapiens chromosome 20, GRCh38 Primary Assembly\n",
      "NC_000021.9 Homo sapiens chromosome 21, GRCh38 Primary Assembly\n",
      "NC_000023.11 Homo sapiens chromosome X, GRCh38 Primary Assembly\n"
     ]
    }
   ],
   "source": [
    "!seqkit fx2tab chromosomes.fasta | cut -f1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c923727a-9eae-407f-a85f-fb9317ccd3ce",
   "metadata": {},
   "source": [
    "# Creating the Nucleotide Variant Database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06b00f03-f71e-4575-80b2-9960be48dba8",
   "metadata": {},
   "outputs": [],
   "source": [
    "cd kegg_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "04b7f027-f8d2-451f-9d1e-b784708079cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio import SeqIO\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1ca8532d-3e55-494f-8712-a1ea56c2b96d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Var_ID</th>\n",
       "      <th>Network</th>\n",
       "      <th>ENTRY</th>\n",
       "      <th>Source</th>\n",
       "      <th>ID</th>\n",
       "      <th>TranscriptID</th>\n",
       "      <th>NucChange</th>\n",
       "      <th>Chr</th>\n",
       "      <th>Start</th>\n",
       "      <th>End</th>\n",
       "      <th>RefAllele</th>\n",
       "      <th>AltAllele</th>\n",
       "      <th>Name</th>\n",
       "      <th>Network Definition</th>\n",
       "      <th>Network Expanded</th>\n",
       "      <th>Pathway</th>\n",
       "      <th>Class</th>\n",
       "      <th>Disease</th>\n",
       "      <th>Gene</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>KEGG_1</td>\n",
       "      <td>N00073</td>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16929</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>Mutation-activated CDK4 to cell cycle G1/S</td>\n",
       "      <td>(CCND+CDK4*) -&gt; RB1 // E2F</td>\n",
       "      <td>((595,894,896)+1019v2) -&gt; 5925 // (1869,1870,1...</td>\n",
       "      <td>{'hsa05218': 'Melanoma'}</td>\n",
       "      <td>{'nt06268': 'Melanoma', 'nt06230': 'Cell cycle...</td>\n",
       "      <td>{'H00038': 'Melanoma is a form of skin cancer ...</td>\n",
       "      <td>{'595': 'CCND1; cyclin D1', '894': 'CCND2; cyc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>KEGG_2</td>\n",
       "      <td>N00073</td>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs104894340</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "      <td>Mutation-activated CDK4 to cell cycle G1/S</td>\n",
       "      <td>(CCND+CDK4*) -&gt; RB1 // E2F</td>\n",
       "      <td>((595,894,896)+1019v2) -&gt; 5925 // (1869,1870,1...</td>\n",
       "      <td>{'hsa05218': 'Melanoma'}</td>\n",
       "      <td>{'nt06268': 'Melanoma', 'nt06230': 'Cell cycle...</td>\n",
       "      <td>{'H00038': 'Melanoma is a form of skin cancer ...</td>\n",
       "      <td>{'595': 'CCND1; cyclin D1', '894': 'CCND2; cyc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>KEGG_3</td>\n",
       "      <td>N00073</td>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs104894340</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751646</td>\n",
       "      <td>57751646</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "      <td>Mutation-activated CDK4 to cell cycle G1/S</td>\n",
       "      <td>(CCND+CDK4*) -&gt; RB1 // E2F</td>\n",
       "      <td>((595,894,896)+1019v2) -&gt; 5925 // (1869,1870,1...</td>\n",
       "      <td>{'hsa05218': 'Melanoma'}</td>\n",
       "      <td>{'nt06268': 'Melanoma', 'nt06230': 'Cell cycle...</td>\n",
       "      <td>{'H00038': 'Melanoma is a form of skin cancer ...</td>\n",
       "      <td>{'595': 'CCND1; cyclin D1', '894': 'CCND2; cyc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>KEGG_4</td>\n",
       "      <td>N00073</td>\n",
       "      <td>1019v2</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>16928</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "      <td>Mutation-activated CDK4 to cell cycle G1/S</td>\n",
       "      <td>(CCND+CDK4*) -&gt; RB1 // E2F</td>\n",
       "      <td>((595,894,896)+1019v2) -&gt; 5925 // (1869,1870,1...</td>\n",
       "      <td>{'hsa05218': 'Melanoma'}</td>\n",
       "      <td>{'nt06268': 'Melanoma', 'nt06230': 'Cell cycle...</td>\n",
       "      <td>{'H00038': 'Melanoma is a form of skin cancer ...</td>\n",
       "      <td>{'595': 'CCND1; cyclin D1', '894': 'CCND2; cyc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>KEGG_5</td>\n",
       "      <td>N00073</td>\n",
       "      <td>1019v2</td>\n",
       "      <td>dbSNP</td>\n",
       "      <td>rs11547328</td>\n",
       "      <td>NC_000012.12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12</td>\n",
       "      <td>57751647</td>\n",
       "      <td>57751647</td>\n",
       "      <td>G</td>\n",
       "      <td>C</td>\n",
       "      <td>Mutation-activated CDK4 to cell cycle G1/S</td>\n",
       "      <td>(CCND+CDK4*) -&gt; RB1 // E2F</td>\n",
       "      <td>((595,894,896)+1019v2) -&gt; 5925 // (1869,1870,1...</td>\n",
       "      <td>{'hsa05218': 'Melanoma'}</td>\n",
       "      <td>{'nt06268': 'Melanoma', 'nt06230': 'Cell cycle...</td>\n",
       "      <td>{'H00038': 'Melanoma is a form of skin cancer ...</td>\n",
       "      <td>{'595': 'CCND1; cyclin D1', '894': 'CCND2; cyc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1444</th>\n",
       "      <td>KEGG_1445</td>\n",
       "      <td>N00244</td>\n",
       "      <td>9817v1</td>\n",
       "      <td>COSM</td>\n",
       "      <td>6196635</td>\n",
       "      <td>ENST00000393623.6</td>\n",
       "      <td>c.706G&gt;T</td>\n",
       "      <td>19</td>\n",
       "      <td>10492196</td>\n",
       "      <td>10492196</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "      <td>Mutation-inactivated KEAP1 to KEAP1-NRF2 signa...</td>\n",
       "      <td>KEAP1* // NRF2 =&gt; (HMOX1,NQO1,GST,TXNRD1)</td>\n",
       "      <td>9817v1 // 4780 =&gt; (3162,1728,119391,221357,293...</td>\n",
       "      <td>{'hsa05225': 'Hepatocellular carcinoma'}</td>\n",
       "      <td>{'nt06263': 'Hepatocellular carcinoma', 'nt062...</td>\n",
       "      <td>{'H00048': 'Hepatocellular carcinoma (HCC) is ...</td>\n",
       "      <td>{'9817': 'KEAP1; kelch like ECH associated pro...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1445</th>\n",
       "      <td>KEGG_1446</td>\n",
       "      <td>N00244</td>\n",
       "      <td>9817v1</td>\n",
       "      <td>COSM</td>\n",
       "      <td>6196637</td>\n",
       "      <td>ENST00000393623.6</td>\n",
       "      <td>c.548A&gt;G</td>\n",
       "      <td>19</td>\n",
       "      <td>10499486</td>\n",
       "      <td>10499486</td>\n",
       "      <td>T</td>\n",
       "      <td>C</td>\n",
       "      <td>Mutation-inactivated KEAP1 to KEAP1-NRF2 signa...</td>\n",
       "      <td>KEAP1* // NRF2 =&gt; (HMOX1,NQO1,GST,TXNRD1)</td>\n",
       "      <td>9817v1 // 4780 =&gt; (3162,1728,119391,221357,293...</td>\n",
       "      <td>{'hsa05225': 'Hepatocellular carcinoma'}</td>\n",
       "      <td>{'nt06263': 'Hepatocellular carcinoma', 'nt062...</td>\n",
       "      <td>{'H00048': 'Hepatocellular carcinoma (HCC) is ...</td>\n",
       "      <td>{'9817': 'KEAP1; kelch like ECH associated pro...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1446</th>\n",
       "      <td>KEGG_1447</td>\n",
       "      <td>N00258</td>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766271</td>\n",
       "      <td>ENST00000621016.4</td>\n",
       "      <td>c.662A&gt;G</td>\n",
       "      <td>16</td>\n",
       "      <td>68808823</td>\n",
       "      <td>68808823</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "      <td>Mutation-inactivated CDH1 to beta-catenin sign...</td>\n",
       "      <td>CDH1* // CTNNB1 -&gt; TCF/LEF =&gt; (MYC,CCND1)</td>\n",
       "      <td>999v2 // 1499 -&gt; (6932,83439,6934,51176) =&gt; (4...</td>\n",
       "      <td>{'hsa05226': 'Gastric cancer'}</td>\n",
       "      <td>{'nt06261': 'Gastric cancer', 'nt06215': 'WNT ...</td>\n",
       "      <td>{'H00018': \"Gastric cancer (GC) is one of the ...</td>\n",
       "      <td>{'999': 'CDH1; cadherin 1', '1499': 'CTNNB1; c...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1447</th>\n",
       "      <td>KEGG_1448</td>\n",
       "      <td>N00258</td>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>4766211</td>\n",
       "      <td>ENST00000621016.4</td>\n",
       "      <td>c.755T&gt;G</td>\n",
       "      <td>16</td>\n",
       "      <td>68810264</td>\n",
       "      <td>68810264</td>\n",
       "      <td>T</td>\n",
       "      <td>G</td>\n",
       "      <td>Mutation-inactivated CDH1 to beta-catenin sign...</td>\n",
       "      <td>CDH1* // CTNNB1 -&gt; TCF/LEF =&gt; (MYC,CCND1)</td>\n",
       "      <td>999v2 // 1499 -&gt; (6932,83439,6934,51176) =&gt; (4...</td>\n",
       "      <td>{'hsa05226': 'Gastric cancer'}</td>\n",
       "      <td>{'nt06261': 'Gastric cancer', 'nt06215': 'WNT ...</td>\n",
       "      <td>{'H00018': \"Gastric cancer (GC) is one of the ...</td>\n",
       "      <td>{'999': 'CDH1; cadherin 1', '1499': 'CTNNB1; c...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1448</th>\n",
       "      <td>KEGG_1449</td>\n",
       "      <td>N00258</td>\n",
       "      <td>999v2</td>\n",
       "      <td>COSM</td>\n",
       "      <td>1379150</td>\n",
       "      <td>ENST00000621016.4</td>\n",
       "      <td>c.769G&gt;A</td>\n",
       "      <td>16</td>\n",
       "      <td>68810278</td>\n",
       "      <td>68810278</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "      <td>Mutation-inactivated CDH1 to beta-catenin sign...</td>\n",
       "      <td>CDH1* // CTNNB1 -&gt; TCF/LEF =&gt; (MYC,CCND1)</td>\n",
       "      <td>999v2 // 1499 -&gt; (6932,83439,6934,51176) =&gt; (4...</td>\n",
       "      <td>{'hsa05226': 'Gastric cancer'}</td>\n",
       "      <td>{'nt06261': 'Gastric cancer', 'nt06215': 'WNT ...</td>\n",
       "      <td>{'H00018': \"Gastric cancer (GC) is one of the ...</td>\n",
       "      <td>{'999': 'CDH1; cadherin 1', '1499': 'CTNNB1; c...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1449 rows × 19 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         Var_ID Network   ENTRY   Source           ID       TranscriptID  \\\n",
       "0        KEGG_1  N00073  1019v2  ClinVar        16929       NC_000012.12   \n",
       "1        KEGG_2  N00073  1019v2    dbSNP  rs104894340       NC_000012.12   \n",
       "2        KEGG_3  N00073  1019v2    dbSNP  rs104894340       NC_000012.12   \n",
       "3        KEGG_4  N00073  1019v2  ClinVar        16928       NC_000012.12   \n",
       "4        KEGG_5  N00073  1019v2    dbSNP   rs11547328       NC_000012.12   \n",
       "...         ...     ...     ...      ...          ...                ...   \n",
       "1444  KEGG_1445  N00244  9817v1     COSM      6196635  ENST00000393623.6   \n",
       "1445  KEGG_1446  N00244  9817v1     COSM      6196637  ENST00000393623.6   \n",
       "1446  KEGG_1447  N00258   999v2     COSM      4766271  ENST00000621016.4   \n",
       "1447  KEGG_1448  N00258   999v2     COSM      4766211  ENST00000621016.4   \n",
       "1448  KEGG_1449  N00258   999v2     COSM      1379150  ENST00000621016.4   \n",
       "\n",
       "     NucChange  Chr     Start       End RefAllele AltAllele  \\\n",
       "0          NaN   12  57751646  57751646         C         T   \n",
       "1          NaN   12  57751646  57751646         C         A   \n",
       "2          NaN   12  57751646  57751646         C         G   \n",
       "3          NaN   12  57751647  57751647         G         A   \n",
       "4          NaN   12  57751647  57751647         G         C   \n",
       "...        ...  ...       ...       ...       ...       ...   \n",
       "1444  c.706G>T   19  10492196  10492196         C         A   \n",
       "1445  c.548A>G   19  10499486  10499486         T         C   \n",
       "1446  c.662A>G   16  68808823  68808823         A         G   \n",
       "1447  c.755T>G   16  68810264  68810264         T         G   \n",
       "1448  c.769G>A   16  68810278  68810278         G         A   \n",
       "\n",
       "                                                   Name  \\\n",
       "0            Mutation-activated CDK4 to cell cycle G1/S   \n",
       "1            Mutation-activated CDK4 to cell cycle G1/S   \n",
       "2            Mutation-activated CDK4 to cell cycle G1/S   \n",
       "3            Mutation-activated CDK4 to cell cycle G1/S   \n",
       "4            Mutation-activated CDK4 to cell cycle G1/S   \n",
       "...                                                 ...   \n",
       "1444  Mutation-inactivated KEAP1 to KEAP1-NRF2 signa...   \n",
       "1445  Mutation-inactivated KEAP1 to KEAP1-NRF2 signa...   \n",
       "1446  Mutation-inactivated CDH1 to beta-catenin sign...   \n",
       "1447  Mutation-inactivated CDH1 to beta-catenin sign...   \n",
       "1448  Mutation-inactivated CDH1 to beta-catenin sign...   \n",
       "\n",
       "                             Network Definition  \\\n",
       "0                    (CCND+CDK4*) -> RB1 // E2F   \n",
       "1                    (CCND+CDK4*) -> RB1 // E2F   \n",
       "2                    (CCND+CDK4*) -> RB1 // E2F   \n",
       "3                    (CCND+CDK4*) -> RB1 // E2F   \n",
       "4                    (CCND+CDK4*) -> RB1 // E2F   \n",
       "...                                         ...   \n",
       "1444  KEAP1* // NRF2 => (HMOX1,NQO1,GST,TXNRD1)   \n",
       "1445  KEAP1* // NRF2 => (HMOX1,NQO1,GST,TXNRD1)   \n",
       "1446  CDH1* // CTNNB1 -> TCF/LEF => (MYC,CCND1)   \n",
       "1447  CDH1* // CTNNB1 -> TCF/LEF => (MYC,CCND1)   \n",
       "1448  CDH1* // CTNNB1 -> TCF/LEF => (MYC,CCND1)   \n",
       "\n",
       "                                       Network Expanded  \\\n",
       "0     ((595,894,896)+1019v2) -> 5925 // (1869,1870,1...   \n",
       "1     ((595,894,896)+1019v2) -> 5925 // (1869,1870,1...   \n",
       "2     ((595,894,896)+1019v2) -> 5925 // (1869,1870,1...   \n",
       "3     ((595,894,896)+1019v2) -> 5925 // (1869,1870,1...   \n",
       "4     ((595,894,896)+1019v2) -> 5925 // (1869,1870,1...   \n",
       "...                                                 ...   \n",
       "1444  9817v1 // 4780 => (3162,1728,119391,221357,293...   \n",
       "1445  9817v1 // 4780 => (3162,1728,119391,221357,293...   \n",
       "1446  999v2 // 1499 -> (6932,83439,6934,51176) => (4...   \n",
       "1447  999v2 // 1499 -> (6932,83439,6934,51176) => (4...   \n",
       "1448  999v2 // 1499 -> (6932,83439,6934,51176) => (4...   \n",
       "\n",
       "                                       Pathway  \\\n",
       "0                     {'hsa05218': 'Melanoma'}   \n",
       "1                     {'hsa05218': 'Melanoma'}   \n",
       "2                     {'hsa05218': 'Melanoma'}   \n",
       "3                     {'hsa05218': 'Melanoma'}   \n",
       "4                     {'hsa05218': 'Melanoma'}   \n",
       "...                                        ...   \n",
       "1444  {'hsa05225': 'Hepatocellular carcinoma'}   \n",
       "1445  {'hsa05225': 'Hepatocellular carcinoma'}   \n",
       "1446            {'hsa05226': 'Gastric cancer'}   \n",
       "1447            {'hsa05226': 'Gastric cancer'}   \n",
       "1448            {'hsa05226': 'Gastric cancer'}   \n",
       "\n",
       "                                                  Class  \\\n",
       "0     {'nt06268': 'Melanoma', 'nt06230': 'Cell cycle...   \n",
       "1     {'nt06268': 'Melanoma', 'nt06230': 'Cell cycle...   \n",
       "2     {'nt06268': 'Melanoma', 'nt06230': 'Cell cycle...   \n",
       "3     {'nt06268': 'Melanoma', 'nt06230': 'Cell cycle...   \n",
       "4     {'nt06268': 'Melanoma', 'nt06230': 'Cell cycle...   \n",
       "...                                                 ...   \n",
       "1444  {'nt06263': 'Hepatocellular carcinoma', 'nt062...   \n",
       "1445  {'nt06263': 'Hepatocellular carcinoma', 'nt062...   \n",
       "1446  {'nt06261': 'Gastric cancer', 'nt06215': 'WNT ...   \n",
       "1447  {'nt06261': 'Gastric cancer', 'nt06215': 'WNT ...   \n",
       "1448  {'nt06261': 'Gastric cancer', 'nt06215': 'WNT ...   \n",
       "\n",
       "                                                Disease  \\\n",
       "0     {'H00038': 'Melanoma is a form of skin cancer ...   \n",
       "1     {'H00038': 'Melanoma is a form of skin cancer ...   \n",
       "2     {'H00038': 'Melanoma is a form of skin cancer ...   \n",
       "3     {'H00038': 'Melanoma is a form of skin cancer ...   \n",
       "4     {'H00038': 'Melanoma is a form of skin cancer ...   \n",
       "...                                                 ...   \n",
       "1444  {'H00048': 'Hepatocellular carcinoma (HCC) is ...   \n",
       "1445  {'H00048': 'Hepatocellular carcinoma (HCC) is ...   \n",
       "1446  {'H00018': \"Gastric cancer (GC) is one of the ...   \n",
       "1447  {'H00018': \"Gastric cancer (GC) is one of the ...   \n",
       "1448  {'H00018': \"Gastric cancer (GC) is one of the ...   \n",
       "\n",
       "                                                   Gene  \n",
       "0     {'595': 'CCND1; cyclin D1', '894': 'CCND2; cyc...  \n",
       "1     {'595': 'CCND1; cyclin D1', '894': 'CCND2; cyc...  \n",
       "2     {'595': 'CCND1; cyclin D1', '894': 'CCND2; cyc...  \n",
       "3     {'595': 'CCND1; cyclin D1', '894': 'CCND2; cyc...  \n",
       "4     {'595': 'CCND1; cyclin D1', '894': 'CCND2; cyc...  \n",
       "...                                                 ...  \n",
       "1444  {'9817': 'KEAP1; kelch like ECH associated pro...  \n",
       "1445  {'9817': 'KEAP1; kelch like ECH associated pro...  \n",
       "1446  {'999': 'CDH1; cadherin 1', '1499': 'CTNNB1; c...  \n",
       "1447  {'999': 'CDH1; cadherin 1', '1499': 'CTNNB1; c...  \n",
       "1448  {'999': 'CDH1; cadherin 1', '1499': 'CTNNB1; c...  \n",
       "\n",
       "[1449 rows x 19 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variant_data = pd.read_csv(\"final_network_with_variant.tsv\", sep='\\t')\n",
    "variant_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ae73bfae-91a9-40a9-bfdb-c14b1d3e14ea",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1449"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(variant_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "6e515f9d-b9a6-4a24-bde6-2496a823b9ba",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'N00073'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variant_data.iloc[1][\"Network\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "488f8ed2-2a5b-4831-a5b7-90f3e049614f",
   "metadata": {},
   "outputs": [],
   "source": [
    "fasta_file = \"chromosomes.fasta\"\n",
    "record_dict = SeqIO.to_dict(SeqIO.parse(fasta_file, \"fasta\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "6c04e6aa-d700-427c-a4ce-cba8225e3024",
   "metadata": {},
   "outputs": [],
   "source": [
    "chromosome_dictionary = {\n",
    "    \"1\": \"NC_000001.11\",\n",
    "    \"2\": \"NC_000002.12\",\n",
    "    \"3\": \"NC_000003.12\",\n",
    "    \"4\": \"NC_000004.12\",\n",
    "    \"5\": \"NC_000005.10\",\n",
    "    \"6\": \"NC_000006.12\",\n",
    "    \"7\": \"NC_000007.14\",\n",
    "    \"9\": \"NC_000009.12\",\n",
    "    \"10\": \"NC_000010.11\",\n",
    "    \"11\": \"NC_000011.10\",\n",
    "    \"12\": \"NC_000012.12\",\n",
    "    \"13\": \"NC_000013.11\",\n",
    "    \"14\": \"NC_000014.9\",\n",
    "    \"15\": \"NC_000015.10\",\n",
    "    \"16\": \"NC_000016.10\",\n",
    "    \"17\": \"NC_000017.11\",\n",
    "    \"18\": \"NC_000018.10\",\n",
    "    \"19\": \"NC_000019.10\",\n",
    "    \"20\": \"NC_000020.11\",\n",
    "    \"21\": \"NC_000021.9\",\n",
    "    \"23\": \"NC_000023.11\"\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a3550d7-04a4-44f3-a7d5-b61d30890ef0",
   "metadata": {},
   "source": [
    "### Verification that the reference is present at the exact position I have in my data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b593ef66-65e3-411a-ac95-a33c9d37667a",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"verification.txt\", \"w\") as f:\n",
    "    for i in range(len(variant_data)):\n",
    "        # ---- Input ----\n",
    "        chromosome_id = chromosome_dictionary[str(variant_data.iloc[i]['Chr'])]\n",
    "        if (variant_data.iloc[i]['TranscriptID'][:4] == \"ENST\"):\n",
    "            start = variant_data.iloc[i]['Start'] - 1\n",
    "        else:\n",
    "            start = variant_data.iloc[i]['Start']\n",
    "        reference_allele = variant_data.iloc[i]['RefAllele']\n",
    "        end = len(reference_allele) + start\n",
    "\n",
    "        chrom_seq = record_dict[chromosome_id].seq\n",
    "\n",
    "        # Adjust for 0-based indexing in Python\n",
    "        genomic_ref = chrom_seq[start: start + len(reference_allele)]\n",
    "\n",
    "        if genomic_ref.upper() != reference_allele.upper():\n",
    "            f.write(f\"⚠️ Warning: Entry number {i} with variant {variant_data.iloc[i]['ID']} expected '{reference_allele}', but found '{genomic_ref}'\\n\")\n",
    "        else:\n",
    "            f.write(f\"✅ Verified: {chromosome_id}:{start}-{end} → '{reference_allele}' matches genome\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "02044565-4f9c-45f9-b59a-63590b571dd1",
   "metadata": {},
   "outputs": [],
   "source": [
    "mkdir nt_seq"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "361619c9-7b49-45dd-901a-625cf1642535",
   "metadata": {},
   "source": [
    "### Performing the mutation and saving the reference and variant allele with a 1000 nt window"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "96392c0b-c3fd-49ee-a2c1-97cef4127617",
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in range(len(variant_data)):\n",
    "    with open(f\"nt_seq/{variant_data.iloc[i]['Var_ID']}.txt\", \"w\") as f:\n",
    "        # ---- Input ----\n",
    "        chromosome_id = chromosome_dictionary[str(variant_data.iloc[i]['Chr'])]\n",
    "        if (variant_data.iloc[i]['TranscriptID'][:4] == \"ENST\"):\n",
    "            start = variant_data.iloc[i]['Start'] - 1\n",
    "        else:\n",
    "            start = variant_data.iloc[i]['Start']\n",
    "        reference_allele = variant_data.iloc[i]['RefAllele']\n",
    "        variant_allele = variant_data.iloc[i]['AltAllele']\n",
    "\n",
    "        end = len(reference_allele) + start\n",
    "        window = 1000\n",
    "        \n",
    "        chrom_seq = record_dict[chromosome_id].seq\n",
    "\n",
    "        # Extract region\n",
    "        region_start = max(0, start - window)\n",
    "        region_end = end + window\n",
    "\n",
    "        ref_seq = chrom_seq[region_start:region_end]\n",
    "    \n",
    "        if (variant_allele == \"deletion\"):\n",
    "            # Apply mutation\n",
    "            mutated_seq = ref_seq[:window] + variant_allele + ref_seq[window + len(reference_allele):]\n",
    "    \n",
    "            f.write(f\">{variant_data.iloc[i]['ID']}_reference_{reference_allele}\\n\")\n",
    "            f.write(f\"{ref_seq}\\n\")\n",
    "            f.write(f\">{variant_data.iloc[i]['ID']}_variant_{variant_allele}\\n\")\n",
    "            f.write(f\"{mutated_seq}\\n\")\n",
    "        else:\n",
    "            del_len = len(reference_allele)\n",
    "            # Apply mutation\n",
    "            mutated_seq = ref_seq[:window] + ref_seq[window + del_len:]\n",
    "    \n",
    "            f.write(f\">{variant_data.iloc[i]['ID']}_reference_{reference_allele}\\n\")\n",
    "            f.write(f\"{ref_seq}\\n\")\n",
    "            f.write(f\">{variant_data.iloc[i]['ID']}_variant_{variant_allele}\\n\")\n",
    "            f.write(f\"{mutated_seq}\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e06b86fd-2d31-486e-82ed-80dbb7f3b627",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}