{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0103d07d",
   "metadata": {},
   "source": [
    "# KEGG Data Processing Pipeline - Part 2: Variant Information Parsing and Sequence Generation\n",
    "\n",
    "## Overview\n",
    "\n",
    "This notebook is the second part of the KEGG data processing pipeline. It focuses on parsing variant information from KEGG data, generating nucleotide sequences with mutations, and creating disease mapping databases.\n",
    "\n",
    "## What This Notebook Does\n",
    "\n",
    "1. **Variant Information Parsing**: Extracts detailed information from KEGG variant files\n",
    "2. **Sequence Generation**: Creates reference and variant nucleotide sequences with genomic context\n",
    "3. **Disease Mapping**: Downloads and processes KEGG disease information\n",
    "4. **Data Integration**: Merges variant data with genomic sequences and disease annotations\n",
    "5. **Quality Control**: Validates reference sequences against the genome\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "**Required from Part 1 (KEGG_Data_1.ipynb):**\n",
    "- `gene_variants.txt` - List of variant identifiers\n",
    "- `variant_info/` directory - Individual variant information files\n",
    "- `final_network_with_variant.tsv` - Network and variant mapping data\n",
    "\n",
    "**Additional Requirements:**\n",
    "- Reference genome FASTA file (GRCh38)\n",
    "- BioPython for sequence processing\n",
    "- KEGG_pull for disease information retrieval\n",
    "\n",
    "## Required Packages\n",
    "\n",
    "```bash\n",
    "pip install biopython pandas kegg-pull\n",
    "```\n",
    "\n",
    "## Input Files Expected\n",
    "\n",
    "- `gene_variants.txt` - Variant identifiers from Part 1\n",
    "- `variant_info/*.txt` - Individual variant information files\n",
    "- `chromosomes.fasta` - Reference genome sequences\n",
    "- `final_network_with_variant.tsv` - Network-variant mapping\n",
    "\n",
    "## Output Files Generated\n",
    "\n",
    "- `nt_seq/` - Directory containing reference and variant sequences\n",
    "- `verification.txt` - Quality control results\n",
    "- `diseases.txt` - List of disease identifiers\n",
    "- `disease_info/` - Disease information files\n",
    "- Updated `final_network_with_variant.tsv` with disease names\n",
    "\n",
    "## Important Notes\n",
    "\n",
    "- **Memory Usage**: Processing large genomic sequences requires significant RAM\n",
    "- **Storage**: Generated sequence files can be several GB in size\n",
    "- **Processing Time**: Full pipeline may take several hours depending on dataset size\n",
    "- **Dependencies**: Requires successful completion of KEGG_Data_1.ipynb\n",
    "\n",
    "## Next Steps\n",
    "\n",
    "After completing this notebook, run `KEGG_Data_3.ipynb` for final dataset creation and sequence integration."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccc3ca96",
   "metadata": {},
   "source": [
    "## Configuration\n",
    "\n",
    "Set up paths and parameters for variant processing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28d2629e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Configuration - Update these paths for your environment\n",
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "# Navigate to kegg_data directory\n",
    "data_dir = Path('kegg_data')\n",
    "if not data_dir.exists():\n",
    "    print(\"❌ kegg_data directory not found. Please run KEGG_Data_1.ipynb first.\")\n",
    "    raise FileNotFoundError(\"kegg_data directory missing\")\n",
    "\n",
    "os.chdir(data_dir)\n",
    "\n",
    "# Configuration parameters\n",
    "CONFIG = {\n",
    "    # Input files (should exist from Part 1)\n",
    "    'gene_variants_file': 'gene_variants.txt',\n",
    "    'variant_info_dir': 'variant_info',\n",
    "    'network_data_file': 'final_network_with_variant.tsv',\n",
    "    \n",
    "    # Reference genome (update path as needed)\n",
    "    'reference_fasta': 'chromosomes.fasta',  # Update to your reference genome path\n",
    "    \n",
    "    # Output directories\n",
    "    'nt_seq_dir': 'nt_seq',\n",
    "    'disease_info_dir': 'disease_info',\n",
    "    \n",
    "    # Processing parameters\n",
    "    'sequence_window': 2000,  # Nucleotides around variant\n",
    "    'verification_file': 'verification.txt',\n",
    "    'diseases_file': 'diseases.txt'\n",
    "}\n",
    "\n",
    "# Verify required input files\n",
    "required_files = ['gene_variants.txt', 'final_network_with_variant.tsv']\n",
    "missing_files = []\n",
    "for file in required_files:\n",
    "    if not os.path.exists(file):\n",
    "        missing_files.append(file)\n",
    "\n",
    "if missing_files:\n",
    "    print(f\"❌ Missing required files: {missing_files}\")\n",
    "    print(\"Please run KEGG_Data_1.ipynb first to generate these files.\")\n",
    "else:\n",
    "    print(\"✅ All required input files found\")\n",
    "\n",
    "# Create output directories\n",
    "for dir_name in [CONFIG['nt_seq_dir'], CONFIG['disease_info_dir']]:\n",
    "    Path(dir_name).mkdir(exist_ok=True)\n",
    "\n",
    "print(f\"Working directory: {os.getcwd()}\")\n",
    "print(\"\\n📝 Update CONFIG['reference_fasta'] with path to your reference genome file\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d05a4d10-03de-42ae-89c1-5ddbe77043a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Working directory already set in configuration section above\n",
    "print(f\"Current working directory: {os.getcwd()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "96662dbb-ee2c-4a74-8e45-ab58a3496976",
   "metadata": {},
   "outputs": [],
   "source": [
    "sed -i '' 's/:/_/g' gene_variants.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "db4f4cf2-cd95-4df8-99b6-cc112857502f",
   "metadata": {},
   "outputs": [],
   "source": [
    "while read p; do\n",
    "    if ! grep -q NAME variant_info/$p.txt; then\n",
    "        echo \"$p\"\n",
    "    fi\n",
    "done < gene_variants.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "11959296-d5cb-4fb4-9914-83596dd41c86",
   "metadata": {},
   "outputs": [],
   "source": [
    "while read p; do\n",
    "    if ! grep -q GENE variant_info/$p.txt; then\n",
    "        echo \"$p\"\n",
    "    fi\n",
    "done < gene_variants.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "784d0394-1a14-471a-9def-f4877b4bbd4e",
   "metadata": {},
   "source": [
    "# Pulling Info from the Variant File\n",
    "\n",
    "# Variant Information Parsing\n",
    "\n",
    "This section processes individual variant files to extract structured information including variant names, genes, and types."
   ]
  },
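  {
   "cell_type": "markdown",
   "id": "e7b1a4c0",
   "metadata": {},
   "source": [
    "The parsing logic can be illustrated on a single record. The following is a minimal sketch rather than the notebook's parser: the helper name `parse_variant_record` and the sample text are hypothetical, and the real files in `variant_info/` may use a different layout or contain more fields.\n",
    "\n",
    "```python\n",
    "def parse_variant_record(text):\n",
    "    # Extract NAME, GENE and TYPE from one flat-file record\n",
    "    record = {'name': '', 'gene': '', 'gene_info': '', 'type': ''}\n",
    "    for line in text.splitlines():\n",
    "        line = line.strip()\n",
    "        if line.startswith('NAME'):\n",
    "            record['name'] = line.split('NAME', 1)[-1].strip()\n",
    "        elif line.startswith('GENE'):\n",
    "            # First token is the gene symbol, the rest is free-text annotation\n",
    "            parts = line.split('GENE', 1)[-1].split(maxsplit=1)\n",
    "            if parts:\n",
    "                record['gene'] = parts[0]\n",
    "                record['gene_info'] = parts[1] if len(parts) > 1 else ''\n",
    "        elif line.startswith('TYPE'):\n",
    "            record['type'] = line.split('TYPE', 1)[-1].strip()\n",
    "    return record\n",
    "\n",
    "# Hypothetical sample record (layout assumed for illustration)\n",
    "sample = '''NAME        CDK4 mutation\n",
    "GENE        CDK4  cyclin dependent kinase 4 [KO:K02089]'''\n",
    "print(parse_variant_record(sample))\n",
    "```\n",
    "\n",
    "Fields that are absent from a record stay empty strings, so a record without a `TYPE` line yields an empty `type` value."
   ]
  },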
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62b4167a-6d5a-4120-99fe-5678227db6cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Working directory already set - proceeding with variant information parsing\n",
    "print(f\"Processing variant files from: {CONFIG['variant_info_dir']}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ed32b62-e3a6-4cff-b4ab-a80f04725a1c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import re\n",
    "from pathlib import Path\n",
    "import os\n",
    "\n",
    "# Read all file names from gene_variants.txt\n",
    "gene_variants_file = CONFIG['gene_variants_file']\n",
    "if not os.path.exists(gene_variants_file):\n",
    "    print(f\"❌ Gene variants file not found: {gene_variants_file}\")\n",
    "    print(\"Please run KEGG_Data_1.ipynb first to generate this file\")\n",
    "    raise FileNotFoundError(f\"Gene variants file not found: {gene_variants_file}\")\n",
    "\n",
    "with open(gene_variants_file, 'r') as f:\n",
    "    variant_files = [line.strip() for line in f if line.strip()]\n",
    "\n",
    "print(f\"Processing {len(variant_files)} variant files\")\n",
    "\n",
    "# Initialize an empty DataFrame to collect the results\n",
    "variant_info = pd.DataFrame(columns=[\"Entry\", \"Variant_Name\", \"Variant_Gene\", \"Variant_Gene Info\", \"Variant_Type\"])\n",
    "\n",
    "# Function to extract the value after a keyword (single line, rest of the line)\n",
    "def extract_value(line, key):\n",
    "    return line.split(key, 1)[-1].strip()\n",
    "\n",
    "# Process each variant file\n",
    "variant_info_dir = Path(CONFIG['variant_info_dir'])\n",
    "processed_count = 0\n",
    "not_found_count = 0\n",
    "\n",
    "for file_name in variant_files:\n",
    "    file_path = variant_info_dir / f\"{file_name}.txt\"\n",
    "\n",
    "    try:\n",
    "        with open(file_path, 'r') as f:\n",
    "            lines = f.readlines()\n",
    "\n",
    "        name = \"\"\n",
    "        gene = \"\"\n",
    "        gene_info = \"\"\n",
    "        type_info = \"\"\n",
    "\n",
    "        for line in lines:\n",
    "            line = line.strip()\n",
    "            if line.startswith(\"NAME\"):\n",
    "                name = extract_value(line, \"NAME\")\n",
    "            elif line.startswith(\"GENE\"):\n",
    "                gene_data = extract_value(line, \"GENE\")\n",
    "                if gene_data:\n",
    "                    parts = gene_data.split(maxsplit=1)\n",
    "                    gene = parts[0]\n",
    "                    gene_info = parts[1] if len(parts) > 1 else \"\"\n",
    "            elif line.startswith(\"TYPE\"):\n",
    "                type_info = extract_value(line, \"TYPE\")\n",
    "\n",
    "        row = {\n",
    "            \"Entry\": file_name,\n",
    "            \"Variant_Name\": name,\n",
    "            \"Variant_Gene\": gene,\n",
    "            \"Variant_Gene Info\": gene_info,\n",
    "            \"Variant_Type\": type_info\n",
    "        }\n",
    "\n",
    "        variant_info = pd.concat([variant_info, pd.DataFrame([row])], ignore_index=True)\n",
    "        processed_count += 1\n",
    "        \n",
    "        if processed_count % 100 == 0:\n",
    "            print(f\"Processed {processed_count}/{len(variant_files)} files...\")\n",
    "\n",
    "    except FileNotFoundError:\n",
    "        print(f\"[Warning] File not found: {file_path}\")\n",
    "        not_found_count += 1\n",
    "\n",
    "print(f\"✅ Processing complete: {processed_count} files processed, {not_found_count} files not found\")\n",
    "print(f\"Extracted information for {len(variant_info)} variants\")\n",
    "\n",
    "# Optional: Save the final table\n",
    "# variant_info.to_csv(\"parsed_variant_info.csv\", index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "85e94a07-740d-44cd-a1c3-2330a30b99b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "variant_info[\"Entry\"] = variant_info[\"Entry\"].str.replace(\"hsa_var_\", \"\", regex=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "4fc8fa00-2a28-4bd9-9aed-5c4602969cca",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Entry</th>\n",
       "      <th>Variant_Name</th>\n",
       "      <th>Variant_Gene</th>\n",
       "      <th>Variant_Gene Info</th>\n",
       "      <th>Variant_Type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1019v2</td>\n",
       "      <td>CDK4 mutation</td>\n",
       "      <td>CDK4</td>\n",
       "      <td>cyclin dependent kinase 4 [KO:K02089]</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1027v3</td>\n",
       "      <td>CDKN1B mutation</td>\n",
       "      <td>CDKN1B</td>\n",
       "      <td>cyclin dependent kinase inhibitor 1B [KO:K06624]</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10280v1</td>\n",
       "      <td>SIGMAR1 mutation</td>\n",
       "      <td>SIGMAR1</td>\n",
       "      <td>sigma non-opioid intracellular receptor 1 [KO:...</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1029v2</td>\n",
       "      <td>CDKN2A mutation</td>\n",
       "      <td>CDKN2A</td>\n",
       "      <td>cyclin dependent kinase inhibitor 2A [KO:K06621]</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>11315v1</td>\n",
       "      <td>PARK7 mutation</td>\n",
       "      <td>PARK7</td>\n",
       "      <td>Parkinsonism associated deglycase [KO:K05687]</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>90</th>\n",
       "      <td>9049v1</td>\n",
       "      <td>AIP mutation</td>\n",
       "      <td>AIP</td>\n",
       "      <td>AHR interacting HSP90 co-chaperone [KO:K17767]</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>91</th>\n",
       "      <td>9101v1</td>\n",
       "      <td>USP8 mutation</td>\n",
       "      <td>USP8</td>\n",
       "      <td>ubiquitin specific peptidase 8 [KO:K11839]</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92</th>\n",
       "      <td>9217v1</td>\n",
       "      <td>VAPB mutation</td>\n",
       "      <td>VAPB</td>\n",
       "      <td>VAMP associated protein B and C [KO:K10707]</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>93</th>\n",
       "      <td>9817v1</td>\n",
       "      <td>KEAP1 mutation</td>\n",
       "      <td>KEAP1</td>\n",
       "      <td>kelch like ECH associated protein 1 [KO:K10456]</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>94</th>\n",
       "      <td>999v2</td>\n",
       "      <td>CDH1 mutation</td>\n",
       "      <td>CDH1</td>\n",
       "      <td>cadherin 1 [KO:K05689]</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>95 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      Entry      Variant_Name Variant_Gene  \\\n",
       "0    1019v2     CDK4 mutation         CDK4   \n",
       "1    1027v3   CDKN1B mutation       CDKN1B   \n",
       "2   10280v1  SIGMAR1 mutation      SIGMAR1   \n",
       "3    1029v2   CDKN2A mutation       CDKN2A   \n",
       "4   11315v1    PARK7 mutation        PARK7   \n",
       "..      ...               ...          ...   \n",
       "90   9049v1      AIP mutation          AIP   \n",
       "91   9101v1     USP8 mutation         USP8   \n",
       "92   9217v1     VAPB mutation         VAPB   \n",
       "93   9817v1    KEAP1 mutation        KEAP1   \n",
       "94    999v2     CDH1 mutation         CDH1   \n",
       "\n",
       "                                    Variant_Gene Info Variant_Type  \n",
       "0               cyclin dependent kinase 4 [KO:K02089]               \n",
       "1    cyclin dependent kinase inhibitor 1B [KO:K06624]               \n",
       "2   sigma non-opioid intracellular receptor 1 [KO:...               \n",
       "3    cyclin dependent kinase inhibitor 2A [KO:K06621]               \n",
       "4       Parkinsonism associated deglycase [KO:K05687]               \n",
       "..                                                ...          ...  \n",
       "90     AHR interacting HSP90 co-chaperone [KO:K17767]               \n",
       "91         ubiquitin specific peptidase 8 [KO:K11839]               \n",
       "92        VAMP associated protein B and C [KO:K10707]               \n",
       "93    kelch like ECH associated protein 1 [KO:K10456]               \n",
       "94                             cadherin 1 [KO:K05689]               \n",
       "\n",
       "[95 rows x 5 columns]"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variant_info"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "485ddbd6",
   "metadata": {},
   "source": [
    "# Creating the Nt Variant Database\n",
    "\n",
    "# Nucleotide Sequence Database Creation\n",
    "\n",
    "This section creates nucleotide sequences with genomic context around each variant, generating both reference and mutated sequences for downstream analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8dba21b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Working directory already set - proceeding with nucleotide sequence processing\n",
    "print(\"Starting nucleotide variant database creation...\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8cf9f795",
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio import SeqIO\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0bc18349",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "# Load network and variant data\n",
    "network_file = CONFIG['network_data_file']\n",
    "if not os.path.exists(network_file):\n",
    "    print(f\"❌ Network data file not found: {network_file}\")\n",
    "    print(\"Please run KEGG_Data_1.ipynb first to generate this file\")\n",
    "    raise FileNotFoundError(f\"Network data not found: {network_file}\")\n",
    "\n",
    "variant_data = pd.read_csv(network_file, sep='\\t')\n",
    "print(f\"✅ Loaded variant data: {len(variant_data)} entries\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "65dde804",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1449"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(variant_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c042831c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'N00073'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variant_data.iloc[1][\"Network\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92e4699c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio import SeqIO\n",
    "import os\n",
    "\n",
    "# Assuming CONFIG is defined somewhere earlier in the code\n",
    "# CONFIG = {'reference_fasta': 'path_to_your_fasta_file'}\n",
    "\n",
    "# Load reference genome sequences\n",
    "fasta_file = CONFIG['reference_fasta']\n",
    "if not os.path.exists(fasta_file):\n",
    "    print(f\"❌ Reference genome file not found: {fasta_file}\")\n",
    "    print(\"Please update CONFIG['reference_fasta'] with the correct path to your reference genome\")\n",
    "    print(\"Download from: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/\")\n",
    "    raise FileNotFoundError(f\"Reference genome not found: {fasta_file}\")\n",
    "\n",
    "print(f\"Loading reference genome from: {fasta_file}\")\n",
    "record_dict = SeqIO.to_dict(SeqIO.parse(fasta_file, \"fasta\"))\n",
    "print(f\"✅ Loaded {len(record_dict)} chromosome sequences\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "c2efa951",
   "metadata": {},
   "outputs": [],
   "source": [
    "chromosome_dictionary = {\n",
    "    \"1\": \"NC_000001.11\",\n",
    "    \"2\": \"NC_000002.12\",\n",
    "    \"3\": \"NC_000003.12\",\n",
    "    \"4\": \"NC_000004.12\",\n",
    "    \"5\": \"NC_000005.10\",\n",
    "    \"6\": \"NC_000006.12\",\n",
    "    \"7\": \"NC_000007.14\",\n",
    "    \"9\": \"NC_000009.12\",\n",
    "    \"10\": \"NC_000010.11\",\n",
    "    \"11\": \"NC_000011.10\",\n",
    "    \"12\": \"NC_000012.12\",\n",
    "    \"13\": \"NC_000013.11\",\n",
    "    \"14\": \"NC_000014.9\",\n",
    "    \"15\": \"NC_000015.10\",\n",
    "    \"16\": \"NC_000016.10\",\n",
    "    \"17\": \"NC_000017.11\",\n",
    "    \"18\": \"NC_000018.10\",\n",
    "    \"19\": \"NC_000019.10\",\n",
    "    \"20\": \"NC_000020.11\",\n",
    "    \"21\": \"NC_000021.9\",\n",
    "    \"23\": \"NC_000023.11\"\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1323f95",
   "metadata": {},
   "source": [
    "### Verification that the reference is present at the exact position I have in my data"
   ]
  },
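  {
   "cell_type": "markdown",
   "id": "e7b1a4c1",
   "metadata": {},
   "source": [
    "The check compares the recorded reference allele with the bases found in the genome at the same coordinates. A minimal sketch on a toy sequence, assuming a 1-based input position; the function name `ref_matches` and the toy values are illustrative, and which entries in the real data are 1-based is something the verification output should confirm:\n",
    "\n",
    "```python\n",
    "def ref_matches(chrom_seq, pos_1based, ref_allele):\n",
    "    # Convert the 1-based position to a 0-based slice start\n",
    "    start = pos_1based - 1\n",
    "    observed = chrom_seq[start:start + len(ref_allele)]\n",
    "    return observed.upper() == ref_allele.upper()\n",
    "\n",
    "toy_chrom = 'ACGTACGT'\n",
    "print(ref_matches(toy_chrom, 3, 'GT'))  # True: bases 3-4 are 'GT'\n",
    "print(ref_matches(toy_chrom, 3, 'CG'))  # False: 'CG' starts one base earlier\n",
    "```\n",
    "\n",
    "A mismatch that disappears when the position is shifted by one usually points to a coordinate-convention problem rather than a wrong allele."
   ]
  },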
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0ec0979",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Verify reference sequences against genome\n",
    "verification_file = CONFIG['verification_file']\n",
    "print(f\"Starting sequence verification - results will be saved to: {verification_file}\")\n",
    "\n",
    "with open(verification_file, \"w\") as f:\n",
    "    for i in range(len(variant_data)):\n",
    "        # ---- Input ----\n",
    "        chromosome_id = chromosome_dictionary[str(variant_data.iloc[i]['Chr'])]\n",
    "        if (variant_data.iloc[i]['TranscriptID'][:4] == \"ENST\"):\n",
    "            start = variant_data.iloc[i]['Start'] - 1\n",
    "        else:\n",
    "            start = variant_data.iloc[i]['Start']\n",
    "        reference_allele = variant_data.iloc[i]['RefAllele']\n",
    "        end = len(reference_allele) + start\n",
    "\n",
    "        chrom_seq = record_dict[chromosome_id].seq\n",
    "\n",
    "        # Adjust for 0-based indexing in Python\n",
    "        genomic_ref = chrom_seq[start: start + len(reference_allele)]\n",
    "\n",
    "        if genomic_ref.upper() != reference_allele.upper():\n",
    "            f.write(f\"⚠️ Warning: Entry number {i} with variant {variant_data.iloc[i]['ID']} expected '{reference_allele}', but found '{genomic_ref}'\\n\")\n",
    "        else:\n",
    "            f.write(f\"✅ Verified: {chromosome_id}:{start}-{end} → '{reference_allele}' matches genome\\n\")\n",
    "        \n",
    "        if (i + 1) % 100 == 0:\n",
    "            print(f\"Verified {i + 1}/{len(variant_data)} variants...\")\n",
    "\n",
    "print(f\"✅ Verification complete. Results saved to: {verification_file}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39174efe",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "# Assuming CONFIG is defined somewhere above in the code\n",
    "# CONFIG = {'nt_seq_dir': 'desired/path/to/nt_seq'}\n",
    "\n",
    "# Create nucleotide sequence directory\n",
    "nt_seq_dir = Path(CONFIG['nt_seq_dir'])\n",
    "nt_seq_dir.mkdir(exist_ok=True)\n",
    "print(f\"Created nucleotide sequence directory: {nt_seq_dir}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3065cf9d",
   "metadata": {},
   "source": [
    "### Performing the mutation and saving the reference and variant allele with a 1000 nt window"
   ]
  },
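  {
   "cell_type": "markdown",
   "id": "e7b1a4c2",
   "metadata": {},
   "source": [
    "The substitution and deletion logic can be sketched on a toy sequence. This is an illustration under stated assumptions, not the pipeline code: `apply_variant` is a hypothetical helper, `start` is assumed to be 0-based, and plain strings stand in for Biopython sequences.\n",
    "\n",
    "```python\n",
    "def apply_variant(chrom_seq, start, ref, alt, window):\n",
    "    # Extract the reference allele plus `window` flanking bases on each side\n",
    "    region_start = max(0, start - window)\n",
    "    region_end = start + len(ref) + window\n",
    "    ref_region = chrom_seq[region_start:region_end]\n",
    "    # Position of the variant inside the extracted region\n",
    "    offset = start - region_start\n",
    "    assert ref_region[offset:offset + len(ref)] == ref, 'reference mismatch'\n",
    "    # A deletion removes the reference allele; anything else replaces it\n",
    "    inserted = '' if alt == 'deletion' else alt\n",
    "    mutated = ref_region[:offset] + inserted + ref_region[offset + len(ref):]\n",
    "    return ref_region, mutated\n",
    "\n",
    "toy = 'AAACCCGGGTTT'\n",
    "print(apply_variant(toy, 4, 'C', 'T', 3))           # ('AACCCGG', 'AACTCGG')\n",
    "print(apply_variant(toy, 3, 'CCC', 'deletion', 3))  # ('AAACCCGGG', 'AAAGGG')\n",
    "```\n",
    "\n",
    "Computing `offset` from `region_start` rather than assuming it equals `window` keeps the edit in the right place when a variant lies closer than `window` bases to the start of the sequence."
   ]
  },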
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6121945f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate nucleotide sequences with mutations\n",
    "nt_seq_dir = CONFIG['nt_seq_dir']\n",
    "window = CONFIG['sequence_window']\n",
    "\n",
    "print(f\"Generating nucleotide sequences with {window}bp windows...\")\n",
    "print(f\"Output directory: {nt_seq_dir}\")\n",
    "\n",
    "for i in range(len(variant_data)):\n",
    "    output_file = f\"{nt_seq_dir}/{variant_data.iloc[i]['Var_ID']}.txt\"\n",
    "    \n",
    "    with open(output_file, \"w\") as f:\n",
    "        # ---- Input ----\n",
    "        chromosome_id = chromosome_dictionary[str(variant_data.iloc[i]['Chr'])]\n",
    "        if (variant_data.iloc[i]['TranscriptID'][:4] == \"ENST\"):\n",
    "            start = variant_data.iloc[i]['Start'] - 1\n",
    "        else:\n",
    "            start = variant_data.iloc[i]['Start']\n",
    "        reference_allele = variant_data.iloc[i]['RefAllele']\n",
    "        variant_allele = variant_data.iloc[i]['AltAllele']\n",
    "\n",
    "        end = len(reference_allele) + start\n",
    "        \n",
    "        chrom_seq = record_dict[chromosome_id].seq\n",
    "\n",
    "        # Extract region\n",
    "        region_start = max(0, start - window)\n",
    "        region_end = end + window\n",
    "\n",
    "        ref_seq = chrom_seq[region_start:region_end]\n",
    "    \n",
    "        if (variant_allele == \"deletion\"):\n",
    "            # Apply mutation\n",
    "            mutated_seq = ref_seq[:window] + ref_seq[window + len(reference_allele):]\n",
    "    \n",
    "            f.write(f\">{variant_data.iloc[i]['ID']}_reference_{reference_allele}\\n\")\n",
    "            f.write(f\"{ref_seq}\\n\")\n",
    "            f.write(f\">{variant_data.iloc[i]['ID']}_variant_{variant_allele}\\n\")\n",
    "            f.write(f\"{mutated_seq}\\n\")\n",
    "        else:\n",
    "            del_len = len(reference_allele)\n",
    "            # Apply mutation\n",
    "            mutated_seq = ref_seq[:window] + variant_allele + ref_seq[window + del_len:]\n",
    "    \n",
    "            f.write(f\">{variant_data.iloc[i]['ID']}_reference_{reference_allele}\\n\")\n",
    "            f.write(f\"{ref_seq}\\n\")\n",
    "            f.write(f\">{variant_data.iloc[i]['ID']}_variant_{variant_allele}\\n\")\n",
    "            f.write(f\"{mutated_seq}\\n\")\n",
    "    \n",
    "    if (i + 1) % 100 == 0:\n",
    "        print(f\"Generated sequences for {i + 1}/{len(variant_data)} variants...\")\n",
    "\n",
    "print(f\"✅ Sequence generation complete. {len(variant_data)} sequence files created in {nt_seq_dir}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a83e9272-b34f-40f3-aedf-3aca0795944f",
   "metadata": {},
   "source": [
    "# Adding in more Variant Data\n",
    "\n",
    "# Data Integration\n",
    "\n",
    "This section merges variant information with the main dataset to create a comprehensive database with all relevant annotations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "9222e45a-7f9a-4762-8dd8-2cccc654ad3e",
   "metadata": {},
   "outputs": [],
   "source": [
    "final_data = variant_data.merge(variant_info, on='Entry')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae6d44d0-d1f2-4d41-b59d-f8c5888b4914",
   "metadata": {},
   "outputs": [],
   "source": [
    "final_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ab406cd-e9be-4885-811a-f3e2526efe8a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save merged variant data\n",
    "output_file = CONFIG['network_data_file']\n",
    "final_data.to_csv(output_file, sep='\\t', header=True, index=False)\n",
    "print(f\"✅ Final variant data with merged information saved to: {output_file}\")\n",
    "print(f\"Dataset contains {len(final_data)} variants with complete information\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ecb5318-ab15-4625-b556-50f8ff39cff3",
   "metadata": {},
   "source": [
    "# Pulling Disease info\n",
    "\n",
    "# Disease Information Processing\n",
    "\n",
    "This section extracts disease identifiers from the variant data and downloads corresponding disease information from KEGG to create human-readable disease names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "b266aa61-7a7f-49c7-a737-578b51b95f32",
   "metadata": {},
   "outputs": [],
   "source": [
    "import ast"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "a7aa0417-b1c2-40c9-ad67-f2077d1f1d3e",
   "metadata": {},
   "outputs": [],
   "source": [
    "diseases = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "a0865917-9074-43f4-98a1-74bdb456b2e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in range(len(final_data)):\n",
    "    diseases.extend(list(ast.literal_eval(final_data['Disease'][i]).keys()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "id": "8b469aee-d8fb-439d-a8bc-e8cb113ddc8f",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "disease = set(diseases)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e461b5d7-2200-4dbb-b640-ffd6bf2e3ac2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save disease identifiers to file\n",
    "diseases_file = CONFIG['diseases_file']\n",
    "with open(diseases_file, 'w') as f:\n",
    "    for disease_id in disease:\n",
    "        f.write(f\"{disease_id}\\n\")\n",
    "        \n",
    "print(f\"✅ Saved {len(disease)} unique disease identifiers to: {diseases_file}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d079c88f-9e8b-4f80-bf6c-5d9a49155b86",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Working directory already set - proceeding with disease information retrieval\n",
    "print(\"Starting disease information processing...\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "10d814f3-66ec-4580-866e-2cc2fda34109",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "disease          KEGG Disease Database\n",
      "ds               Release 114.0+/04-28, Apr 25\n",
      "                 Kanehisa Laboratories\n",
      "                 2,912 entries\n",
      "\n",
      "linked db        pathway\n",
      "                 brite\n",
      "                 ko\n",
      "                 hsa\n",
      "                 genome\n",
      "                 network\n",
      "                 variant\n",
      "                 drug\n",
      "                 pubmed\n",
      "\n"
     ]
    }
   ],
   "source": [
    "kegg_pull rest info disease"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f095524-d58f-4869-9d1b-5459de85329d",
   "metadata": {},
   "outputs": [],
   "source": [
    "kegg_pull --full-help"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5ed80556-4df8-4f0b-8c3e-2a6458c6dd6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "# Assuming CONFIG is defined somewhere earlier in the code\n",
    "# CONFIG = {'disease_info_dir': 'desired/path/to/disease_info'}\n",
    "\n",
    "# Create disease information directory\n",
    "disease_dir = Path(CONFIG['disease_info_dir'])\n",
    "disease_dir.mkdir(exist_ok=True)\n",
    "print(f\"Created disease information directory: {disease_dir}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96851b67-0689-4aa0-9208-a0cdabf95425",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100%|███████████████████████████████████████████| 44/44 [00:06<00:00,  6.56it/s]\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "# Download disease information using kegg_pull\n",
    "diseases_file = CONFIG['diseases_file']\n",
    "disease_output_dir = CONFIG['disease_info_dir']\n",
    "\n",
    "if not os.path.exists(diseases_file):\n",
    "    print(f\"❌ Diseases file not found: {diseases_file}\")\n",
    "    print(\"Please run the previous cells to generate the diseases list\")\n",
    "else:\n",
    "    print(f\"Downloading disease information for entries in: {diseases_file}\")\n",
    "    print(f\"Output directory: {disease_output_dir}\")\n",
    "    # Run the command to download disease information\n",
    "    !cat {diseases_file} | kegg_pull pull entry-ids - --output={disease_output_dir}\n",
    "    print(\"✅ Disease information download complete\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c01f97c-6376-4266-97e9-1d29ef207a51",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Processing disease information files\n",
    "print(\"Parsing disease information from KEGG files...\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ea01eac-ee3c-4a5e-9863-2fb061291b45",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parse disease information from downloaded files\n",
    "diseases_file = CONFIG['diseases_file']\n",
    "disease_info_dir = Path(CONFIG['disease_info_dir'])\n",
    "\n",
    "# Read all disease identifiers from diseases.txt\n",
    "with open(diseases_file, 'r') as f:\n",
    "    disease_files = [line.strip() for line in f if line.strip()]\n",
    "\n",
    "print(f\"Processing {len(disease_files)} disease information files...\")\n",
    "\n",
    "# Map each disease ID to its primary name\n",
    "disease_info = {}\n",
    "\n",
    "# Extract the value after a keyword. KEGG ends a NAME line with ';' when\n",
    "# synonyms follow on later lines, so drop that trailing separator.\n",
    "def extract_value(line, key):\n",
    "    return line.split(key, 1)[-1].strip().rstrip(';').strip()\n",
    "\n",
    "# Process each disease file\n",
    "processed_count = 0\n",
    "not_found_count = 0\n",
    "\n",
    "for disease_id in disease_files:\n",
    "    file_path = disease_info_dir / f'{disease_id}.txt'\n",
    "\n",
    "    try:\n",
    "        with open(file_path, 'r') as f:\n",
    "            lines = f.readlines()\n",
    "\n",
    "        name = \"\"\n",
    "\n",
    "        for line in lines:\n",
    "            line = line.strip()\n",
    "            if line.startswith(\"NAME\"):\n",
    "                name = extract_value(line, \"NAME\")\n",
    "                break  # No need to check other lines once NAME is found\n",
    "\n",
    "        # Save into dictionary: key = disease_id, value = name\n",
    "        disease_info[disease_id] = name\n",
    "        processed_count += 1\n",
    "        \n",
    "        if processed_count % 50 == 0:\n",
    "            print(f\"Processed {processed_count}/{len(disease_files)} disease files...\")\n",
    "\n",
    "    except FileNotFoundError:\n",
    "        print(f\"[Warning] File not found: {file_path}\")\n",
    "        not_found_count += 1\n",
    "\n",
    "print(f\"✅ Disease processing complete: {processed_count} processed, {not_found_count} not found\")\n",
    "print(f\"Extracted disease information for {len(disease_info)} diseases\")\n",
    "\n",
    "# Optional: Save the dictionary to a file (like JSON)\n",
    "# import json\n",
    "# with open('disease_info.json', 'w') as f:\n",
    "#     json.dump(disease_info, f, indent=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "4dfb4f25-776e-45c6-9eda-457b13cd77bf",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'H00135': 'Krabbe disease;',\n",
       " 'H01398': 'Primary hyperammonemia (Urea cycle disorders)',\n",
       " 'H00032': 'Thyroid cancer',\n",
       " 'H00559': 'von Hippel-Lindau syndrome',\n",
       " 'H00260': 'Pigmented micronodular adrenocortical disease',\n",
       " 'H00038': 'Melanoma',\n",
       " 'H00485': 'Robinow syndrome',\n",
       " 'H00251': 'Thyroid dyshormonogenesis;',\n",
       " 'H00194': 'Lesch-Nyhan syndrome;',\n",
       " 'H00026': 'Endometrial cancer',\n",
       " 'H00020': 'Colorectal cancer',\n",
       " 'H00031': 'Breast cancer',\n",
       " 'H02049': 'Bilateral macronodular adrenal hyperplasia',\n",
       " 'H00042': 'Glioma',\n",
       " 'H00063': 'Spinocerebellar ataxia (SCA)',\n",
       " 'H00195': 'Adenine phosphoribosyltransferase deficiency;',\n",
       " 'H00033': 'Adrenal carcinoma',\n",
       " 'H00048': 'Hepatocellular carcinoma;',\n",
       " 'H01522': 'Zollinger-Ellison syndrome',\n",
       " 'H00019': 'Pancreatic cancer',\n",
       " 'H00004': 'Chronic myeloid leukemia',\n",
       " 'H00058': 'Amyotrophic lateral sclerosis (ALS);',\n",
       " 'H00022': 'Bladder cancer',\n",
       " 'H00056': 'Alzheimer disease;',\n",
       " 'H01032': 'N-acetylglutamate synthase deficiency',\n",
       " 'H00247': 'Multiple endocrine neoplasia syndrome;',\n",
       " 'H00246': 'Primary hyperparathyroidism;',\n",
       " 'H00039': 'Basal cell carcinoma',\n",
       " 'H00021': 'Renal cell carcinoma',\n",
       " 'H00013': 'Small cell lung cancer',\n",
       " 'H00003': 'Acute myeloid leukemia',\n",
       " 'H00018': 'Gastric cancer',\n",
       " 'H01603': 'Primary aldosteronism',\n",
       " 'H00061': 'Prion disease',\n",
       " 'H00014': 'Non-small cell lung cancer',\n",
       " 'H00423': 'Sphingolipidosis',\n",
       " 'H00024': 'Prostate cancer',\n",
       " 'H01102': 'Pituitary adenomas',\n",
       " 'H00034': 'Carcinoid',\n",
       " 'H00059': 'Huntington disease',\n",
       " 'H01431': 'Cushing syndrome',\n",
       " 'H00057': 'Parkinson disease',\n",
       " 'H00126': 'Gaucher disease',\n",
       " 'H02221': 'Methylmalonic aciduria and homocystinuria'}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "disease_info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "458ca725-03e8-4b2a-98e7-f418f40190fb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reload variant data for disease processing\n",
    "variant_data = pd.read_csv(CONFIG['network_data_file'], sep='\\t')\n",
    "print(f\"Processing disease information for {len(variant_data)} variants\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "e86ddd65-cbde-42d3-be6f-cbc54e2dda06",
   "metadata": {},
   "outputs": [],
   "source": [
    "import ast\n",
    "\n",
    "# Assume disease_info is already a dictionary {\"D001\": \"Cancer\", \"D002\": \"Diabetes\", ...}\n",
    "\n",
    "# Create a new column to store disease dictionaries\n",
    "variant_data[\"Disease_Names\"] = \"\"\n",
    "\n",
    "# Process each row\n",
    "for idx, row in variant_data.iterrows():\n",
    "    try:\n",
    "        # Convert the string dictionary into a real dictionary\n",
    "        disease_dict = ast.literal_eval(row[\"Disease\"])\n",
    "\n",
    "        # Get the disease IDs (keys)\n",
    "        disease_ids = disease_dict.keys()\n",
    "\n",
    "        # Build a new dictionary: {disease_id: disease_name}\n",
    "        disease_names_dict = {did: disease_info.get(did, \"\") for did in disease_ids}\n",
    "\n",
    "        # Save it into the Disease_Names column\n",
    "        variant_data.at[idx, \"Disease_Names\"] = disease_names_dict\n",
    "\n",
    "    except (ValueError, SyntaxError):\n",
    "        print(f\"[Warning] Couldn't parse disease info at row {idx}\")\n",
    "        variant_data.at[idx, \"Disease_Names\"] = {}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06a29f96-56b2-46b2-897e-d7006dd0ae52",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save updated variant data with disease names (overwrites the input network data file)\n",
    "output_file = CONFIG['network_data_file']\n",
    "variant_data.to_csv(output_file, sep='\\t', header=True, index=False)\n",
    "print(f\"✅ Updated variant data saved to: {output_file}\")\n",
    "print(f\"Dataset now includes disease names for {len(variant_data)} variants\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}