{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "83c9cd1f",
   "metadata": {},
   "source": [
    "## Setup and Data Preparation\n",
    "\n",
    "Initial setup steps to prepare the working environment and extract ClinVar data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81a36253-9050-4d58-96cd-8238aae51e0e",
   "metadata": {},
   "source": [
    "# ClinVar Coding Variants Data Processing\n",
    "\n",
    "This notebook processes ClinVar coding variants data by extracting additional information including gene names, gene IDs, and associated diseases from ClinVar XML records.\n",
    "\n",
    "## Overview\n",
    "\n",
    "The workflow includes:\n",
    "1. **Data Extraction**: Filter ClinVar entries from VEP-annotated pathogenic coding variants\n",
    "2. **XML Processing**: Parse ClinVar XML records to extract gene and disease information\n",
    "3. **Gene Annotation**: Map gene IDs to gene names using NCBI Entrez utilities\n",
    "4. **Data Integration**: Combine all information into a comprehensive dataset\n",
    "\n",
    "## Requirements\n",
    "\n",
    "- Python 3.7+\n",
    "- pandas library\n",
    "- xml.etree.ElementTree (built-in)\n",
    "- NCBI Entrez Direct tools (for gene name mapping)\n",
    "- Input data: VEP-annotated pathogenic coding variants CSV file\n",
    "\n",
    "## Data Structure\n",
    "\n",
    "The processing creates a dataset with the following key columns:\n",
    "- Variant information (chromosome, position, alleles)\n",
    "- ClinVar ID and significance\n",
    "- Gene symbols and IDs\n",
    "- Associated disease/phenotype information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb351234-50a3-4061-81ce-bdce5343e790",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create working directory for ClinVar data processing\n",
    "import os\n",
    "os.makedirs('clinvar', exist_ok=True)\n",
    "print(\"✅ Created 'clinvar' directory\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "443ccab8-50a1-45ae-950c-8425eb318e93",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Navigate to clinvar directory\n",
    "os.chdir('clinvar')\n",
    "print(f\"📁 Current working directory: {os.getcwd()}\")\n",
    "\n",
    "with open('vep_pathogenic_coding.csv') as infile, open('clinvar_coding_raw.csv', 'w') as outfile:\n",
    "    for line in infile:\n",
    "        if 'ClinVar' in line:\n",
    "            outfile.write(line)"
   ]
  },
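  {
   "cell_type": "markdown",
   "id": "c7a41e90",
   "metadata": {},
   "source": [
    "The line filter above keeps any row that mentions `ClinVar` anywhere. If the input carries a `SOURCE` column, as the preview tables later in this notebook suggest, the same selection can be made on that column alone. The sketch below is only an illustration under that assumption: the helper name `filter_clinvar_rows`, the in-memory sample, and its `Other` row are made up for the example.\n",
    "\n",
    "```python\n",
    "import io\n",
    "import pandas as pd\n",
    "\n",
    "def filter_clinvar_rows(csv_source):\n",
    "    # Hypothetical helper: keep rows whose SOURCE column is exactly 'ClinVar'\n",
    "    df = pd.read_csv(csv_source)\n",
    "    return df[df['SOURCE'] == 'ClinVar'].reset_index(drop=True)\n",
    "\n",
    "# Small stand-in for vep_pathogenic_coding.csv; the 'Other' row is invented\n",
    "sample_csv = io.StringIO('''CHROM,POS,REF,ALT,LABEL,SOURCE,CONSEQUENCE,ID\n",
    "chr1,976215,A,G,Pathogenic,ClinVar,missense_variant,1320032\n",
    "chr1,1000,C,T,Pathogenic,Other,missense_variant,999\n",
    "chr1,1050449,G,A,Pathogenic,ClinVar,missense_variant,1284257\n",
    "''')\n",
    "\n",
    "filtered = filter_clinvar_rows(sample_csv)\n",
    "print(filtered[['CHROM', 'POS', 'SOURCE', 'ID']])\n",
    "```\n",
    "\n",
    "Because the file is parsed as CSV, the header row is kept automatically and `filtered.to_csv('clinvar_coding_raw.csv', index=False)` would write a file that loads with named columns."
   ]
  },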
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1f92675-b85c-4baa-8680-9c3776e04ac9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract ClinVar entries from VEP-annotated pathogenic coding variants\n",
    "# Note: Update the input file path to match your data location\n",
    "input_file = \"../data/vep_pathogenic_coding.csv\"  # Adjust path as needed\n",
    "output_file = \"clinvar_coding_raw.csv\"\n",
    "\n",
    "# Use shell command to filter ClinVar entries\n",
    "import subprocess\n",
    "try:\n",
    "    result = subprocess.run(\n",
    "        [\"grep\", \"ClinVar\", input_file],\n",
    "        capture_output=True,\n",
    "        text=True,\n",
    "        check=True\n",
    "    )\n",
    "    \n",
    "    with open(output_file, 'w') as f:\n",
    "        f.write(result.stdout)\n",
    "    \n",
    "    print(f\"✅ Extracted ClinVar entries to {output_file}\")\n",
    "    print(f\"📊 Found {len(result.stdout.strip().split('\\n'))} ClinVar entries\")\n",
    "    \n",
    "except subprocess.CalledProcessError:\n",
    "    print(f\"❌ Error: Could not find ClinVar entries in {input_file}\")\n",
    "    print(\"Please ensure the input file exists and contains ClinVar annotations\")\n",
    "except FileNotFoundError:\n",
    "    print(f\"❌ Error: Input file {input_file} not found\")\n",
    "    print(\"Please update the input_file path to point to your VEP-annotated data\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e560308-135b-4189-9146-ff50845839a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract ClinVar IDs from the filtered data (assuming ID is in column 8)\n",
    "# Note: Adjust column number if your data structure is different\n",
    "import pandas as pd\n",
    "\n",
    "try:\n",
    "    # Read the raw ClinVar data to determine structure\n",
    "    df_temp = pd.read_csv(\"clinvar_coding_raw.csv\")\n",
    "    print(f\"📋 Data shape: {df_temp.shape}\")\n",
    "    print(f\"📋 Columns: {list(df_temp.columns)}\")\n",
    "    \n",
    "    # Extract ClinVar IDs (adjust column index as needed)\n",
    "    # Column 8 corresponds to index 7 in Python (0-based)\n",
    "    if df_temp.shape[1] >= 8:\n",
    "        clinvar_ids = df_temp.iloc[:, 7]  # 8th column (0-based index 7)\n",
    "        \n",
    "        # Save IDs to file\n",
    "        with open(\"Clinvar_ID.txt\", 'w') as f:\n",
    "            for id_val in clinvar_ids:\n",
    "                if pd.notna(id_val):\n",
    "                    f.write(f\"{id_val}\\n\")\n",
    "        \n",
    "        print(f\"✅ Extracted {len(clinvar_ids.dropna())} ClinVar IDs to Clinvar_ID.txt\")\n",
    "    else:\n",
    "        print(f\"❌ Error: Expected at least 8 columns, found {df_temp.shape[1]}\")\n",
    "        \n",
    "except FileNotFoundError:\n",
    "    print(\"❌ Error: clinvar_coding_raw.csv not found\")\n",
    "    print(\"Please run the previous cell first to extract ClinVar data\")\n",
    "except Exception as e:\n",
    "    print(f\"❌ Error processing ClinVar data: {e}\")"
   ]
  },
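  {
   "cell_type": "markdown",
   "id": "4be2d7f3",
   "metadata": {},
   "source": [
    "Selecting the ID by position breaks silently if the column order changes. The preview tables later in this notebook show a column named `ID`, so a name-based lookup may be more robust. The sketch below assumes that column name; `extract_ids` is a hypothetical helper, not part of the original pipeline. It also converts whole-number floats back to integers, because pandas stores an integer column as floats once it contains missing values, which would otherwise write IDs such as `18241.0`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "def extract_ids(df, column='ID'):\n",
    "    # Hypothetical helper: unique, non-null IDs from a named column, as strings\n",
    "    ids = df[column].dropna()\n",
    "    # A column with NaN is stored as float, so 18241 would print as 18241.0\n",
    "    ids = ids.map(lambda v: str(int(v)) if isinstance(v, float) else str(v))\n",
    "    return ids.drop_duplicates().tolist()\n",
    "\n",
    "# Demo frame with one missing value and one duplicate\n",
    "demo = pd.DataFrame({'ID': [1320032, 18241, None, 18241]})\n",
    "id_list = extract_ids(demo)\n",
    "print(id_list)\n",
    "```\n",
    "\n",
    "Dropping duplicates keeps the later download step from fetching the same record twice."
   ]
  },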
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53b0dfd8-8d49-4c3f-adb4-4c6bfbffcfa9",
   "metadata": {},
   "outputs": [],
   "source": [
    "chmod +x Clinvar_esearch.sh\n",
    "\n",
    "## XML Data Retrieval\n",
    "\n",
    "**Note**: This step requires creating a shell script (`Clinvar_esearch.sh`) to fetch XML data from NCBI.\n",
    "\n",
    "The script should:\n",
    "1. Read ClinVar IDs from `Clinvar_ID.txt`\n",
    "2. Use NCBI Entrez Direct tools to fetch XML records\n",
    "3. Save XML files in a `data/` subdirectory\n",
    "\n",
    "Example script content:\n",
    "```bash\n",
    "#!/bin/bash\n",
    "mkdir -p data\n",
    "while read -r id; do\n",
    "    esearch -db clinvar -query \"$id\" | efetch -format xml > \"data/${id}.xml\"\n",
    "    echo \"Downloaded XML for ClinVar ID: $id\"\n",
    "done < Clinvar_ID.txt\n",
    "```\n",
    "\n",
    "**Prerequisites**: Install NCBI Entrez Direct tools:\n",
    "- macOS: `brew install brewsci/bio/edirect`\n",
    "- Linux: Follow NCBI EDirect installation guide"
   ]
  },
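  {
   "cell_type": "markdown",
   "id": "9d35a6c1",
   "metadata": {},
   "source": [
    "With tens of thousands of IDs (the preview tables in this notebook show 22,254 rows), the download loop may be interrupted partway. The sketch below lists the IDs that still lack a usable XML file so that only those need to be fetched again. It assumes the `data/<id>.xml` naming used in the example script; `missing_ids` is a hypothetical helper. An empty file counts as missing because the shell redirect creates the file before `efetch` writes anything.\n",
    "\n",
    "```python\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "def missing_ids(ids, xml_dir):\n",
    "    # Hypothetical helper: IDs without a non-empty '<id>.xml' file in xml_dir\n",
    "    xml_dir = Path(xml_dir)\n",
    "    pending = []\n",
    "    for clinvar_id in ids:\n",
    "        xml_file = xml_dir / f'{clinvar_id}.xml'\n",
    "        # Treat absent and zero-byte files the same way\n",
    "        if not xml_file.is_file() or xml_file.stat().st_size == 0:\n",
    "            pending.append(clinvar_id)\n",
    "    return pending\n",
    "\n",
    "# Demonstration in a throwaway directory instead of the real data/ folder\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    Path(tmp, '18241.xml').write_text('<record/>')\n",
    "    Path(tmp, '96692.xml').write_text('')  # simulate an empty download\n",
    "    todo = missing_ids(['18241', '96692', '60484'], tmp)\n",
    "\n",
    "print(todo)\n",
    "```\n",
    "\n",
    "Writing the returned list back to a text file gives the shell script a shorter input for the retry run."
   ]
  },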
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0755ad6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parsing XML for Gene and Disease\n",
    "\n",
    "# Make the ClinVar search script executable and run it\n",
    "# Note: This assumes you have created the Clinvar_esearch.sh script\n",
    "\n",
    "import os\n",
    "import subprocess\n",
    "\n",
    "script_path = \"Clinvar_esearch.sh\"\n",
    "\n",
    "if os.path.exists(script_path):\n",
    "    # Make script executable\n",
    "    os.chmod(script_path, 0o755)\n",
    "    print(f\"✅ Made {script_path} executable\")\n",
    "    \n",
    "    # Optionally run the script (uncomment if you want to execute automatically)\n",
    "    # print(\"🚀 Running ClinVar XML download script...\")\n",
    "    # result = subprocess.run([f\"./{script_path}\"], capture_output=True, text=True)\n",
    "    # if result.returncode == 0:\n",
    "    #     print(\"✅ XML download completed successfully\")\n",
    "    # else:\n",
    "    #     print(f\"❌ Script execution failed: {result.stderr}\")\n",
    "else:\n",
    "    print(f\"⚠️ Warning: {script_path} not found\")\n",
    "    print(\"Please create this script manually to download ClinVar XML data\")\n",
    "    print(\"See the documentation in the previous cell for script template\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d21a188b-a0dc-4af2-9b71-5a44d8cd4673",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import required libraries\n",
    "import pandas as pd\n",
    "import xml.etree.ElementTree as ET\n",
    "import json\n",
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "print(\"📚 Libraries imported successfully\")\n",
    "print(f\"📁 Current directory: {os.getcwd()}\")\n",
    "print(f\"📊 Pandas version: {pd.__version__}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1365615b-ee81-4df0-9fca-df001e9f01d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the raw ClinVar data\n",
    "try:\n",
    "    clinvar_raw = pd.read_csv(\"clinvar_coding_raw.csv\")\n",
    "    print(f\"✅ Loaded ClinVar data: {clinvar_raw.shape[0]} rows, {clinvar_raw.shape[1]} columns\")\n",
    "    print(f\"📋 Columns: {list(clinvar_raw.columns)[:10]}\")  # Show first 10 columns\n",
    "    \n",
    "except FileNotFoundError:\n",
    "    print(\"❌ Error: clinvar_coding_raw.csv not found\")\n",
    "    print(\"Please run the data extraction steps first\")\n",
    "    clinvar_raw = None\n",
    "except Exception as e:\n",
    "    print(f\"❌ Error loading data: {e}\")\n",
    "    clinvar_raw = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7144ddf2-abf7-4680-b578-d4bd4b7195ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove unnecessary columns to streamline the dataset\n",
    "# Note: Adjust column names based on your actual data structure\n",
    "\n",
    "if clinvar_raw is not None:\n",
    "    columns_to_remove = [\n",
    "        \"GENOMIC_MUTATION_ID\", \"N_SAMPLES\", \"TOTAL_SAMPLES\", \"FREQ\", \n",
    "        \"OMIM\", \"PMID\", \"AC\", \"AN\", \"AF\", \"MAF\", \"MAC\"\n",
    "    ]\n",
    "    \n",
    "    # Only remove columns that actually exist in the dataset\n",
    "    existing_columns = [col for col in columns_to_remove if col in clinvar_raw.columns]\n",
    "    missing_columns = [col for col in columns_to_remove if col not in clinvar_raw.columns]\n",
    "    \n",
    "    if existing_columns:\n",
    "        clinvar_raw = clinvar_raw.drop(columns=existing_columns)\n",
    "        print(f\"✅ Removed {len(existing_columns)} columns: {existing_columns}\")\n",
    "    \n",
    "    if missing_columns:\n",
    "        print(f\"ℹ️ Columns not found (skipped): {missing_columns}\")\n",
    "    \n",
    "    print(f\"📊 Remaining columns: {clinvar_raw.shape[1]}\")\n",
    "else:\n",
    "    print(\"⚠️ Skipping column removal - data not loaded\")"
   ]
  },
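  {
   "cell_type": "markdown",
   "id": "e05b8f27",
   "metadata": {},
   "source": [
    "The cell above reports which columns were removed and which were absent. When that report is not needed, `DataFrame.drop` can skip absent labels on its own through `errors='ignore'`. A minimal sketch on a made-up frame (the values are illustrative only):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Illustrative frame; OMIM is deliberately absent\n",
    "demo = pd.DataFrame({'CHROM': ['chr1'], 'POS': [976215], 'AF': [0.01], 'PMID': [None]})\n",
    "unwanted = ['AF', 'PMID', 'OMIM']\n",
    "\n",
    "# errors='ignore' skips labels that do not exist instead of raising KeyError\n",
    "trimmed = demo.drop(columns=unwanted, errors='ignore')\n",
    "print(list(trimmed.columns))\n",
    "```\n",
    "\n",
    "The explicit check in the cell above remains the better choice when a missing column should be noticed rather than silently skipped."
   ]
  },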
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbffd3cd-7df3-43e2-8d73-01f54e8d1da6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CHROM</th>\n",
       "      <th>POS</th>\n",
       "      <th>REF</th>\n",
       "      <th>ALT</th>\n",
       "      <th>LABEL</th>\n",
       "      <th>SOURCE</th>\n",
       "      <th>CONSEQUENCE</th>\n",
       "      <th>ID</th>\n",
       "      <th>REVIEW_STATUS</th>\n",
       "      <th>GENE</th>\n",
       "      <th>split</th>\n",
       "      <th>INT_LABEL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr1</td>\n",
       "      <td>976215</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>1320032</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>NaN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1050449</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>1284257</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>NaN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1050575</td>\n",
       "      <td>G</td>\n",
       "      <td>C</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>18241</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>NaN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1213738</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>96692</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>NaN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1232279</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>initiatior_codon_variant,missense_variant</td>\n",
       "      <td>60484</td>\n",
       "      <td>criteria_provided,_multiple_submitters,_no_con...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22249</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787412</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9747</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>NaN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22250</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787426</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9739</td>\n",
       "      <td>criteria_provided,_single_submitter</td>\n",
       "      <td>NaN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22251</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787515</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>492908</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>NaN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22252</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787551</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9754</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>NaN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22253</th>\n",
       "      <td>chrY</td>\n",
       "      <td>7063898</td>\n",
       "      <td>A</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>625467</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>NaN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>22254 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      CHROM      POS REF ALT       LABEL   SOURCE  \\\n",
       "0      chr1   976215   A   G  Pathogenic  ClinVar   \n",
       "1      chr1  1050449   G   A  Pathogenic  ClinVar   \n",
       "2      chr1  1050575   G   C  Pathogenic  ClinVar   \n",
       "3      chr1  1213738   G   A  Pathogenic  ClinVar   \n",
       "4      chr1  1232279   A   G  Pathogenic  ClinVar   \n",
       "...     ...      ...  ..  ..         ...      ...   \n",
       "22249  chrY  2787412   C   T  Pathogenic  ClinVar   \n",
       "22250  chrY  2787426   C   G  Pathogenic  ClinVar   \n",
       "22251  chrY  2787515   C   A  Pathogenic  ClinVar   \n",
       "22252  chrY  2787551   C   T  Pathogenic  ClinVar   \n",
       "22253  chrY  7063898   A   T  Pathogenic  ClinVar   \n",
       "\n",
       "                                     CONSEQUENCE       ID  \\\n",
       "0                               missense_variant  1320032   \n",
       "1                               missense_variant  1284257   \n",
       "2                               missense_variant    18241   \n",
       "3                               missense_variant    96692   \n",
       "4      initiatior_codon_variant,missense_variant    60484   \n",
       "...                                          ...      ...   \n",
       "22249                           missense_variant     9747   \n",
       "22250                           missense_variant     9739   \n",
       "22251                           missense_variant   492908   \n",
       "22252                           missense_variant     9754   \n",
       "22253                           missense_variant   625467   \n",
       "\n",
       "                                           REVIEW_STATUS  GENE  split  \\\n",
       "0                         no_assertion_criteria_provided   NaN  train   \n",
       "1                         no_assertion_criteria_provided   NaN  train   \n",
       "2                         no_assertion_criteria_provided   NaN  train   \n",
       "3                         no_assertion_criteria_provided   NaN  train   \n",
       "4      criteria_provided,_multiple_submitters,_no_con...   NaN  train   \n",
       "...                                                  ...   ...    ...   \n",
       "22249                     no_assertion_criteria_provided   NaN  train   \n",
       "22250                criteria_provided,_single_submitter   NaN  train   \n",
       "22251                     no_assertion_criteria_provided   NaN  train   \n",
       "22252                     no_assertion_criteria_provided   NaN  train   \n",
       "22253                     no_assertion_criteria_provided   NaN  train   \n",
       "\n",
       "       INT_LABEL  \n",
       "0              1  \n",
       "1              1  \n",
       "2              1  \n",
       "3              1  \n",
       "4              1  \n",
       "...          ...  \n",
       "22249          1  \n",
       "22250          1  \n",
       "22251          1  \n",
       "22252          1  \n",
       "22253          1  \n",
       "\n",
       "[22254 rows x 12 columns]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clinvar_raw\n",
    "\n",
    "# Preview the cleaned dataset\n",
    "if clinvar_raw is not None:\n",
    "    print(f\"📊 Dataset shape: {clinvar_raw.shape}\")\n",
    "    print(f\"📋 Column names: {list(clinvar_raw.columns)}\")\n",
    "    print(\"\\n🔍 First few rows:\")\n",
    "    display(clinvar_raw.head())\n",
    "    \n",
    "    # Check for any null values\n",
    "    null_counts = clinvar_raw.isnull().sum()\n",
    "    if null_counts.sum() > 0:\n",
    "        print(\"\\n⚠️ Null values found:\")\n",
    "        print(null_counts[null_counts > 0])\n",
    "else:\n",
    "    print(\"❌ No data to display\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e380634b-0c22-4d1e-8520-6fc5728e7de5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add new columns for gene information\n",
    "if clinvar_raw is not None:\n",
    "    clinvar_raw['GENE_ID'] = \"\"\n",
    "    clinvar_raw['GENE'] = \"\"\n",
    "    print(\"✅ Added GENE_ID and GENE columns\")\n",
    "    print(f\"📊 Updated dataset shape: {clinvar_raw.shape}\")\n",
    "else:\n",
    "    print(\"⚠️ Cannot add columns - data not loaded\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92b159f5-694d-4ee4-9616-1ebf00f71904",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CHROM</th>\n",
       "      <th>POS</th>\n",
       "      <th>REF</th>\n",
       "      <th>ALT</th>\n",
       "      <th>LABEL</th>\n",
       "      <th>SOURCE</th>\n",
       "      <th>CONSEQUENCE</th>\n",
       "      <th>ID</th>\n",
       "      <th>REVIEW_STATUS</th>\n",
       "      <th>GENE</th>\n",
       "      <th>split</th>\n",
       "      <th>INT_LABEL</th>\n",
       "      <th>GENE_ID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr1</td>\n",
       "      <td>976215</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>1320032</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td></td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1050449</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>1284257</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td></td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1050575</td>\n",
       "      <td>G</td>\n",
       "      <td>C</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>18241</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td></td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1213738</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>96692</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td></td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1232279</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>initiatior_codon_variant,missense_variant</td>\n",
       "      <td>60484</td>\n",
       "      <td>criteria_provided,_multiple_submitters,_no_con...</td>\n",
       "      <td></td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22249</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787412</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9747</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td></td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22250</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787426</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9739</td>\n",
       "      <td>criteria_provided,_single_submitter</td>\n",
       "      <td></td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22251</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787515</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>492908</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td></td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22252</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787551</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9754</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td></td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22253</th>\n",
       "      <td>chrY</td>\n",
       "      <td>7063898</td>\n",
       "      <td>A</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>625467</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td></td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>22254 rows × 13 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      CHROM      POS REF ALT       LABEL   SOURCE  \\\n",
       "0      chr1   976215   A   G  Pathogenic  ClinVar   \n",
       "1      chr1  1050449   G   A  Pathogenic  ClinVar   \n",
       "2      chr1  1050575   G   C  Pathogenic  ClinVar   \n",
       "3      chr1  1213738   G   A  Pathogenic  ClinVar   \n",
       "4      chr1  1232279   A   G  Pathogenic  ClinVar   \n",
       "...     ...      ...  ..  ..         ...      ...   \n",
       "22249  chrY  2787412   C   T  Pathogenic  ClinVar   \n",
       "22250  chrY  2787426   C   G  Pathogenic  ClinVar   \n",
       "22251  chrY  2787515   C   A  Pathogenic  ClinVar   \n",
       "22252  chrY  2787551   C   T  Pathogenic  ClinVar   \n",
       "22253  chrY  7063898   A   T  Pathogenic  ClinVar   \n",
       "\n",
       "                                     CONSEQUENCE       ID  \\\n",
       "0                               missense_variant  1320032   \n",
       "1                               missense_variant  1284257   \n",
       "2                               missense_variant    18241   \n",
       "3                               missense_variant    96692   \n",
       "4      initiatior_codon_variant,missense_variant    60484   \n",
       "...                                          ...      ...   \n",
       "22249                           missense_variant     9747   \n",
       "22250                           missense_variant     9739   \n",
       "22251                           missense_variant   492908   \n",
       "22252                           missense_variant     9754   \n",
       "22253                           missense_variant   625467   \n",
       "\n",
       "                                           REVIEW_STATUS GENE  split  \\\n",
       "0                         no_assertion_criteria_provided       train   \n",
       "1                         no_assertion_criteria_provided       train   \n",
       "2                         no_assertion_criteria_provided       train   \n",
       "3                         no_assertion_criteria_provided       train   \n",
       "4      criteria_provided,_multiple_submitters,_no_con...       train   \n",
       "...                                                  ...  ...    ...   \n",
       "22249                     no_assertion_criteria_provided       train   \n",
       "22250                criteria_provided,_single_submitter       train   \n",
       "22251                     no_assertion_criteria_provided       train   \n",
       "22252                     no_assertion_criteria_provided       train   \n",
       "22253                     no_assertion_criteria_provided       train   \n",
       "\n",
       "       INT_LABEL GENE_ID  \n",
       "0              1          \n",
       "1              1          \n",
       "2              1          \n",
       "3              1          \n",
       "4              1          \n",
       "...          ...     ...  \n",
       "22249          1          \n",
       "22250          1          \n",
       "22251          1          \n",
       "22252          1          \n",
       "22253          1          \n",
       "\n",
       "[22254 rows x 13 columns]"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display updated dataset with new columns\n",
    "if clinvar_raw is not None:\n",
    "    print(f\"📊 Dataset with new columns: {clinvar_raw.shape}\")\n",
    "    print(f\"📋 All columns: {list(clinvar_raw.columns)}\")\n",
    "    display(clinvar_raw.head())\n",
    "else:\n",
    "    print(\"❌ No data to display\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f36db716-392a-46a8-a404-d78165a4623c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import xml.etree.ElementTree as ET\n",
    "import os\n",
    "\n",
    "# Parse ClinVar XML files to extract gene information\n",
    "# This processes each ClinVar ID and extracts gene symbols and IDs from XML records\n",
    "\n",
    "if clinvar_raw is not None:\n",
    "    # Load list of ClinVar IDs\n",
    "    try:\n",
    "        with open(\"Clinvar_ID.txt\", \"r\") as f:\n",
    "            clinvar_ids = [line.strip() for line in f if line.strip()]\n",
    "        \n",
    "        print(f\"📋 Processing {len(clinvar_ids)} ClinVar IDs\")\n",
    "        \n",
    "        processed_count = 0\n",
    "        error_count = 0\n",
    "        \n",
    "        # Process each ClinVar ID\n",
    "        for i, clinvar_id in enumerate(clinvar_ids):\n",
    "            if i % 100 == 0:  # Progress indicator\n",
    "                print(f\"📊 Processing ID {i+1}/{len(clinvar_ids)}...\")\n",
    "            \n",
    "            try:\n",
    "                id_int = int(clinvar_id)\n",
    "                xml_path = f'data/{clinvar_id}.xml'\n",
    "                \n",
    "                # Check if XML file exists\n",
    "                if not os.path.exists(xml_path):\n",
    "                    print(f\"⚠️ XML file not found: {xml_path}\")\n",
    "                    continue\n",
    "                \n",
    "                # Parse XML file\n",
    "                with open(xml_path, 'r', encoding='utf-8') as file:\n",
    "                    tree = ET.parse(file)\n",
    "                    root = tree.getroot()\n",
    "                \n",
    "                # Check for error in XML\n",
    "                error_element = root.find(\".//error\")\n",
    "                if error_element is not None:\n",
    "                    # Remove entries with errors\n",
    "                    clinvar_raw = clinvar_raw[clinvar_raw[\"ID\"] != id_int]\n",
    "                    error_count += 1\n",
    "                    continue\n",
    "                \n",
    "                # Extract gene information\n",
    "                gene_names = []\n",
    "                gene_ids = []\n",
    "                \n",
    "                for gene in root.findall(\".//genes/gene\"):\n",
    "                    symbol = gene.findtext(\"symbol\")\n",
    "                    gene_id_data = gene.findtext(\"GeneID\")\n",
    "                    \n",
    "                    if symbol:\n",
    "                        gene_names.append(symbol)\n",
    "                    if gene_id_data:\n",
    "                        gene_ids.append(gene_id_data)\n",
    "                \n",
    "                # Join multiple entries with commas\n",
    "                gene_name_str = \", \".join(gene_names) if gene_names else \"\"\n",
    "                gene_id_str = \", \".join(gene_ids) if gene_ids else \"\"\n",
    "                \n",
    "                # Update DataFrame\n",
    "                mask = clinvar_raw[\"ID\"] == id_int\n",
    "                if mask.any():\n",
    "                    clinvar_raw.loc[mask, \"GENE\"] = gene_name_str\n",
    "                    clinvar_raw.loc[mask, \"GENE_ID\"] = gene_id_str\n",
    "                    processed_count += 1\n",
    "                \n",
    "            except ET.ParseError as e:\n",
    "                print(f\"⚠️ XML parsing error for {clinvar_id}: {e}\")\n",
    "                error_count += 1\n",
    "            except ValueError as e:\n",
    "                print(f\"⚠️ Invalid ClinVar ID {clinvar_id}: {e}\")\n",
    "                error_count += 1\n",
    "            except Exception as e:\n",
    "                print(f\"⚠️ Unexpected error processing {clinvar_id}: {e}\")\n",
    "                error_count += 1\n",
    "        \n",
    "        print(f\"\\n✅ Processing complete:\")\n",
    "        print(f\"   📊 Successfully processed: {processed_count}\")\n",
    "        print(f\"   ❌ Errors encountered: {error_count}\")\n",
    "        print(f\"   📋 Final dataset shape: {clinvar_raw.shape}\")\n",
    "        \n",
    "    except FileNotFoundError:\n",
    "        print(\"❌ Error: Clinvar_ID.txt not found\")\n",
    "        print(\"Please run the ID extraction step first\")\n",
    "    except Exception as e:\n",
    "        print(f\"❌ Error during XML processing: {e}\")\n",
    "else:\n",
    "    print(\"⚠️ Cannot process XML files - ClinVar data not loaded\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae0c9d8b-1b12-40a4-82ec-c3452e9dda90",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CHROM</th>\n",
       "      <th>POS</th>\n",
       "      <th>REF</th>\n",
       "      <th>ALT</th>\n",
       "      <th>LABEL</th>\n",
       "      <th>SOURCE</th>\n",
       "      <th>CONSEQUENCE</th>\n",
       "      <th>ID</th>\n",
       "      <th>REVIEW_STATUS</th>\n",
       "      <th>GENE</th>\n",
       "      <th>split</th>\n",
       "      <th>INT_LABEL</th>\n",
       "      <th>GENE_ID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr1</td>\n",
       "      <td>976215</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>1320032</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>PERM1</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>84808</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1050449</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>1284257</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>AGRN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>375790</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1050575</td>\n",
       "      <td>G</td>\n",
       "      <td>C</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>18241</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>AGRN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>375790</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1213738</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>96692</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>TNFRSF4</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>7293</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1232279</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>initiatior_codon_variant,missense_variant</td>\n",
       "      <td>60484</td>\n",
       "      <td>criteria_provided,_multiple_submitters,_no_con...</td>\n",
       "      <td>B3GALT6</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>126792</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22249</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787412</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9747</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>SRY</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>6736</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22250</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787426</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9739</td>\n",
       "      <td>criteria_provided,_single_submitter</td>\n",
       "      <td>SRY</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>6736</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22251</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787515</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>492908</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>SRY</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>6736</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22252</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787551</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9754</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>SRY</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>6736</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22253</th>\n",
       "      <td>chrY</td>\n",
       "      <td>7063898</td>\n",
       "      <td>A</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>625467</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>LOC126057105, TBL1Y</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>126057105, 90665</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>22150 rows × 13 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      CHROM      POS REF ALT       LABEL   SOURCE  \\\n",
       "0      chr1   976215   A   G  Pathogenic  ClinVar   \n",
       "1      chr1  1050449   G   A  Pathogenic  ClinVar   \n",
       "2      chr1  1050575   G   C  Pathogenic  ClinVar   \n",
       "3      chr1  1213738   G   A  Pathogenic  ClinVar   \n",
       "4      chr1  1232279   A   G  Pathogenic  ClinVar   \n",
       "...     ...      ...  ..  ..         ...      ...   \n",
       "22249  chrY  2787412   C   T  Pathogenic  ClinVar   \n",
       "22250  chrY  2787426   C   G  Pathogenic  ClinVar   \n",
       "22251  chrY  2787515   C   A  Pathogenic  ClinVar   \n",
       "22252  chrY  2787551   C   T  Pathogenic  ClinVar   \n",
       "22253  chrY  7063898   A   T  Pathogenic  ClinVar   \n",
       "\n",
       "                                     CONSEQUENCE       ID  \\\n",
       "0                               missense_variant  1320032   \n",
       "1                               missense_variant  1284257   \n",
       "2                               missense_variant    18241   \n",
       "3                               missense_variant    96692   \n",
       "4      initiatior_codon_variant,missense_variant    60484   \n",
       "...                                          ...      ...   \n",
       "22249                           missense_variant     9747   \n",
       "22250                           missense_variant     9739   \n",
       "22251                           missense_variant   492908   \n",
       "22252                           missense_variant     9754   \n",
       "22253                           missense_variant   625467   \n",
       "\n",
       "                                           REVIEW_STATUS                 GENE  \\\n",
       "0                         no_assertion_criteria_provided                PERM1   \n",
       "1                         no_assertion_criteria_provided                 AGRN   \n",
       "2                         no_assertion_criteria_provided                 AGRN   \n",
       "3                         no_assertion_criteria_provided              TNFRSF4   \n",
       "4      criteria_provided,_multiple_submitters,_no_con...              B3GALT6   \n",
       "...                                                  ...                  ...   \n",
       "22249                     no_assertion_criteria_provided                  SRY   \n",
       "22250                criteria_provided,_single_submitter                  SRY   \n",
       "22251                     no_assertion_criteria_provided                  SRY   \n",
       "22252                     no_assertion_criteria_provided                  SRY   \n",
       "22253                     no_assertion_criteria_provided  LOC126057105, TBL1Y   \n",
       "\n",
       "       split  INT_LABEL           GENE_ID  \n",
       "0      train          1             84808  \n",
       "1      train          1            375790  \n",
       "2      train          1            375790  \n",
       "3      train          1              7293  \n",
       "4      train          1            126792  \n",
       "...      ...        ...               ...  \n",
       "22249  train          1              6736  \n",
       "22250  train          1              6736  \n",
       "22251  train          1              6736  \n",
       "22252  train          1              6736  \n",
       "22253  train          1  126057105, 90665  \n",
       "\n",
       "[22150 rows x 13 columns]"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the dataset with extracted gene information\n",
    "if clinvar_raw is not None:\n",
    "    print(f\"📊 Dataset after gene extraction: {clinvar_raw.shape}\")\n",
    "    \n",
    "    # Show statistics\n",
    "    gene_filled = (clinvar_raw['GENE'] != '').sum()\n",
    "    gene_id_filled = (clinvar_raw['GENE_ID'] != '').sum()\n",
    "    \n",
    "    print(f\"📋 Entries with gene names: {gene_filled} ({gene_filled/len(clinvar_raw)*100:.1f}%)\")\n",
    "    print(f\"📋 Entries with gene IDs: {gene_id_filled} ({gene_id_filled/len(clinvar_raw)*100:.1f}%)\")\n",
    "    \n",
    "    # Show sample data\n",
    "    display(clinvar_raw.head(10))\n",
    "else:\n",
    "    print(\"❌ No data to display\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b76910bd-aa86-4943-a0f2-dcf9756ad81d",
   "metadata": {},
   "source": [
    "## Disease/Phenotype Information Extraction\n",
    "\n",
    "This section extracts disease and phenotype information from the ClinVar XML records. Each variant may be associated with multiple diseases, so the data is expanded to one row per variant-disease combination. The placeholder trait `not provided` is dropped whenever a variant has at least one named trait and kept only when it is the sole entry.\n",
    "\n",
    "### Adding Disease Names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54ccd972-5804-4d63-9012-5531034d2b60",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract disease/phenotype information from ClinVar XML files\n",
    "# This creates multiple rows for variants associated with multiple diseases\n",
    "\n",
    "if clinvar_raw is not None:\n",
    "    try:\n",
    "        # Load ClinVar IDs\n",
    "        with open(\"Clinvar_ID.txt\", \"r\") as f:\n",
    "            clinvar_ids = [line.strip() for line in f if line.strip()]\n",
    "        \n",
    "        print(f\"📋 Processing {len(clinvar_ids)} ClinVar IDs for disease extraction\")\n",
    "        \n",
    "        # Ensure ID column is integer type\n",
    "        clinvar_raw[\"ID\"] = clinvar_raw[\"ID\"].astype(int)\n",
    "        \n",
    "        # Create new DataFrame to store expanded data\n",
    "        clinvar_data = pd.DataFrame(columns=clinvar_raw.columns.tolist() + [\"Disease\"])\n",
    "        \n",
    "        processed_count = 0\n",
    "        disease_count = 0\n",
    "        \n",
    "        # Process each ClinVar ID\n",
    "        for i, clinvar_id in enumerate(clinvar_ids):\n",
    "            if i % 100 == 0:  # Progress indicator\n",
    "                print(f\"📊 Processing disease info {i+1}/{len(clinvar_ids)}...\")\n",
    "            \n",
    "            try:\n",
    "                id_int = int(clinvar_id)\n",
    "                xml_path = f\"data/{clinvar_id}.xml\"\n",
    "                \n",
    "                if not os.path.exists(xml_path):\n",
    "                    continue\n",
    "                \n",
    "                # Parse XML\n",
    "                tree = ET.parse(xml_path)\n",
    "                root = tree.getroot()\n",
    "                \n",
    "                # Extract all trait names (diseases/phenotypes)\n",
    "                trait_names = []\n",
    "                for trait in root.findall(\".//trait\"):\n",
    "                    trait_name = trait.findtext(\"trait_name\")\n",
    "                    if trait_name:\n",
    "                        trait_names.append(trait_name)\n",
    "                \n",
    "                # Filter out 'not provided' if other traits exist\n",
    "                filtered_traits = [t for t in trait_names if t.lower() != \"not provided\"]\n",
    "                if not filtered_traits and \"not provided\" in [t.lower() for t in trait_names]:\n",
    "                    filtered_traits = [\"not provided\"]\n",
    "                \n",
    "                # If no traits found, use empty string\n",
    "                if not filtered_traits:\n",
    "                    filtered_traits = [\"\"]\n",
    "                \n",
    "                # Create one row for each disease/trait\n",
    "                base_row = clinvar_raw[clinvar_raw[\"ID\"] == id_int]\n",
    "                if not base_row.empty:\n",
    "                    for disease_name in filtered_traits:\n",
    "                        new_row = base_row.copy()\n",
    "                        new_row[\"Disease\"] = disease_name\n",
    "                        clinvar_data = pd.concat([clinvar_data, new_row], ignore_index=True)\n",
    "                        disease_count += 1\n",
    "                    processed_count += 1\n",
    "                \n",
    "            except ET.ParseError as e:\n",
    "                print(f\"⚠️ XML parsing error for {clinvar_id}: {e}\")\n",
    "            except Exception as e:\n",
    "                print(f\"⚠️ Error processing {clinvar_id}: {e}\")\n",
    "        \n",
    "        print(f\"\\n✅ Disease extraction complete:\")\n",
    "        print(f\"   📊 Variants processed: {processed_count}\")\n",
    "        print(f\"   🔬 Total variant-disease pairs: {disease_count}\")\n",
    "        print(f\"   📋 Final dataset shape: {clinvar_data.shape}\")\n",
    "        \n",
    "        # Save intermediate results\n",
    "        clinvar_data.to_csv(\"clinvar_with_disease.csv\", sep='\\t', index=False)\n",
    "        print(\"💾 Saved results to clinvar_with_disease.csv\")\n",
    "        \n",
    "    except FileNotFoundError:\n",
    "        print(\"❌ Error: Required files not found\")\n",
    "        print(\"Please ensure Clinvar_ID.txt exists and XML files are downloaded\")\n",
    "        clinvar_data = None\n",
    "    except Exception as e:\n",
    "        print(f\"❌ Error during disease extraction: {e}\")\n",
    "        clinvar_data = None\n",
    "else:\n",
    "    print(\"⚠️ Cannot extract diseases - ClinVar data not loaded\")\n",
    "    clinvar_data = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "277445cd-72b9-44a4-a257-49cd3202e501",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CHROM</th>\n",
       "      <th>POS</th>\n",
       "      <th>REF</th>\n",
       "      <th>ALT</th>\n",
       "      <th>LABEL</th>\n",
       "      <th>SOURCE</th>\n",
       "      <th>CONSEQUENCE</th>\n",
       "      <th>ID</th>\n",
       "      <th>REVIEW_STATUS</th>\n",
       "      <th>GENE</th>\n",
       "      <th>split</th>\n",
       "      <th>INT_LABEL</th>\n",
       "      <th>GENE_ID</th>\n",
       "      <th>Disease</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr1</td>\n",
       "      <td>976215</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>1320032</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>PERM1</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>84808</td>\n",
       "      <td>Renal tubular epithelial cell apoptosis</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr1</td>\n",
       "      <td>976215</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>1320032</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>PERM1</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>84808</td>\n",
       "      <td>Neutrophil inclusion bodies</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1050449</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>1284257</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>AGRN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>375790</td>\n",
       "      <td>Congenital myasthenic syndrome 8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1050575</td>\n",
       "      <td>G</td>\n",
       "      <td>C</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>18241</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>AGRN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>375790</td>\n",
       "      <td>Congenital myasthenic syndrome 8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1213738</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>96692</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>TNFRSF4</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>7293</td>\n",
       "      <td>Combined immunodeficiency due to OX40 deficiency</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32680</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787412</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9747</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>SRY</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>6736</td>\n",
       "      <td>46,XY sex reversal 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32681</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787426</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9739</td>\n",
       "      <td>criteria_provided,_single_submitter</td>\n",
       "      <td>SRY</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>6736</td>\n",
       "      <td>not provided</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32682</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787515</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>492908</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>SRY</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>6736</td>\n",
       "      <td>46,XY sex reversal 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32683</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787551</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9754</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>SRY</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>6736</td>\n",
       "      <td>46,XY sex reversal 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32684</th>\n",
       "      <td>chrY</td>\n",
       "      <td>7063898</td>\n",
       "      <td>A</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>625467</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>LOC126057105, TBL1Y</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>126057105, 90665</td>\n",
       "      <td>Deafness, Y-linked 2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>32685 rows × 14 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      CHROM      POS REF ALT       LABEL   SOURCE       CONSEQUENCE       ID  \\\n",
       "0      chr1   976215   A   G  Pathogenic  ClinVar  missense_variant  1320032   \n",
       "1      chr1   976215   A   G  Pathogenic  ClinVar  missense_variant  1320032   \n",
       "2      chr1  1050449   G   A  Pathogenic  ClinVar  missense_variant  1284257   \n",
       "3      chr1  1050575   G   C  Pathogenic  ClinVar  missense_variant    18241   \n",
       "4      chr1  1213738   G   A  Pathogenic  ClinVar  missense_variant    96692   \n",
       "...     ...      ...  ..  ..         ...      ...               ...      ...   \n",
       "32680  chrY  2787412   C   T  Pathogenic  ClinVar  missense_variant     9747   \n",
       "32681  chrY  2787426   C   G  Pathogenic  ClinVar  missense_variant     9739   \n",
       "32682  chrY  2787515   C   A  Pathogenic  ClinVar  missense_variant   492908   \n",
       "32683  chrY  2787551   C   T  Pathogenic  ClinVar  missense_variant     9754   \n",
       "32684  chrY  7063898   A   T  Pathogenic  ClinVar  missense_variant   625467   \n",
       "\n",
       "                             REVIEW_STATUS                 GENE  split  \\\n",
       "0           no_assertion_criteria_provided                PERM1  train   \n",
       "1           no_assertion_criteria_provided                PERM1  train   \n",
       "2           no_assertion_criteria_provided                 AGRN  train   \n",
       "3           no_assertion_criteria_provided                 AGRN  train   \n",
       "4           no_assertion_criteria_provided              TNFRSF4  train   \n",
       "...                                    ...                  ...    ...   \n",
       "32680       no_assertion_criteria_provided                  SRY  train   \n",
       "32681  criteria_provided,_single_submitter                  SRY  train   \n",
       "32682       no_assertion_criteria_provided                  SRY  train   \n",
       "32683       no_assertion_criteria_provided                  SRY  train   \n",
       "32684       no_assertion_criteria_provided  LOC126057105, TBL1Y  train   \n",
       "\n",
       "      INT_LABEL           GENE_ID  \\\n",
       "0             1             84808   \n",
       "1             1             84808   \n",
       "2             1            375790   \n",
       "3             1            375790   \n",
       "4             1              7293   \n",
       "...         ...               ...   \n",
       "32680         1              6736   \n",
       "32681         1              6736   \n",
       "32682         1              6736   \n",
       "32683         1              6736   \n",
       "32684         1  126057105, 90665   \n",
       "\n",
       "                                                Disease  \n",
       "0               Renal tubular epithelial cell apoptosis  \n",
       "1                           Neutrophil inclusion bodies  \n",
       "2                      Congenital myasthenic syndrome 8  \n",
       "3                      Congenital myasthenic syndrome 8  \n",
       "4      Combined immunodeficiency due to OX40 deficiency  \n",
       "...                                                 ...  \n",
       "32680                              46,XY sex reversal 1  \n",
       "32681                                      not provided  \n",
       "32682                              46,XY sex reversal 1  \n",
       "32683                              46,XY sex reversal 1  \n",
       "32684                              Deafness, Y-linked 2  \n",
       "\n",
       "[32685 rows x 14 columns]"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clinvar_data\n",
    "\n",
    "# Display the dataset with disease information\n",
    "if 'clinvar_data' in locals() and clinvar_data is not None:\n",
    "    print(f\"📊 Dataset with diseases: {clinvar_data.shape}\")\n",
    "    \n",
    "    # Show disease statistics\n",
    "    disease_counts = clinvar_data['Disease'].value_counts()\n",
    "    print(f\"\\n🔬 Disease distribution (top 10):\")\n",
    "    print(disease_counts.head(10))\n",
    "    \n",
    "    # Show sample data\n",
    "    print(\"\\n🔍 Sample data:\")\n",
    "    display(clinvar_data.head())\n",
    "else:\n",
    "    print(\"❌ No disease data to display\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c6b1c6dc-33ed-4f57-a385-29816f4c9984",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "np.int64(2749)"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count entries with 'not provided' disease information\n",
    "if 'clinvar_data' in locals() and clinvar_data is not None:\n",
    "    not_provided_count = (clinvar_data[\"Disease\"] == \"not provided\").sum()\n",
    "    total_count = len(clinvar_data)\n",
    "    \n",
    "    print(f\"📊 Entries with 'not provided' disease: {not_provided_count}\")\n",
    "    print(f\"📊 Total entries: {total_count}\")\n",
    "    print(f\"📊 Percentage: {not_provided_count/total_count*100:.1f}%\")\n",
    "else:\n",
    "    print(\"❌ Cannot calculate statistics - data not available\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a7513ee-96b2-4c7d-8678-0195eb826aa5",
   "metadata": {},
   "source": [
    "## Gene ID to Gene Name Mapping\n",
    "\n",
    "This section converts gene IDs to human-readable gene names using NCBI Entrez utilities.\n",
    "\n",
    "**Prerequisites**: NCBI Entrez Direct tools must be installed:\n",
    "- macOS: `brew install brewsci/bio/edirect`\n",
    "- Linux: Follow NCBI EDirect installation guide\n",
    "\n",
    "The process:\n",
    "1. Extract unique gene IDs from the dataset\n",
    "2. Use `esummary` to fetch gene descriptions from NCBI\n",
    "3. Create a mapping dictionary\n",
    "4. Apply the mapping to add gene names to the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee0d3632-d11e-4429-bb50-5eb9ba55d424",
   "metadata": {},
   "outputs": [],
   "source": [
    "#!/usr/bin/env python3\n",
    "\n",
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "# Extract unique gene IDs and create mapping file\n",
    "# This prepares the gene ID list for NCBI lookup\n",
    "\n",
    "if 'clinvar_data' in locals() and clinvar_data is not None:\n",
    "    # Extract all unique gene IDs\n",
    "    all_gene_ids = set()\n",
    "    \n",
    "    for gene_id_str in clinvar_data['GENE_ID'].dropna():\n",
    "        if gene_id_str.strip():  # Skip empty strings\n",
    "            # Split comma-separated IDs\n",
    "            ids = [gid.strip() for gid in gene_id_str.split(',') if gid.strip()]\n",
    "            all_gene_ids.update(ids)\n",
    "    \n",
    "    # Save unique gene IDs to file\n",
    "    with open(\"gene_id.txt\", 'w') as f:\n",
    "        for gene_id in sorted(all_gene_ids):\n",
    "            f.write(f\"{gene_id}\\n\")\n",
    "    \n",
    "    print(f\"✅ Extracted {len(all_gene_ids)} unique gene IDs to gene_id.txt\")\n",
    "    \n",
    "    # Create the shell script for NCBI lookup\n",
    "    script_content = '''#!/bin/bash\n",
    "\n",
    "input_file=\"gene_id.txt\"\n",
    "output_file=\"gene_id_to_name.json\"\n",
    "\n",
    "# Check if input file exists\n",
    "if [ ! -f \"$input_file\" ]; then\n",
    "    echo \"❌ Error: $input_file not found\"\n",
    "    exit 1\n",
    "fi\n",
    "\n",
    "# Check if EDirect tools are available\n",
    "if ! command -v esummary &> /dev/null; then\n",
    "    echo \"❌ Error: NCBI EDirect tools not found\"\n",
    "    echo \"Please install: brew install brewsci/bio/edirect (macOS)\"\n",
    "    exit 1\n",
    "fi\n",
    "\n",
    "echo \"🚀 Starting gene ID to name mapping...\"\n",
    "\n",
    "# Start JSON object\n",
    "echo \"{\" > \"$output_file\"\n",
    "\n",
    "first_entry=true\n",
    "total_lines=$(wc -l < \"$input_file\")\n",
    "current_line=0\n",
    "\n",
    "while IFS= read -r gene_id; do\n",
    "    # Skip empty lines\n",
    "    [[ -z \"$gene_id\" ]] && continue\n",
    "    \n",
    "    current_line=$((current_line + 1))\n",
    "    \n",
    "    # Progress indicator\n",
    "    if (( current_line % 50 == 0 )); then\n",
    "        echo \"📊 Processing $current_line/$total_lines gene IDs...\"\n",
    "    fi\n",
    "    \n",
    "    # Fetch gene description using Entrez Direct\n",
    "    description=$(esummary -db gene -id \"$gene_id\" 2>/dev/null | xtract -pattern DocumentSummary -element Description)\n",
    "    \n",
    "    # Handle empty description\n",
    "    if [ -z \"$description\" ]; then\n",
    "        description=\"Unknown\"\n",
    "    fi\n",
    "    \n",
    "    # JSON escape quotes and other special characters\n",
    "    description=$(printf '%s' \"$description\" | sed 's/\"/\\\\\"/g')\n",
    "    \n",
    "    # Add comma if not the first entry\n",
    "    if [ \"$first_entry\" = true ]; then\n",
    "        first_entry=false\n",
    "    else\n",
    "        echo \",\" >> \"$output_file\"\n",
    "    fi\n",
    "    \n",
    "    # Append key-value pair\n",
    "    echo \"  \\\"$gene_id\\\": \\\"$description\\\"\" >> \"$output_file\"\n",
    "    \n",
    "done < \"$input_file\"\n",
    "\n",
    "# Close JSON object\n",
    "echo \"\" >> \"$output_file\"\n",
    "echo \"}\" >> \"$output_file\"\n",
    "\n",
    "echo \"✅ Gene ID to name mapping completed\"\n",
    "echo \"💾 Results saved to $output_file\"\n",
    "'''\n",
    "    \n",
    "    # Write the script\n",
    "    with open(\"gene_mapping.sh\", 'w') as f:\n",
    "        f.write(script_content)\n",
    "    \n",
    "    # Make executable\n",
    "    os.chmod(\"gene_mapping.sh\", 0o755)\n",
    "    \n",
    "    print(\"✅ Created gene_mapping.sh script\")\n",
    "    print(\"\\n🚀 To run the gene mapping:\")\n",
    "    print(\"   ./gene_mapping.sh\")\n",
    "    print(\"\\n⚠️ Note: This requires NCBI EDirect tools to be installed\")\n",
    "    \n",
    "else:\n",
    "    print(\"⚠️ Cannot create gene mapping - data not available\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1957ef57-1af8-46a1-8d1b-147f6b423619",
   "metadata": {},
   "source": [
    "## Apply Gene Name Mapping\n",
    "\n",
    "Load the gene ID to name mapping and apply it to the dataset to add human-readable gene names.\n",
    "\n",
    "Read json and add it to the clinvar_data df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b39be718-c0ae-4aae-b1d8-d0c872947ec2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "# Load gene ID to name mapping and apply to dataset\n",
    "\n",
    "if 'clinvar_data' in locals() and clinvar_data is not None:\n",
    "    try:\n",
    "        # Load gene ID → name mapping\n",
    "        with open(\"gene_id_to_name.json\", \"r\") as f:\n",
    "            gene_id_dict = json.load(f)\n",
    "        \n",
    "        print(f\"✅ Loaded mapping for {len(gene_id_dict)} gene IDs\")\n",
    "        \n",
    "        # Function to convert gene IDs to gene names\n",
    "        def get_gene_names(gene_id_str):\n",
    "            if pd.isna(gene_id_str) or not gene_id_str.strip():\n",
    "                return \"\"\n",
    "            \n",
    "            gene_ids = [gid.strip() for gid in gene_id_str.split(\",\") if gid.strip()]\n",
    "            gene_names = []\n",
    "            \n",
    "            for gid in gene_ids:\n",
    "                gene_name = gene_id_dict.get(gid, f\"Unknown_ID_{gid}\")\n",
    "                gene_names.append(gene_name)\n",
    "            \n",
    "            return \" | \".join(gene_names)\n",
    "        \n",
    "        # Apply mapping to create gene names column\n",
    "        print(\"📊 Applying gene name mapping...\")\n",
    "        clinvar_data[\"GENE_Name\"] = clinvar_data[\"GENE_ID\"].apply(get_gene_names)\n",
    "        \n",
    "        # Statistics\n",
    "        mapped_count = (clinvar_data[\"GENE_Name\"] != \"\").sum()\n",
    "        print(f\"✅ Gene names mapped for {mapped_count} entries ({mapped_count/len(clinvar_data)*100:.1f}%)\")\n",
    "        \n",
    "        # Show sample mappings\n",
    "        sample_data = clinvar_data[clinvar_data[\"GENE_Name\"] != \"\"][[\"GENE_ID\", \"GENE_Name\"]].head()\n",
    "        if not sample_data.empty:\n",
    "            print(\"\\n🔍 Sample gene ID to name mappings:\")\n",
    "            for _, row in sample_data.iterrows():\n",
    "                print(f\"   {row['GENE_ID']} → {row['GENE_Name'][:100]}{'...' if len(row['GENE_Name']) > 100 else ''}\")\n",
    "        \n",
    "    except FileNotFoundError:\n",
    "        print(\"❌ Error: gene_id_to_name.json not found\")\n",
    "        print(\"Please run the gene mapping script first: ./gene_mapping.sh\")\n",
    "        # Create empty column as fallback\n",
    "        clinvar_data[\"GENE_Name\"] = \"\"\n",
    "    except json.JSONDecodeError as e:\n",
    "        print(f\"❌ Error parsing JSON mapping file: {e}\")\n",
    "        clinvar_data[\"GENE_Name\"] = \"\"\n",
    "    except Exception as e:\n",
    "        print(f\"❌ Error applying gene mapping: {e}\")\n",
    "        clinvar_data[\"GENE_Name\"] = \"\"\n",
    "else:\n",
    "    print(\"⚠️ Cannot apply gene mapping - data not available\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b7a44c2-7823-47c1-b268-22a1815ffd09",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CHROM</th>\n",
       "      <th>POS</th>\n",
       "      <th>REF</th>\n",
       "      <th>ALT</th>\n",
       "      <th>LABEL</th>\n",
       "      <th>SOURCE</th>\n",
       "      <th>CONSEQUENCE</th>\n",
       "      <th>ID</th>\n",
       "      <th>REVIEW_STATUS</th>\n",
       "      <th>GENE</th>\n",
       "      <th>split</th>\n",
       "      <th>INT_LABEL</th>\n",
       "      <th>GENE_ID</th>\n",
       "      <th>Disease</th>\n",
       "      <th>GENE_Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr1</td>\n",
       "      <td>976215</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>1320032</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>PERM1</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>84808</td>\n",
       "      <td>Renal tubular epithelial cell apoptosis</td>\n",
       "      <td>PPARGC1 and ESRR induced regulator, muscle 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr1</td>\n",
       "      <td>976215</td>\n",
       "      <td>A</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>1320032</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>PERM1</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>84808</td>\n",
       "      <td>Neutrophil inclusion bodies</td>\n",
       "      <td>PPARGC1 and ESRR induced regulator, muscle 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1050449</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>1284257</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>AGRN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>375790</td>\n",
       "      <td>Congenital myasthenic syndrome 8</td>\n",
       "      <td>agrin</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1050575</td>\n",
       "      <td>G</td>\n",
       "      <td>C</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>18241</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>AGRN</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>375790</td>\n",
       "      <td>Congenital myasthenic syndrome 8</td>\n",
       "      <td>agrin</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr1</td>\n",
       "      <td>1213738</td>\n",
       "      <td>G</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>96692</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>TNFRSF4</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>7293</td>\n",
       "      <td>Combined immunodeficiency due to OX40 deficiency</td>\n",
       "      <td>TNF receptor superfamily member 4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32680</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787412</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9747</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>SRY</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>6736</td>\n",
       "      <td>46,XY sex reversal 1</td>\n",
       "      <td>sex determining region Y</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32681</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787426</td>\n",
       "      <td>C</td>\n",
       "      <td>G</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9739</td>\n",
       "      <td>criteria_provided,_single_submitter</td>\n",
       "      <td>SRY</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>6736</td>\n",
       "      <td>not provided</td>\n",
       "      <td>sex determining region Y</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32682</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787515</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>492908</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>SRY</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>6736</td>\n",
       "      <td>46,XY sex reversal 1</td>\n",
       "      <td>sex determining region Y</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32683</th>\n",
       "      <td>chrY</td>\n",
       "      <td>2787551</td>\n",
       "      <td>C</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>9754</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>SRY</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>6736</td>\n",
       "      <td>46,XY sex reversal 1</td>\n",
       "      <td>sex determining region Y</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32684</th>\n",
       "      <td>chrY</td>\n",
       "      <td>7063898</td>\n",
       "      <td>A</td>\n",
       "      <td>T</td>\n",
       "      <td>Pathogenic</td>\n",
       "      <td>ClinVar</td>\n",
       "      <td>missense_variant</td>\n",
       "      <td>625467</td>\n",
       "      <td>no_assertion_criteria_provided</td>\n",
       "      <td>LOC126057105, TBL1Y</td>\n",
       "      <td>train</td>\n",
       "      <td>1</td>\n",
       "      <td>126057105, 90665</td>\n",
       "      <td>Deafness, Y-linked 2</td>\n",
       "      <td>P300/CBP strongly-dependent group 1 enhancer G...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>32685 rows × 15 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      CHROM      POS REF ALT       LABEL   SOURCE       CONSEQUENCE       ID  \\\n",
       "0      chr1   976215   A   G  Pathogenic  ClinVar  missense_variant  1320032   \n",
       "1      chr1   976215   A   G  Pathogenic  ClinVar  missense_variant  1320032   \n",
       "2      chr1  1050449   G   A  Pathogenic  ClinVar  missense_variant  1284257   \n",
       "3      chr1  1050575   G   C  Pathogenic  ClinVar  missense_variant    18241   \n",
       "4      chr1  1213738   G   A  Pathogenic  ClinVar  missense_variant    96692   \n",
       "...     ...      ...  ..  ..         ...      ...               ...      ...   \n",
       "32680  chrY  2787412   C   T  Pathogenic  ClinVar  missense_variant     9747   \n",
       "32681  chrY  2787426   C   G  Pathogenic  ClinVar  missense_variant     9739   \n",
       "32682  chrY  2787515   C   A  Pathogenic  ClinVar  missense_variant   492908   \n",
       "32683  chrY  2787551   C   T  Pathogenic  ClinVar  missense_variant     9754   \n",
       "32684  chrY  7063898   A   T  Pathogenic  ClinVar  missense_variant   625467   \n",
       "\n",
       "                             REVIEW_STATUS                 GENE  split  \\\n",
       "0           no_assertion_criteria_provided                PERM1  train   \n",
       "1           no_assertion_criteria_provided                PERM1  train   \n",
       "2           no_assertion_criteria_provided                 AGRN  train   \n",
       "3           no_assertion_criteria_provided                 AGRN  train   \n",
       "4           no_assertion_criteria_provided              TNFRSF4  train   \n",
       "...                                    ...                  ...    ...   \n",
       "32680       no_assertion_criteria_provided                  SRY  train   \n",
       "32681  criteria_provided,_single_submitter                  SRY  train   \n",
       "32682       no_assertion_criteria_provided                  SRY  train   \n",
       "32683       no_assertion_criteria_provided                  SRY  train   \n",
       "32684       no_assertion_criteria_provided  LOC126057105, TBL1Y  train   \n",
       "\n",
       "      INT_LABEL           GENE_ID  \\\n",
       "0             1             84808   \n",
       "1             1             84808   \n",
       "2             1            375790   \n",
       "3             1            375790   \n",
       "4             1              7293   \n",
       "...         ...               ...   \n",
       "32680         1              6736   \n",
       "32681         1              6736   \n",
       "32682         1              6736   \n",
       "32683         1              6736   \n",
       "32684         1  126057105, 90665   \n",
       "\n",
       "                                                Disease  \\\n",
       "0               Renal tubular epithelial cell apoptosis   \n",
       "1                           Neutrophil inclusion bodies   \n",
       "2                      Congenital myasthenic syndrome 8   \n",
       "3                      Congenital myasthenic syndrome 8   \n",
       "4      Combined immunodeficiency due to OX40 deficiency   \n",
       "...                                                 ...   \n",
       "32680                              46,XY sex reversal 1   \n",
       "32681                                      not provided   \n",
       "32682                              46,XY sex reversal 1   \n",
       "32683                              46,XY sex reversal 1   \n",
       "32684                              Deafness, Y-linked 2   \n",
       "\n",
       "                                               GENE_Name  \n",
       "0           PPARGC1 and ESRR induced regulator, muscle 1  \n",
       "1           PPARGC1 and ESRR induced regulator, muscle 1  \n",
       "2                                                  agrin  \n",
       "3                                                  agrin  \n",
       "4                      TNF receptor superfamily member 4  \n",
       "...                                                  ...  \n",
       "32680                           sex determining region Y  \n",
       "32681                           sex determining region Y  \n",
       "32682                           sex determining region Y  \n",
       "32683                           sex determining region Y  \n",
       "32684  P300/CBP strongly-dependent group 1 enhancer G...  \n",
       "\n",
       "[32685 rows x 15 columns]"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display final dataset with all extracted information\n",
    "if 'clinvar_data' in locals() and clinvar_data is not None:\n",
    "    print(f\"📊 Final dataset shape: {clinvar_data.shape}\")\n",
    "    print(f\"📋 Columns: {list(clinvar_data.columns)}\")\n",
    "    \n",
    "    # Data completeness statistics\n",
    "    print(\"\\n📈 Data Completeness:\")\n",
    "    for col in ['GENE', 'GENE_ID', 'GENE_Name', 'Disease']:\n",
    "        if col in clinvar_data.columns:\n",
    "            filled_count = (clinvar_data[col] != '').sum()\n",
    "            print(f\"   {col}: {filled_count}/{len(clinvar_data)} ({filled_count/len(clinvar_data)*100:.1f}%)\")\n",
    "    \n",
    "    # Sample data\n",
    "    print(\"\\n🔍 Sample data:\")\n",
    "    display(clinvar_data.head())\n",
    "    \n",
    "    # Memory usage\n",
    "    memory_mb = clinvar_data.memory_usage(deep=True).sum() / 1024 / 1024\n",
    "    print(f\"\\n💾 Dataset memory usage: {memory_mb:.1f} MB\")\n",
    "else:\n",
    "    print(\"❌ No final data to display\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c545ae83-5cd1-4e29-87fd-69389bdb153f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'P300/CBP strongly-dependent group 1 enhancer GRCh37_chrY:6931456-6932655| transducin beta like 1 Y-linked'"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Show example of gene name mapping\n",
    "if 'clinvar_data' in locals() and clinvar_data is not None and len(clinvar_data) > 32684:\n",
    "    example_gene_name = clinvar_data.iloc[32684]['GENE_Name']\n",
    "    example_gene_id = clinvar_data.iloc[32684]['GENE_ID']\n",
    "    \n",
    "    print(f\"🔍 Example gene mapping for row 32684:\")\n",
    "    print(f\"   Gene ID: {example_gene_id}\")\n",
    "    print(f\"   Gene Name: {example_gene_name}\")\n",
    "else:\n",
    "    # Show any available example\n",
    "    if 'clinvar_data' in locals() and clinvar_data is not None and not clinvar_data.empty:\n",
    "        # Find first row with gene name data\n",
    "        example_row = clinvar_data[clinvar_data['GENE_Name'] != ''].iloc[0] if (clinvar_data['GENE_Name'] != '').any() else clinvar_data.iloc[0]\n",
    "        \n",
    "        print(f\"🔍 Example gene mapping:\")\n",
    "        print(f\"   Gene ID: {example_row.get('GENE_ID', 'N/A')}\")\n",
    "        print(f\"   Gene Name: {example_row.get('GENE_Name', 'N/A')}\")\n",
    "    else:\n",
    "        print(\"❌ No data available for example\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a214c29d-a4f1-4af6-a914-e6b4a14a1c49",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Save the final processed dataset\n",
    "if 'clinvar_data' in locals() and clinvar_data is not None:\n",
    "    output_file = \"clinvar_with_disease.csv\"\n",
    "    \n",
    "    try:\n",
    "        clinvar_data.to_csv(output_file, index=False)\n",
    "        \n",
    "        print(f\"✅ Final dataset saved to {output_file}\")\n",
    "        print(f\"📊 Saved {len(clinvar_data)} records with {len(clinvar_data.columns)} columns\")\n",
    "        \n",
    "        # File size\n",
    "        file_size = os.path.getsize(output_file) / 1024 / 1024\n",
    "        print(f\"💾 File size: {file_size:.1f} MB\")\n",
    "        \n",
    "        # Summary of what was accomplished\n",
    "        print(\"\\n🎯 Processing Summary:\")\n",
    "        print(f\"   ✓ Extracted ClinVar coding variants\")\n",
    "        print(f\"   ✓ Parsed XML records for gene information\")\n",
    "        print(f\"   ✓ Mapped diseases/phenotypes\")\n",
    "        print(f\"   ✓ Added human-readable gene names\")\n",
    "        print(f\"   ✓ Created comprehensive dataset\")\n",
    "        \n",
    "    except Exception as e:\n",
    "        print(f\"❌ Error saving dataset: {e}\")\n",
    "else:\n",
    "    print(\"⚠️ No data available to save\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6c4c1f4-4b87-4624-8f8a-c568e40b2e63",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import shutil\n",
    "\n",
    "# Optional: Clean up temporary XML data directory\n",
    "# Uncomment the following lines if you want to remove the XML files to save space\n",
    "\n",
    "if os.path.exists(\"data\") and os.path.isdir(\"data\"):\n",
    "    # Count files before cleanup\n",
    "    xml_files = [f for f in os.listdir(\"data\") if f.endswith('.xml')]\n",
    "    \n",
    "    print(f\"🗂️ Found {len(xml_files)} XML files in data directory\")\n",
    "    \n",
    "    # Uncomment to actually remove the directory\n",
    "    # shutil.rmtree(\"data\")\n",
    "    # print(\"🗑️ Removed temporary XML data directory\")\n",
    "    \n",
    "    print(\"ℹ️ XML files preserved. Uncomment the cleanup code to remove them.\")\n",
    "else:\n",
    "    print(\"ℹ️ No XML data directory found to clean up\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c08beea6-6ff7-4900-a8b8-8a719db36189",
   "metadata": {},
   "outputs": [],
   "source": [
    "## Processing Complete ✅\n",
    "\n",
    "The ClinVar coding variants have been successfully processed with the following enhancements:\n",
    "\n",
    "### Generated Files:\n",
    "- `clinvar_coding_raw.csv` - Raw ClinVar entries extracted from VEP data\n",
    "- `Clinvar_ID.txt` - List of ClinVar IDs for processing\n",
    "- `gene_id.txt` - Unique gene IDs for name mapping\n",
    "- `gene_id_to_name.json` - Gene ID to name mapping dictionary\n",
    "- `clinvar_with_disease.csv` - **Final comprehensive dataset**\n",
    "\n",
    "### Dataset Features:\n",
    "- **Variant Information**: Genomic coordinates, alleles, and annotations\n",
    "- **Gene Data**: Symbols, IDs, and human-readable names\n",
    "- **Disease/Phenotype**: Associated conditions and clinical significance\n",
    "- **Expanded Format**: One row per variant-disease combination\n",
    "\n",
    "### Next Steps:\n",
    "1. **Quality Control**: Review the data for completeness and accuracy\n",
    "2. **Analysis**: Use the dataset for downstream genetic analysis\n",
    "3. **Integration**: Combine with other datasets as needed\n",
    "4. **Documentation**: Update metadata and create data dictionary\n",
    "\n",
    "### File Cleanup:\n",
    "- XML files in `data/` directory can be removed to save space\n",
    "- Intermediate files can be archived or removed as needed"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}