{ "cells": [ { "cell_type": "markdown", "id": "83c9cd1f", "metadata": {}, "source": [ "## Setup and Data Preparation\n", "\n", "Initial setup steps to prepare the working environment and extract ClinVar data." ] }, { "cell_type": "markdown", "id": "81a36253-9050-4d58-96cd-8238aae51e0e", "metadata": {}, "source": [ "# ClinVar Coding Variants Data Processing\n", "\n", "This notebook enriches ClinVar coding variants with gene names, gene IDs, and associated diseases extracted from ClinVar XML records.\n", "\n", "## Overview\n", "\n", "The workflow includes:\n", "1. **Data Extraction**: Filter ClinVar entries from VEP-annotated pathogenic coding variants\n", "2. **XML Processing**: Parse ClinVar XML records to extract gene and disease information\n", "3. **Gene Annotation**: Map gene IDs to gene names using NCBI Entrez utilities\n", "4. **Data Integration**: Combine all information into a comprehensive dataset\n", "\n", "## Requirements\n", "\n", "- Python 3.7+\n", "- pandas library\n", "- xml.etree.ElementTree (built-in)\n", "- NCBI Entrez Direct tools (for gene name mapping)\n", "- Input data: VEP-annotated pathogenic coding variants CSV file\n", "\n", "## Data Structure\n", "\n", "The processing creates a dataset with the following key columns:\n", "- Variant information (chromosome, position, alleles)\n", "- ClinVar ID and significance\n", "- Gene symbols and IDs\n", "- Associated disease/phenotype information" ] }, { "cell_type": "code", "execution_count": null, "id": "cb351234-50a3-4061-81ce-bdce5343e790", "metadata": {}, "outputs": [], "source": [ "# Create working directory for ClinVar data processing\n", "import os\n", "os.makedirs('clinvar', exist_ok=True)\n", "print(\"✅ Created 'clinvar' directory\")" ] }, { "cell_type": "code", "execution_count": null, "id": "443ccab8-50a1-45ae-950c-8425eb318e93", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# Navigate to clinvar directory\n", 
"os.chdir('clinvar')\n", "print(f\"šŸ“ Current working directory: {os.getcwd()}\")\n", "\n", "with open('vep_pathogenic_coding.csv') as infile, open('clinvar_coding_raw.csv', 'w') as outfile:\n", " for line in infile:\n", " if 'ClinVar' in line:\n", " outfile.write(line)" ] }, { "cell_type": "code", "execution_count": null, "id": "e1f92675-b85c-4baa-8680-9c3776e04ac9", "metadata": {}, "outputs": [], "source": [ "# Extract ClinVar entries from VEP-annotated pathogenic coding variants\n", "# Note: Update the input file path to match your data location\n", "input_file = \"../data/vep_pathogenic_coding.csv\" # Adjust path as needed\n", "output_file = \"clinvar_coding_raw.csv\"\n", "\n", "# Use shell command to filter ClinVar entries\n", "import subprocess\n", "try:\n", " result = subprocess.run(\n", " [\"grep\", \"ClinVar\", input_file],\n", " capture_output=True,\n", " text=True,\n", " check=True\n", " )\n", " \n", " with open(output_file, 'w') as f:\n", " f.write(result.stdout)\n", " \n", " print(f\"āœ… Extracted ClinVar entries to {output_file}\")\n", " print(f\"šŸ“Š Found {len(result.stdout.strip().split('\\n'))} ClinVar entries\")\n", " \n", "except subprocess.CalledProcessError:\n", " print(f\"āŒ Error: Could not find ClinVar entries in {input_file}\")\n", " print(\"Please ensure the input file exists and contains ClinVar annotations\")\n", "except FileNotFoundError:\n", " print(f\"āŒ Error: Input file {input_file} not found\")\n", " print(\"Please update the input_file path to point to your VEP-annotated data\")" ] }, { "cell_type": "code", "execution_count": null, "id": "7e560308-135b-4189-9146-ff50845839a4", "metadata": {}, "outputs": [], "source": [ "# Extract ClinVar IDs from the filtered data (assuming ID is in column 8)\n", "# Note: Adjust column number if your data structure is different\n", "import pandas as pd\n", "\n", "try:\n", " # Read the raw ClinVar data to determine structure\n", " df_temp = pd.read_csv(\"clinvar_coding_raw.csv\")\n", " 
print(f\"📋 Data shape: {df_temp.shape}\")\n", "    print(f\"📋 Columns: {list(df_temp.columns)}\")\n", "\n", "    # Extract ClinVar IDs (adjust column index as needed)\n", "    # Column 8 corresponds to index 7 in Python (0-based)\n", "    if df_temp.shape[1] >= 8:\n", "        clinvar_ids = df_temp.iloc[:, 7]  # 8th column (0-based index 7)\n", "\n", "        # Save IDs to file\n", "        with open(\"Clinvar_ID.txt\", 'w') as f:\n", "            for id_val in clinvar_ids:\n", "                if pd.notna(id_val):\n", "                    f.write(f\"{id_val}\\n\")\n", "\n", "        print(f\"✅ Extracted {len(clinvar_ids.dropna())} ClinVar IDs to Clinvar_ID.txt\")\n", "    else:\n", "        print(f\"❌ Error: Expected at least 8 columns, found {df_temp.shape[1]}\")\n", "\n", "except FileNotFoundError:\n", "    print(\"❌ Error: clinvar_coding_raw.csv not found\")\n", "    print(\"Please run the previous cell first to extract ClinVar data\")\n", "except Exception as e:\n", "    print(f\"❌ Error processing ClinVar data: {e}\")" ] }, { "cell_type": "markdown", "id": "53b0dfd8-8d49-4c3f-adb4-4c6bfbffcfa9", "metadata": {}, "source": [ "## XML Data Retrieval\n", "\n", "**Note**: This step requires creating a shell script (`Clinvar_esearch.sh`) to fetch XML data from NCBI and making it executable (`chmod +x Clinvar_esearch.sh`).\n", "\n", "The script should:\n", "1. Read ClinVar IDs from `Clinvar_ID.txt`\n", "2. Use NCBI Entrez Direct tools to fetch XML records\n", "3. Save XML files in a `data/` subdirectory\n", "\n", "Example script content:\n", "```bash\n", "#!/bin/bash\n", "mkdir -p data\n", "while read -r id; do\n", "    esearch -db clinvar -query \"$id\" | efetch -format xml > \"data/${id}.xml\"\n", "    echo \"Downloaded XML for ClinVar ID: $id\"\n", "done < Clinvar_ID.txt\n", "```\n", "\n", "**Prerequisites**: Install NCBI Entrez Direct tools:\n", "- macOS: `brew install brewsci/bio/edirect`\n", "- Linux: Follow NCBI EDirect installation guide" ] }, { "cell_type": "code", "execution_count": null, "id": "0755ad6d", "metadata": {}, "outputs": [], "source": [ "# Parsing XML for Gene and Disease\n", "\n", "# Make the ClinVar search script executable and run it\n", "# Note: This assumes you have created the Clinvar_esearch.sh script\n", "\n", "import os\n", "import subprocess\n", "\n", "script_path = \"Clinvar_esearch.sh\"\n", "\n", "if os.path.exists(script_path):\n", "    # Make script executable\n", "    os.chmod(script_path, 0o755)\n", "    print(f\"✅ Made {script_path} executable\")\n", "\n", "    # Optionally run the script (uncomment if you want to execute automatically)\n", "    # print(\"🚀 Running ClinVar XML download script...\")\n", "    # result = subprocess.run([f\"./{script_path}\"], capture_output=True, text=True)\n", "    # if result.returncode == 0:\n", "    #     print(\"✅ XML download completed successfully\")\n", "    # else:\n", "    #     print(f\"❌ Script execution failed: {result.stderr}\")\n", "else:\n", "    print(f\"⚠️ Warning: {script_path} not found\")\n", "    print(\"Please create this script manually to download ClinVar XML data\")\n", "    print(\"See the documentation in the previous cell for script template\")" ] }, { "cell_type": "code", "execution_count": null, "id": "d21a188b-a0dc-4af2-9b71-5a44d8cd4673", "metadata": {}, "outputs": [], "source": [ "# Import required libraries\n", "import pandas as pd\n", "import xml.etree.ElementTree as ET\n", "import json\n", "import os\n", "from pathlib import Path\n", "\n", "print(\"📚 
Libraries imported successfully\")\n", "print(f\"📁 Current directory: {os.getcwd()}\")\n", "print(f\"📊 Pandas version: {pd.__version__}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "1365615b-ee81-4df0-9fca-df001e9f01d4", "metadata": {}, "outputs": [], "source": [ "# Load the raw ClinVar data\n", "try:\n", "    clinvar_raw = pd.read_csv(\"clinvar_coding_raw.csv\")\n", "    print(f\"✅ Loaded ClinVar data: {clinvar_raw.shape[0]} rows, {clinvar_raw.shape[1]} columns\")\n", "    print(f\"📋 Columns: {list(clinvar_raw.columns)[:10]}\")  # Show first 10 columns\n", "\n", "except FileNotFoundError:\n", "    print(\"❌ Error: clinvar_coding_raw.csv not found\")\n", "    print(\"Please run the data extraction steps first\")\n", "    clinvar_raw = None\n", "except Exception as e:\n", "    print(f\"❌ Error loading data: {e}\")\n", "    clinvar_raw = None" ] }, { "cell_type": "code", "execution_count": null, "id": "7144ddf2-abf7-4680-b578-d4bd4b7195ea", "metadata": {}, "outputs": [], "source": [ "# Remove unnecessary columns to streamline the dataset\n", "# Note: Adjust column names based on your actual data structure\n", "\n", "if clinvar_raw is not None:\n", "    columns_to_remove = [\n", "        \"GENOMIC_MUTATION_ID\", \"N_SAMPLES\", \"TOTAL_SAMPLES\", \"FREQ\",\n", "        \"OMIM\", \"PMID\", \"AC\", \"AN\", \"AF\", \"MAF\", \"MAC\"\n", "    ]\n", "\n", "    # Only remove columns that actually exist in the dataset\n", "    existing_columns = [col for col in columns_to_remove if col in clinvar_raw.columns]\n", "    missing_columns = [col for col in columns_to_remove if col not in clinvar_raw.columns]\n", "\n", "    if existing_columns:\n", "        clinvar_raw = clinvar_raw.drop(columns=existing_columns)\n", "        print(f\"✅ Removed {len(existing_columns)} columns: {existing_columns}\")\n", "\n", "    if missing_columns:\n", "        print(f\"ℹ️ Columns not found (skipped): {missing_columns}\")\n", "\n", "    print(f\"📊 Remaining columns: {clinvar_raw.shape[1]}\")\n", "else:\n", "    print(\"⚠️ Skipping 
column removal - data not loaded\")" ] }, { "cell_type": "code", "execution_count": null, "id": "fbffd3cd-7df3-43e2-8d73-01f54e8d1da6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CHROMPOSREFALTLABELSOURCECONSEQUENCEIDREVIEW_STATUSGENEsplitINT_LABEL
0chr1976215AGPathogenicClinVarmissense_variant1320032no_assertion_criteria_providedNaNtrain1
1chr11050449GAPathogenicClinVarmissense_variant1284257no_assertion_criteria_providedNaNtrain1
2chr11050575GCPathogenicClinVarmissense_variant18241no_assertion_criteria_providedNaNtrain1
3chr11213738GAPathogenicClinVarmissense_variant96692no_assertion_criteria_providedNaNtrain1
4chr11232279AGPathogenicClinVarinitiatior_codon_variant,missense_variant60484criteria_provided,_multiple_submitters,_no_con...NaNtrain1
.......................................
22249chrY2787412CTPathogenicClinVarmissense_variant9747no_assertion_criteria_providedNaNtrain1
22250chrY2787426CGPathogenicClinVarmissense_variant9739criteria_provided,_single_submitterNaNtrain1
22251chrY2787515CAPathogenicClinVarmissense_variant492908no_assertion_criteria_providedNaNtrain1
22252chrY2787551CTPathogenicClinVarmissense_variant9754no_assertion_criteria_providedNaNtrain1
22253chrY7063898ATPathogenicClinVarmissense_variant625467no_assertion_criteria_providedNaNtrain1
\n", "

22254 rows × 12 columns

\n", "
" ], "text/plain": [ " CHROM POS REF ALT LABEL SOURCE \\\n", "0 chr1 976215 A G Pathogenic ClinVar \n", "1 chr1 1050449 G A Pathogenic ClinVar \n", "2 chr1 1050575 G C Pathogenic ClinVar \n", "3 chr1 1213738 G A Pathogenic ClinVar \n", "4 chr1 1232279 A G Pathogenic ClinVar \n", "... ... ... .. .. ... ... \n", "22249 chrY 2787412 C T Pathogenic ClinVar \n", "22250 chrY 2787426 C G Pathogenic ClinVar \n", "22251 chrY 2787515 C A Pathogenic ClinVar \n", "22252 chrY 2787551 C T Pathogenic ClinVar \n", "22253 chrY 7063898 A T Pathogenic ClinVar \n", "\n", " CONSEQUENCE ID \\\n", "0 missense_variant 1320032 \n", "1 missense_variant 1284257 \n", "2 missense_variant 18241 \n", "3 missense_variant 96692 \n", "4 initiatior_codon_variant,missense_variant 60484 \n", "... ... ... \n", "22249 missense_variant 9747 \n", "22250 missense_variant 9739 \n", "22251 missense_variant 492908 \n", "22252 missense_variant 9754 \n", "22253 missense_variant 625467 \n", "\n", " REVIEW_STATUS GENE split \\\n", "0 no_assertion_criteria_provided NaN train \n", "1 no_assertion_criteria_provided NaN train \n", "2 no_assertion_criteria_provided NaN train \n", "3 no_assertion_criteria_provided NaN train \n", "4 criteria_provided,_multiple_submitters,_no_con... NaN train \n", "... ... ... ... \n", "22249 no_assertion_criteria_provided NaN train \n", "22250 criteria_provided,_single_submitter NaN train \n", "22251 no_assertion_criteria_provided NaN train \n", "22252 no_assertion_criteria_provided NaN train \n", "22253 no_assertion_criteria_provided NaN train \n", "\n", " INT_LABEL \n", "0 1 \n", "1 1 \n", "2 1 \n", "3 1 \n", "4 1 \n", "... ... 
\n", "22249 1 \n", "22250 1 \n", "22251 1 \n", "22252 1 \n", "22253 1 \n", "\n", "[22254 rows x 12 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clinvar_raw\n", "\n", "# Preview the cleaned dataset\n", "if clinvar_raw is not None:\n", "    print(f\"📊 Dataset shape: {clinvar_raw.shape}\")\n", "    print(f\"📋 Column names: {list(clinvar_raw.columns)}\")\n", "    print(\"\\n🔍 First few rows:\")\n", "    display(clinvar_raw.head())\n", "\n", "    # Check for any null values\n", "    null_counts = clinvar_raw.isnull().sum()\n", "    if null_counts.sum() > 0:\n", "        print(\"\\n⚠️ Null values found:\")\n", "        print(null_counts[null_counts > 0])\n", "else:\n", "    print(\"❌ No data to display\")" ] }, { "cell_type": "code", "execution_count": null, "id": "e380634b-0c22-4d1e-8520-6fc5728e7de5", "metadata": {}, "outputs": [], "source": [ "# Add new columns for gene information\n", "if clinvar_raw is not None:\n", "    clinvar_raw['GENE_ID'] = \"\"\n", "    clinvar_raw['GENE'] = \"\"\n", "    print(\"✅ Added GENE_ID and GENE columns\")\n", "    print(f\"📊 Updated dataset shape: {clinvar_raw.shape}\")\n", "else:\n", "    print(\"⚠️ Cannot add columns - data not loaded\")" ] }, { "cell_type": "code", "execution_count": null, "id": "92b159f5-694d-4ee4-9616-1ebf00f71904", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CHROMPOSREFALTLABELSOURCECONSEQUENCEIDREVIEW_STATUSGENEsplitINT_LABELGENE_ID
0chr1976215AGPathogenicClinVarmissense_variant1320032no_assertion_criteria_providedtrain1
1chr11050449GAPathogenicClinVarmissense_variant1284257no_assertion_criteria_providedtrain1
2chr11050575GCPathogenicClinVarmissense_variant18241no_assertion_criteria_providedtrain1
3chr11213738GAPathogenicClinVarmissense_variant96692no_assertion_criteria_providedtrain1
4chr11232279AGPathogenicClinVarinitiatior_codon_variant,missense_variant60484criteria_provided,_multiple_submitters,_no_con...train1
..........................................
22249chrY2787412CTPathogenicClinVarmissense_variant9747no_assertion_criteria_providedtrain1
22250chrY2787426CGPathogenicClinVarmissense_variant9739criteria_provided,_single_submittertrain1
22251chrY2787515CAPathogenicClinVarmissense_variant492908no_assertion_criteria_providedtrain1
22252chrY2787551CTPathogenicClinVarmissense_variant9754no_assertion_criteria_providedtrain1
22253chrY7063898ATPathogenicClinVarmissense_variant625467no_assertion_criteria_providedtrain1
\n", "

22254 rows × 13 columns

\n", "
" ], "text/plain": [ " CHROM POS REF ALT LABEL SOURCE \\\n", "0 chr1 976215 A G Pathogenic ClinVar \n", "1 chr1 1050449 G A Pathogenic ClinVar \n", "2 chr1 1050575 G C Pathogenic ClinVar \n", "3 chr1 1213738 G A Pathogenic ClinVar \n", "4 chr1 1232279 A G Pathogenic ClinVar \n", "... ... ... .. .. ... ... \n", "22249 chrY 2787412 C T Pathogenic ClinVar \n", "22250 chrY 2787426 C G Pathogenic ClinVar \n", "22251 chrY 2787515 C A Pathogenic ClinVar \n", "22252 chrY 2787551 C T Pathogenic ClinVar \n", "22253 chrY 7063898 A T Pathogenic ClinVar \n", "\n", " CONSEQUENCE ID \\\n", "0 missense_variant 1320032 \n", "1 missense_variant 1284257 \n", "2 missense_variant 18241 \n", "3 missense_variant 96692 \n", "4 initiatior_codon_variant,missense_variant 60484 \n", "... ... ... \n", "22249 missense_variant 9747 \n", "22250 missense_variant 9739 \n", "22251 missense_variant 492908 \n", "22252 missense_variant 9754 \n", "22253 missense_variant 625467 \n", "\n", " REVIEW_STATUS GENE split \\\n", "0 no_assertion_criteria_provided train \n", "1 no_assertion_criteria_provided train \n", "2 no_assertion_criteria_provided train \n", "3 no_assertion_criteria_provided train \n", "4 criteria_provided,_multiple_submitters,_no_con... train \n", "... ... ... ... \n", "22249 no_assertion_criteria_provided train \n", "22250 criteria_provided,_single_submitter train \n", "22251 no_assertion_criteria_provided train \n", "22252 no_assertion_criteria_provided train \n", "22253 no_assertion_criteria_provided train \n", "\n", " INT_LABEL GENE_ID \n", "0 1 \n", "1 1 \n", "2 1 \n", "3 1 \n", "4 1 \n", "... ... ... 
\n", "22249 1 \n", "22250 1 \n", "22251 1 \n", "22252 1 \n", "22253 1 \n", "\n", "[22254 rows x 13 columns]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clinvar_raw\n", "\n", "# Display updated dataset with new columns\n", "if clinvar_raw is not None:\n", "    print(f\"📊 Dataset with new columns: {clinvar_raw.shape}\")\n", "    print(f\"📋 All columns: {list(clinvar_raw.columns)}\")\n", "    display(clinvar_raw.head())\n", "else:\n", "    print(\"❌ No data to display\")" ] }, { "cell_type": "code", "execution_count": null, "id": "f36db716-392a-46a8-a404-d78165a4623c", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import xml.etree.ElementTree as ET\n", "import os\n", "\n", "# Parse ClinVar XML files to extract gene information\n", "# This processes each ClinVar ID and extracts gene symbols and IDs from XML records\n", "\n", "if clinvar_raw is not None:\n", "    # Load list of ClinVar IDs\n", "    try:\n", "        with open(\"Clinvar_ID.txt\", \"r\") as f:\n", "            clinvar_ids = [line.strip() for line in f if line.strip()]\n", "\n", "        print(f\"📋 Processing {len(clinvar_ids)} ClinVar IDs\")\n", "\n", "        processed_count = 0\n", "        error_count = 0\n", "\n", "        # Process each ClinVar ID\n", "        for i, clinvar_id in enumerate(clinvar_ids):\n", "            if i % 100 == 0:  # Progress indicator\n", "                print(f\"📊 Processing ID {i+1}/{len(clinvar_ids)}...\")\n", "\n", "            try:\n", "                id_int = int(clinvar_id)\n", "                xml_path = f'data/{clinvar_id}.xml'\n", "\n", "                # Check if XML file exists\n", "                if not os.path.exists(xml_path):\n", "                    print(f\"⚠️ XML file not found: {xml_path}\")\n", "                    continue\n", "\n", "                # Parse XML file\n", "                with open(xml_path, 'r', encoding='utf-8') as file:\n", "                    tree = ET.parse(file)\n", "                    root = tree.getroot()\n", "\n", "                # Check for error in XML\n", "                error_element = root.find(\".//error\")\n", "                if error_element is not None:\n", "                    # Remove entries with errors\n", "                    clinvar_raw = 
clinvar_raw[clinvar_raw[\"ID\"] != id_int]\n", "                    error_count += 1\n", "                    continue\n", "\n", "                # Extract gene information\n", "                gene_names = []\n", "                gene_ids = []\n", "\n", "                for gene in root.findall(\".//genes/gene\"):\n", "                    symbol = gene.findtext(\"symbol\")\n", "                    gene_id_data = gene.findtext(\"GeneID\")\n", "\n", "                    if symbol:\n", "                        gene_names.append(symbol)\n", "                    if gene_id_data:\n", "                        gene_ids.append(gene_id_data)\n", "\n", "                # Join multiple entries with commas\n", "                gene_name_str = \", \".join(gene_names) if gene_names else \"\"\n", "                gene_id_str = \", \".join(gene_ids) if gene_ids else \"\"\n", "\n", "                # Update DataFrame\n", "                mask = clinvar_raw[\"ID\"] == id_int\n", "                if mask.any():\n", "                    clinvar_raw.loc[mask, \"GENE\"] = gene_name_str\n", "                    clinvar_raw.loc[mask, \"GENE_ID\"] = gene_id_str\n", "                    processed_count += 1\n", "\n", "            except ET.ParseError as e:\n", "                print(f\"⚠️ XML parsing error for {clinvar_id}: {e}\")\n", "                error_count += 1\n", "            except ValueError as e:\n", "                print(f\"⚠️ Invalid ClinVar ID {clinvar_id}: {e}\")\n", "                error_count += 1\n", "            except Exception as e:\n", "                print(f\"⚠️ Unexpected error processing {clinvar_id}: {e}\")\n", "                error_count += 1\n", "\n", "        print(f\"\\n✅ Processing complete:\")\n", "        print(f\" 📊 Successfully processed: {processed_count}\")\n", "        print(f\" ❌ Errors encountered: {error_count}\")\n", "        print(f\" 📋 Final dataset shape: {clinvar_raw.shape}\")\n", "\n", "    except FileNotFoundError:\n", "        print(\"❌ Error: Clinvar_ID.txt not found\")\n", "        print(\"Please run the ID extraction step first\")\n", "    except Exception as e:\n", "        print(f\"❌ Error during XML processing: {e}\")\n", "else:\n", "    print(\"⚠️ Cannot process XML files - ClinVar data not loaded\")" ] }, { "cell_type": "code", "execution_count": null, "id": "ae0c9d8b-1b12-40a4-82ec-c3452e9dda90", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CHROMPOSREFALTLABELSOURCECONSEQUENCEIDREVIEW_STATUSGENEsplitINT_LABELGENE_ID
0chr1976215AGPathogenicClinVarmissense_variant1320032no_assertion_criteria_providedPERM1train184808
1chr11050449GAPathogenicClinVarmissense_variant1284257no_assertion_criteria_providedAGRNtrain1375790
2chr11050575GCPathogenicClinVarmissense_variant18241no_assertion_criteria_providedAGRNtrain1375790
3chr11213738GAPathogenicClinVarmissense_variant96692no_assertion_criteria_providedTNFRSF4train17293
4chr11232279AGPathogenicClinVarinitiatior_codon_variant,missense_variant60484criteria_provided,_multiple_submitters,_no_con...B3GALT6train1126792
..........................................
22249chrY2787412CTPathogenicClinVarmissense_variant9747no_assertion_criteria_providedSRYtrain16736
22250chrY2787426CGPathogenicClinVarmissense_variant9739criteria_provided,_single_submitterSRYtrain16736
22251chrY2787515CAPathogenicClinVarmissense_variant492908no_assertion_criteria_providedSRYtrain16736
22252chrY2787551CTPathogenicClinVarmissense_variant9754no_assertion_criteria_providedSRYtrain16736
22253chrY7063898ATPathogenicClinVarmissense_variant625467no_assertion_criteria_providedLOC126057105, TBL1Ytrain1126057105, 90665
\n", "

22150 rows × 13 columns

\n", "
" ], "text/plain": [ " CHROM POS REF ALT LABEL SOURCE \\\n", "0 chr1 976215 A G Pathogenic ClinVar \n", "1 chr1 1050449 G A Pathogenic ClinVar \n", "2 chr1 1050575 G C Pathogenic ClinVar \n", "3 chr1 1213738 G A Pathogenic ClinVar \n", "4 chr1 1232279 A G Pathogenic ClinVar \n", "... ... ... .. .. ... ... \n", "22249 chrY 2787412 C T Pathogenic ClinVar \n", "22250 chrY 2787426 C G Pathogenic ClinVar \n", "22251 chrY 2787515 C A Pathogenic ClinVar \n", "22252 chrY 2787551 C T Pathogenic ClinVar \n", "22253 chrY 7063898 A T Pathogenic ClinVar \n", "\n", " CONSEQUENCE ID \\\n", "0 missense_variant 1320032 \n", "1 missense_variant 1284257 \n", "2 missense_variant 18241 \n", "3 missense_variant 96692 \n", "4 initiatior_codon_variant,missense_variant 60484 \n", "... ... ... \n", "22249 missense_variant 9747 \n", "22250 missense_variant 9739 \n", "22251 missense_variant 492908 \n", "22252 missense_variant 9754 \n", "22253 missense_variant 625467 \n", "\n", " REVIEW_STATUS GENE \\\n", "0 no_assertion_criteria_provided PERM1 \n", "1 no_assertion_criteria_provided AGRN \n", "2 no_assertion_criteria_provided AGRN \n", "3 no_assertion_criteria_provided TNFRSF4 \n", "4 criteria_provided,_multiple_submitters,_no_con... B3GALT6 \n", "... ... ... \n", "22249 no_assertion_criteria_provided SRY \n", "22250 criteria_provided,_single_submitter SRY \n", "22251 no_assertion_criteria_provided SRY \n", "22252 no_assertion_criteria_provided SRY \n", "22253 no_assertion_criteria_provided LOC126057105, TBL1Y \n", "\n", " split INT_LABEL GENE_ID \n", "0 train 1 84808 \n", "1 train 1 375790 \n", "2 train 1 375790 \n", "3 train 1 7293 \n", "4 train 1 126792 \n", "... ... ... ... 
\n", "22249 train 1 6736 \n", "22250 train 1 6736 \n", "22251 train 1 6736 \n", "22252 train 1 6736 \n", "22253 train 1 126057105, 90665 \n", "\n", "[22150 rows x 13 columns]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clinvar_raw\n", "\n", "# Display the dataset with extracted gene information\n", "if clinvar_raw is not None:\n", "    print(f\"📊 Dataset after gene extraction: {clinvar_raw.shape}\")\n", "\n", "    # Show statistics\n", "    gene_filled = (clinvar_raw['GENE'] != '').sum()\n", "    gene_id_filled = (clinvar_raw['GENE_ID'] != '').sum()\n", "\n", "    print(f\"📋 Entries with gene names: {gene_filled} ({gene_filled/len(clinvar_raw)*100:.1f}%)\")\n", "    print(f\"📋 Entries with gene IDs: {gene_id_filled} ({gene_id_filled/len(clinvar_raw)*100:.1f}%)\")\n", "\n", "    # Show sample data\n", "    display(clinvar_raw.head(10))\n", "else:\n", "    print(\"❌ No data to display\")" ] }, { "cell_type": "markdown", "id": "b76910bd-aa86-4943-a0f2-dcf9756ad81d", "metadata": {}, "source": [ "## Disease/Phenotype Information Extraction\n", "\n", "This section extracts disease and phenotype information from the ClinVar XML records. 
Each variant may be associated with multiple diseases, so the data is expanded to create one row per variant-disease combination.\n", "\n", "### Adding the Disease Name" ] }, { "cell_type": "code", "execution_count": null, "id": "54ccd972-5804-4d63-9012-5531034d2b60", "metadata": {}, "outputs": [], "source": [ "# Extract disease/phenotype information from ClinVar XML files\n", "# This creates multiple rows for variants associated with multiple diseases\n", "\n", "if clinvar_raw is not None:\n", "    try:\n", "        # Load ClinVar IDs\n", "        with open(\"Clinvar_ID.txt\", \"r\") as f:\n", "            clinvar_ids = [line.strip() for line in f if line.strip()]\n", "\n", "        print(f\"📋 Processing {len(clinvar_ids)} ClinVar IDs for disease extraction\")\n", "\n", "        # Ensure ID column is integer type\n", "        clinvar_raw[\"ID\"] = clinvar_raw[\"ID\"].astype(int)\n", "\n", "        # Create new DataFrame to store expanded data\n", "        clinvar_data = pd.DataFrame(columns=clinvar_raw.columns.tolist() + [\"Disease\"])\n", "\n", "        processed_count = 0\n", "        disease_count = 0\n", "\n", "        # Process each ClinVar ID\n", "        for i, clinvar_id in enumerate(clinvar_ids):\n", "            if i % 100 == 0:  # Progress indicator\n", "                print(f\"📊 Processing disease info {i+1}/{len(clinvar_ids)}...\")\n", "\n", "            try:\n", "                id_int = int(clinvar_id)\n", "                xml_path = f\"data/{clinvar_id}.xml\"\n", "\n", "                if not os.path.exists(xml_path):\n", "                    continue\n", "\n", "                # Parse XML\n", "                tree = ET.parse(xml_path)\n", "                root = tree.getroot()\n", "\n", "                # Extract all trait names (diseases/phenotypes)\n", "                trait_names = []\n", "                for trait in root.findall(\".//trait\"):\n", "                    trait_name = trait.findtext(\"trait_name\")\n", "                    if trait_name:\n", "                        trait_names.append(trait_name)\n", "\n", "                # Filter out 'not provided' if other traits exist\n", "                filtered_traits = [t for t in trait_names if t.lower() != \"not provided\"]\n", "                if not filtered_traits and \"not provided\" in [t.lower() for t in trait_names]:\n", "                    filtered_traits = 
[\"not provided\"]\n", "\n", "                # If no traits found, use empty string\n", "                if not filtered_traits:\n", "                    filtered_traits = [\"\"]\n", "\n", "                # Create one row for each disease/trait\n", "                base_row = clinvar_raw[clinvar_raw[\"ID\"] == id_int]\n", "                if not base_row.empty:\n", "                    for disease_name in filtered_traits:\n", "                        new_row = base_row.copy()\n", "                        new_row[\"Disease\"] = disease_name\n", "                        clinvar_data = pd.concat([clinvar_data, new_row], ignore_index=True)\n", "                        disease_count += 1\n", "                    processed_count += 1\n", "\n", "            except ET.ParseError as e:\n", "                print(f\"⚠️ XML parsing error for {clinvar_id}: {e}\")\n", "            except Exception as e:\n", "                print(f\"⚠️ Error processing {clinvar_id}: {e}\")\n", "\n", "        print(f\"\\n✅ Disease extraction complete:\")\n", "        print(f\" 📊 Variants processed: {processed_count}\")\n", "        print(f\" 🔬 Total variant-disease pairs: {disease_count}\")\n", "        print(f\" 📋 Final dataset shape: {clinvar_data.shape}\")\n", "\n", "        # Save intermediate results\n", "        clinvar_data.to_csv(\"clinvar_with_disease.csv\", sep='\\t', index=False)\n", "        print(\"💾 Saved results to clinvar_with_disease.csv\")\n", "\n", "    except FileNotFoundError:\n", "        print(\"❌ Error: Required files not found\")\n", "        print(\"Please ensure Clinvar_ID.txt exists and XML files are downloaded\")\n", "        clinvar_data = None\n", "    except Exception as e:\n", "        print(f\"❌ Error during disease extraction: {e}\")\n", "        clinvar_data = None\n", "else:\n", "    print(\"⚠️ Cannot extract diseases - ClinVar data not loaded\")\n", "    clinvar_data = None" ] }, { "cell_type": "code", "execution_count": null, "id": "277445cd-72b9-44a4-a257-49cd3202e501", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>CHROM</th><th>POS</th><th>REF</th><th>ALT</th><th>LABEL</th><th>SOURCE</th><th>CONSEQUENCE</th><th>ID</th><th>REVIEW_STATUS</th><th>GENE</th><th>split</th><th>INT_LABEL</th><th>GENE_ID</th><th>Disease</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>chr1</td><td>976215</td><td>A</td><td>G</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>1320032</td><td>no_assertion_criteria_provided</td><td>PERM1</td><td>train</td><td>1</td><td>84808</td><td>Renal tubular epithelial cell apoptosis</td></tr>\n",
"    <tr><th>1</th><td>chr1</td><td>976215</td><td>A</td><td>G</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>1320032</td><td>no_assertion_criteria_provided</td><td>PERM1</td><td>train</td><td>1</td><td>84808</td><td>Neutrophil inclusion bodies</td></tr>\n",
"    <tr><th>2</th><td>chr1</td><td>1050449</td><td>G</td><td>A</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>1284257</td><td>no_assertion_criteria_provided</td><td>AGRN</td><td>train</td><td>1</td><td>375790</td><td>Congenital myasthenic syndrome 8</td></tr>\n",
"    <tr><th>3</th><td>chr1</td><td>1050575</td><td>G</td><td>C</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>18241</td><td>no_assertion_criteria_provided</td><td>AGRN</td><td>train</td><td>1</td><td>375790</td><td>Congenital myasthenic syndrome 8</td></tr>\n",
"    <tr><th>4</th><td>chr1</td><td>1213738</td><td>G</td><td>A</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>96692</td><td>no_assertion_criteria_provided</td><td>TNFRSF4</td><td>train</td><td>1</td><td>7293</td><td>Combined immunodeficiency due to OX40 deficiency</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>32680</th><td>chrY</td><td>2787412</td><td>C</td><td>T</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>9747</td><td>no_assertion_criteria_provided</td><td>SRY</td><td>train</td><td>1</td><td>6736</td><td>46,XY sex reversal 1</td></tr>\n",
"    <tr><th>32681</th><td>chrY</td><td>2787426</td><td>C</td><td>G</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>9739</td><td>criteria_provided,_single_submitter</td><td>SRY</td><td>train</td><td>1</td><td>6736</td><td>not provided</td></tr>\n",
"    <tr><th>32682</th><td>chrY</td><td>2787515</td><td>C</td><td>A</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>492908</td><td>no_assertion_criteria_provided</td><td>SRY</td><td>train</td><td>1</td><td>6736</td><td>46,XY sex reversal 1</td></tr>\n",
"    <tr><th>32683</th><td>chrY</td><td>2787551</td><td>C</td><td>T</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>9754</td><td>no_assertion_criteria_provided</td><td>SRY</td><td>train</td><td>1</td><td>6736</td><td>46,XY sex reversal 1</td></tr>\n",
"    <tr><th>32684</th><td>chrY</td><td>7063898</td><td>A</td><td>T</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>625467</td><td>no_assertion_criteria_provided</td><td>LOC126057105, TBL1Y</td><td>train</td><td>1</td><td>126057105, 90665</td><td>Deafness, Y-linked 2</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>32685 rows × 14 columns</p>\n",
"</div>
" ], "text/plain": [ " CHROM POS REF ALT LABEL SOURCE CONSEQUENCE ID \\\n", "0 chr1 976215 A G Pathogenic ClinVar missense_variant 1320032 \n", "1 chr1 976215 A G Pathogenic ClinVar missense_variant 1320032 \n", "2 chr1 1050449 G A Pathogenic ClinVar missense_variant 1284257 \n", "3 chr1 1050575 G C Pathogenic ClinVar missense_variant 18241 \n", "4 chr1 1213738 G A Pathogenic ClinVar missense_variant 96692 \n", "... ... ... .. .. ... ... ... ... \n", "32680 chrY 2787412 C T Pathogenic ClinVar missense_variant 9747 \n", "32681 chrY 2787426 C G Pathogenic ClinVar missense_variant 9739 \n", "32682 chrY 2787515 C A Pathogenic ClinVar missense_variant 492908 \n", "32683 chrY 2787551 C T Pathogenic ClinVar missense_variant 9754 \n", "32684 chrY 7063898 A T Pathogenic ClinVar missense_variant 625467 \n", "\n", " REVIEW_STATUS GENE split \\\n", "0 no_assertion_criteria_provided PERM1 train \n", "1 no_assertion_criteria_provided PERM1 train \n", "2 no_assertion_criteria_provided AGRN train \n", "3 no_assertion_criteria_provided AGRN train \n", "4 no_assertion_criteria_provided TNFRSF4 train \n", "... ... ... ... \n", "32680 no_assertion_criteria_provided SRY train \n", "32681 criteria_provided,_single_submitter SRY train \n", "32682 no_assertion_criteria_provided SRY train \n", "32683 no_assertion_criteria_provided SRY train \n", "32684 no_assertion_criteria_provided LOC126057105, TBL1Y train \n", "\n", " INT_LABEL GENE_ID \\\n", "0 1 84808 \n", "1 1 84808 \n", "2 1 375790 \n", "3 1 375790 \n", "4 1 7293 \n", "... ... ... \n", "32680 1 6736 \n", "32681 1 6736 \n", "32682 1 6736 \n", "32683 1 6736 \n", "32684 1 126057105, 90665 \n", "\n", " Disease \n", "0 Renal tubular epithelial cell apoptosis \n", "1 Neutrophil inclusion bodies \n", "2 Congenital myasthenic syndrome 8 \n", "3 Congenital myasthenic syndrome 8 \n", "4 Combined immunodeficiency due to OX40 deficiency \n", "... ... 
\n", "32680 46,XY sex reversal 1 \n", "32681 not provided \n", "32682 46,XY sex reversal 1 \n", "32683 46,XY sex reversal 1 \n", "32684 Deafness, Y-linked 2 \n", "\n", "[32685 rows x 14 columns]" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display the dataset with disease information\n", "if 'clinvar_data' in locals() and clinvar_data is not None:\n", "    print(f\"📊 Dataset with diseases: {clinvar_data.shape}\")\n", "\n", "    # Show disease statistics\n", "    disease_counts = clinvar_data['Disease'].value_counts()\n", "    print(f\"\\n🔬 Disease distribution (top 10):\")\n", "    print(disease_counts.head(10))\n", "\n", "    # Show sample data\n", "    print(\"\\n🔍 Sample data:\")\n", "    display(clinvar_data.head())\n", "else:\n", "    print(\"❌ No disease data to display\")" ] }, { "cell_type": "code", "execution_count": null, "id": "c6b1c6dc-33ed-4f57-a385-29816f4c9984", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.int64(2749)" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count entries with 'not provided' disease information\n", "if 'clinvar_data' in locals() and clinvar_data is not None:\n", "    not_provided_count = (clinvar_data[\"Disease\"] == \"not provided\").sum()\n", "    total_count = len(clinvar_data)\n", "\n", "    print(f\"📊 Entries with 'not provided' disease: {not_provided_count}\")\n", "    print(f\"📊 Total entries: {total_count}\")\n", "    print(f\"📊 Percentage: {not_provided_count/total_count*100:.1f}%\")\n", "else:\n", "    print(\"❌ Cannot calculate statistics - data not available\")" ] }, { "cell_type": "markdown", "id": "8a7513ee-96b2-4c7d-8678-0195eb826aa5", "metadata": {}, "source": [ "## Gene ID to Gene Name Mapping\n", "\n", "This section converts gene IDs to human-readable gene names using NCBI Entrez utilities.\n", "\n", "**Prerequisites**: NCBI Entrez Direct tools must be installed:\n", "- macOS: `brew install 
brewsci/bio/edirect`\n", "- Linux: follow the NCBI EDirect installation guide\n", "\n", "The process:\n", "1. Extract unique gene IDs from the dataset\n", "2. Use `esummary` to fetch gene descriptions from NCBI\n", "3. Create a mapping dictionary\n", "4. Apply the mapping to add gene names to the dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "ee0d3632-d11e-4429-bb50-5eb9ba55d424", "metadata": {}, "outputs": [], "source": [ "#!/usr/bin/env python3\n", "\n", "import os\n", "import pandas as pd\n", "\n", "# Extract unique gene IDs and create mapping file\n", "# This prepares the gene ID list for NCBI lookup\n", "\n", "if 'clinvar_data' in locals() and clinvar_data is not None:\n", "    # Extract all unique gene IDs\n", "    all_gene_ids = set()\n", "\n", "    for gene_id_str in clinvar_data['GENE_ID'].dropna():\n", "        if gene_id_str.strip():  # Skip empty strings\n", "            # Split comma-separated IDs\n", "            ids = [gid.strip() for gid in gene_id_str.split(',') if gid.strip()]\n", "            all_gene_ids.update(ids)\n", "\n", "    # Save unique gene IDs to file\n", "    with open(\"gene_id.txt\", 'w') as f:\n", "        for gene_id in sorted(all_gene_ids):\n", "            f.write(f\"{gene_id}\\n\")\n", "\n", "    print(f\"✅ Extracted {len(all_gene_ids)} unique gene IDs to gene_id.txt\")\n", "\n", "    # Create the shell script for NCBI lookup\n", "    script_content = '''#!/bin/bash\n", "\n", "input_file=\"gene_id.txt\"\n", "output_file=\"gene_id_to_name.json\"\n", "\n", "# Check if input file exists\n", "if [ ! -f \"$input_file\" ]; then\n", "    echo \"❌ Error: $input_file not found\"\n", "    exit 1\n", "fi\n", "\n", "# Check if EDirect tools are available\n", "if ! 
command -v esummary &> /dev/null; then\n", "    echo \"❌ Error: NCBI EDirect tools not found\"\n", "    echo \"Please install: brew install brewsci/bio/edirect (macOS)\"\n", "    exit 1\n", "fi\n", "\n", "echo \"🚀 Starting gene ID to name mapping...\"\n", "\n", "# Start JSON object\n", "echo \"{\" > \"$output_file\"\n", "\n", "first_entry=true\n", "total_lines=$(wc -l < \"$input_file\")\n", "current_line=0\n", "\n", "while IFS= read -r gene_id; do\n", "    # Skip empty lines\n", "    [[ -z \"$gene_id\" ]] && continue\n", "\n", "    current_line=$((current_line + 1))\n", "\n", "    # Progress indicator\n", "    if (( current_line % 50 == 0 )); then\n", "        echo \"📊 Processing $current_line/$total_lines gene IDs...\"\n", "    fi\n", "\n", "    # Fetch gene description using Entrez Direct\n", "    # (stdin is redirected so esummary cannot consume the loop's input file)\n", "    description=$(esummary -db gene -id \"$gene_id\" < /dev/null 2>/dev/null | xtract -pattern DocumentSummary -element Description)\n", "\n", "    # Handle empty description\n", "    if [ -z \"$description\" ]; then\n", "        description=\"Unknown\"\n", "    fi\n", "\n", "    # Escape double quotes for JSON\n", "    # (backslashes are doubled because this script is embedded in a non-raw Python string)\n", "    description=$(printf '%s' \"$description\" | sed 's/\"/\\\\\\\\\"/g')\n", "\n", "    # Add comma if not the first entry\n", "    if [ \"$first_entry\" = true ]; then\n", "        first_entry=false\n", "    else\n", "        echo \",\" >> \"$output_file\"\n", "    fi\n", "\n", "    # Append key-value pair\n", "    echo \"  \\\\\"$gene_id\\\\\": \\\\\"$description\\\\\"\" >> \"$output_file\"\n", "\n", "done < \"$input_file\"\n", "\n", "# Close JSON object\n", "echo \"\" >> \"$output_file\"\n", "echo \"}\" >> \"$output_file\"\n", "\n", "echo \"✅ Gene ID to name mapping completed\"\n", "echo \"💾 Results saved to $output_file\"\n", "'''\n", "\n", "    # Write the script\n", "    with open(\"gene_mapping.sh\", 'w') as f:\n", "        f.write(script_content)\n", "\n", "    # Make executable\n", "    os.chmod(\"gene_mapping.sh\", 0o755)\n", "\n", "    print(\"✅ Created gene_mapping.sh script\")\n", "    print(\"\\n🚀 To run the gene mapping:\")\n", "    print(\" 
./gene_mapping.sh\")\n", "    print(\"\\n⚠️ Note: This requires NCBI EDirect tools to be installed\")\n", "\n", "else:\n", "    print(\"⚠️ Cannot create gene mapping - data not available\")" ] }, { "cell_type": "markdown", "id": "1957ef57-1af8-46a1-8d1b-147f6b423619", "metadata": {}, "source": [ "## Apply Gene Name Mapping\n", "\n", "Load the gene ID to name mapping from `gene_id_to_name.json` and apply it to the `clinvar_data` DataFrame to add human-readable gene names." ] }, { "cell_type": "code", "execution_count": null, "id": "b39be718-c0ae-4aae-b1d8-d0c872947ec2", "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "# Load gene ID to name mapping and apply to dataset\n", "\n", "if 'clinvar_data' in locals() and clinvar_data is not None:\n", "    try:\n", "        # Load gene ID → name mapping\n", "        with open(\"gene_id_to_name.json\", \"r\") as f:\n", "            gene_id_dict = json.load(f)\n", "\n", "        print(f\"✅ Loaded mapping for {len(gene_id_dict)} gene IDs\")\n", "\n", "        # Function to convert gene IDs to gene names\n", "        def get_gene_names(gene_id_str):\n", "            if pd.isna(gene_id_str) or not gene_id_str.strip():\n", "                return \"\"\n", "\n", "            gene_ids = [gid.strip() for gid in gene_id_str.split(\",\") if gid.strip()]\n", "            gene_names = []\n", "\n", "            for gid in gene_ids:\n", "                gene_name = gene_id_dict.get(gid, f\"Unknown_ID_{gid}\")\n", "                gene_names.append(gene_name)\n", "\n", "            return \" | \".join(gene_names)\n", "\n", "        # Apply mapping to create gene names column\n", "        print(\"📊 Applying gene name mapping...\")\n", "        clinvar_data[\"GENE_Name\"] = clinvar_data[\"GENE_ID\"].apply(get_gene_names)\n", "\n", "        # Statistics\n", "        mapped_count = (clinvar_data[\"GENE_Name\"] != \"\").sum()\n", "        print(f\"✅ Gene names mapped for {mapped_count} entries ({mapped_count/len(clinvar_data)*100:.1f}%)\")\n", "\n", "        # Show sample mappings\n", "        sample_data = clinvar_data[clinvar_data[\"GENE_Name\"] != \"\"][[\"GENE_ID\", \"GENE_Name\"]].head()\n", "        if 
not sample_data.empty:\n", "            print(\"\\n🔍 Sample gene ID to name mappings:\")\n", "            for _, row in sample_data.iterrows():\n", "                print(f\" {row['GENE_ID']} → {row['GENE_Name'][:100]}{'...' if len(row['GENE_Name']) > 100 else ''}\")\n", "\n", "    except FileNotFoundError:\n", "        print(\"❌ Error: gene_id_to_name.json not found\")\n", "        print(\"Please run the gene mapping script first: ./gene_mapping.sh\")\n", "        # Create empty column as fallback\n", "        clinvar_data[\"GENE_Name\"] = \"\"\n", "    except json.JSONDecodeError as e:\n", "        print(f\"❌ Error parsing JSON mapping file: {e}\")\n", "        clinvar_data[\"GENE_Name\"] = \"\"\n", "    except Exception as e:\n", "        print(f\"❌ Error applying gene mapping: {e}\")\n", "        clinvar_data[\"GENE_Name\"] = \"\"\n", "else:\n", "    print(\"⚠️ Cannot apply gene mapping - data not available\")" ] }, { "cell_type": "code", "execution_count": null, "id": "4b7a44c2-7823-47c1-b268-22a1815ffd09", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>CHROM</th><th>POS</th><th>REF</th><th>ALT</th><th>LABEL</th><th>SOURCE</th><th>CONSEQUENCE</th><th>ID</th><th>REVIEW_STATUS</th><th>GENE</th><th>split</th><th>INT_LABEL</th><th>GENE_ID</th><th>Disease</th><th>GENE_Name</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>chr1</td><td>976215</td><td>A</td><td>G</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>1320032</td><td>no_assertion_criteria_provided</td><td>PERM1</td><td>train</td><td>1</td><td>84808</td><td>Renal tubular epithelial cell apoptosis</td><td>PPARGC1 and ESRR induced regulator, muscle 1</td></tr>\n",
"    <tr><th>1</th><td>chr1</td><td>976215</td><td>A</td><td>G</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>1320032</td><td>no_assertion_criteria_provided</td><td>PERM1</td><td>train</td><td>1</td><td>84808</td><td>Neutrophil inclusion bodies</td><td>PPARGC1 and ESRR induced regulator, muscle 1</td></tr>\n",
"    <tr><th>2</th><td>chr1</td><td>1050449</td><td>G</td><td>A</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>1284257</td><td>no_assertion_criteria_provided</td><td>AGRN</td><td>train</td><td>1</td><td>375790</td><td>Congenital myasthenic syndrome 8</td><td>agrin</td></tr>\n",
"    <tr><th>3</th><td>chr1</td><td>1050575</td><td>G</td><td>C</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>18241</td><td>no_assertion_criteria_provided</td><td>AGRN</td><td>train</td><td>1</td><td>375790</td><td>Congenital myasthenic syndrome 8</td><td>agrin</td></tr>\n",
"    <tr><th>4</th><td>chr1</td><td>1213738</td><td>G</td><td>A</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>96692</td><td>no_assertion_criteria_provided</td><td>TNFRSF4</td><td>train</td><td>1</td><td>7293</td><td>Combined immunodeficiency due to OX40 deficiency</td><td>TNF receptor superfamily member 4</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>32680</th><td>chrY</td><td>2787412</td><td>C</td><td>T</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>9747</td><td>no_assertion_criteria_provided</td><td>SRY</td><td>train</td><td>1</td><td>6736</td><td>46,XY sex reversal 1</td><td>sex determining region Y</td></tr>\n",
"    <tr><th>32681</th><td>chrY</td><td>2787426</td><td>C</td><td>G</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>9739</td><td>criteria_provided,_single_submitter</td><td>SRY</td><td>train</td><td>1</td><td>6736</td><td>not provided</td><td>sex determining region Y</td></tr>\n",
"    <tr><th>32682</th><td>chrY</td><td>2787515</td><td>C</td><td>A</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>492908</td><td>no_assertion_criteria_provided</td><td>SRY</td><td>train</td><td>1</td><td>6736</td><td>46,XY sex reversal 1</td><td>sex determining region Y</td></tr>\n",
"    <tr><th>32683</th><td>chrY</td><td>2787551</td><td>C</td><td>T</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>9754</td><td>no_assertion_criteria_provided</td><td>SRY</td><td>train</td><td>1</td><td>6736</td><td>46,XY sex reversal 1</td><td>sex determining region Y</td></tr>\n",
"    <tr><th>32684</th><td>chrY</td><td>7063898</td><td>A</td><td>T</td><td>Pathogenic</td><td>ClinVar</td><td>missense_variant</td><td>625467</td><td>no_assertion_criteria_provided</td><td>LOC126057105, TBL1Y</td><td>train</td><td>1</td><td>126057105, 90665</td><td>Deafness, Y-linked 2</td><td>P300/CBP strongly-dependent group 1 enhancer G...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>32685 rows × 15 columns</p>\n",
"</div>
" ], "text/plain": [ " CHROM POS REF ALT LABEL SOURCE CONSEQUENCE ID \\\n", "0 chr1 976215 A G Pathogenic ClinVar missense_variant 1320032 \n", "1 chr1 976215 A G Pathogenic ClinVar missense_variant 1320032 \n", "2 chr1 1050449 G A Pathogenic ClinVar missense_variant 1284257 \n", "3 chr1 1050575 G C Pathogenic ClinVar missense_variant 18241 \n", "4 chr1 1213738 G A Pathogenic ClinVar missense_variant 96692 \n", "... ... ... .. .. ... ... ... ... \n", "32680 chrY 2787412 C T Pathogenic ClinVar missense_variant 9747 \n", "32681 chrY 2787426 C G Pathogenic ClinVar missense_variant 9739 \n", "32682 chrY 2787515 C A Pathogenic ClinVar missense_variant 492908 \n", "32683 chrY 2787551 C T Pathogenic ClinVar missense_variant 9754 \n", "32684 chrY 7063898 A T Pathogenic ClinVar missense_variant 625467 \n", "\n", " REVIEW_STATUS GENE split \\\n", "0 no_assertion_criteria_provided PERM1 train \n", "1 no_assertion_criteria_provided PERM1 train \n", "2 no_assertion_criteria_provided AGRN train \n", "3 no_assertion_criteria_provided AGRN train \n", "4 no_assertion_criteria_provided TNFRSF4 train \n", "... ... ... ... \n", "32680 no_assertion_criteria_provided SRY train \n", "32681 criteria_provided,_single_submitter SRY train \n", "32682 no_assertion_criteria_provided SRY train \n", "32683 no_assertion_criteria_provided SRY train \n", "32684 no_assertion_criteria_provided LOC126057105, TBL1Y train \n", "\n", " INT_LABEL GENE_ID \\\n", "0 1 84808 \n", "1 1 84808 \n", "2 1 375790 \n", "3 1 375790 \n", "4 1 7293 \n", "... ... ... \n", "32680 1 6736 \n", "32681 1 6736 \n", "32682 1 6736 \n", "32683 1 6736 \n", "32684 1 126057105, 90665 \n", "\n", " Disease \\\n", "0 Renal tubular epithelial cell apoptosis \n", "1 Neutrophil inclusion bodies \n", "2 Congenital myasthenic syndrome 8 \n", "3 Congenital myasthenic syndrome 8 \n", "4 Combined immunodeficiency due to OX40 deficiency \n", "... ... 
\n", "32680 46,XY sex reversal 1 \n", "32681 not provided \n", "32682 46,XY sex reversal 1 \n", "32683 46,XY sex reversal 1 \n", "32684 Deafness, Y-linked 2 \n", "\n", " GENE_Name \n", "0 PPARGC1 and ESRR induced regulator, muscle 1 \n", "1 PPARGC1 and ESRR induced regulator, muscle 1 \n", "2 agrin \n", "3 agrin \n", "4 TNF receptor superfamily member 4 \n", "... ... \n", "32680 sex determining region Y \n", "32681 sex determining region Y \n", "32682 sex determining region Y \n", "32683 sex determining region Y \n", "32684 P300/CBP strongly-dependent group 1 enhancer G... \n", "\n", "[32685 rows x 15 columns]" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display final dataset with all extracted information\n", "if 'clinvar_data' in locals() and clinvar_data is not None:\n", "    print(f\"📊 Final dataset shape: {clinvar_data.shape}\")\n", "    print(f\"📋 Columns: {list(clinvar_data.columns)}\")\n", "\n", "    # Data completeness statistics (missing values and empty strings both count as unfilled)\n", "    print(\"\\n📈 Data Completeness:\")\n", "    for col in ['GENE', 'GENE_ID', 'GENE_Name', 'Disease']:\n", "        if col in clinvar_data.columns:\n", "            filled_count = clinvar_data[col].fillna('').ne('').sum()\n", "            print(f\" {col}: {filled_count}/{len(clinvar_data)} ({filled_count/len(clinvar_data)*100:.1f}%)\")\n", "\n", "    # Sample data\n", "    print(\"\\n🔍 Sample data:\")\n", "    display(clinvar_data.head())\n", "\n", "    # Memory usage\n", "    memory_mb = clinvar_data.memory_usage(deep=True).sum() / 1024 / 1024\n", "    print(f\"\\n💾 Dataset memory usage: {memory_mb:.1f} MB\")\n", "else:\n", "    print(\"❌ No final data to display\")" ] }, { "cell_type": "code", "execution_count": null, "id": "c545ae83-5cd1-4e29-87fd-69389bdb153f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'P300/CBP strongly-dependent group 1 enhancer GRCh37_chrY:6931456-6932655| transducin beta like 1 Y-linked'" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show 
example of gene name mapping\n", "if 'clinvar_data' in locals() and clinvar_data is not None and len(clinvar_data) > 32684:\n", "    example_gene_name = clinvar_data.iloc[32684]['GENE_Name']\n", "    example_gene_id = clinvar_data.iloc[32684]['GENE_ID']\n", "\n", "    print(f\"🔍 Example gene mapping for row 32684:\")\n", "    print(f\" Gene ID: {example_gene_id}\")\n", "    print(f\" Gene Name: {example_gene_name}\")\n", "else:\n", "    # Show any available example\n", "    if 'clinvar_data' in locals() and clinvar_data is not None and not clinvar_data.empty:\n", "        # Find first row with gene name data\n", "        example_row = clinvar_data[clinvar_data['GENE_Name'] != ''].iloc[0] if (clinvar_data['GENE_Name'] != '').any() else clinvar_data.iloc[0]\n", "\n", "        print(f\"🔍 Example gene mapping:\")\n", "        print(f\" Gene ID: {example_row.get('GENE_ID', 'N/A')}\")\n", "        print(f\" Gene Name: {example_row.get('GENE_Name', 'N/A')}\")\n", "    else:\n", "        print(\"❌ No data available for example\")" ] }, { "cell_type": "code", "execution_count": null, "id": "a214c29d-a4f1-4af6-a914-e6b4a14a1c49", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# Save the final processed dataset\n", "if 'clinvar_data' in locals() and clinvar_data is not None:\n", "    output_file = \"clinvar_with_disease.csv\"\n", "\n", "    try:\n", "        clinvar_data.to_csv(output_file, index=False)\n", "\n", "        print(f\"✅ Final dataset saved to {output_file}\")\n", "        print(f\"📊 Saved {len(clinvar_data)} records with {len(clinvar_data.columns)} columns\")\n", "\n", "        # File size\n", "        file_size = os.path.getsize(output_file) / 1024 / 1024\n", "        print(f\"💾 File size: {file_size:.1f} MB\")\n", "\n", "        # Summary of what was accomplished\n", "        print(\"\\n🎯 Processing Summary:\")\n", "        print(f\" ✓ Extracted ClinVar coding variants\")\n", "        print(f\" ✓ Parsed XML records for gene information\")\n", "        print(f\" ✓ Mapped diseases/phenotypes\")\n", "        print(f\" ✓ Added human-readable gene names\")\n", "        
print(f\" ✓ Created comprehensive dataset\")\n", "\n", "    except Exception as e:\n", "        print(f\"❌ Error saving dataset: {e}\")\n", "else:\n", "    print(\"⚠️ No data available to save\")" ] }, { "cell_type": "code", "execution_count": null, "id": "b6c4c1f4-4b87-4624-8f8a-c568e40b2e63", "metadata": {}, "outputs": [], "source": [ "import os\n", "import shutil\n", "\n", "# Optional: Clean up temporary XML data directory\n", "# Uncomment the marked lines below if you want to remove the XML files to save space\n", "\n", "if os.path.exists(\"data\") and os.path.isdir(\"data\"):\n", "    # Count files before cleanup\n", "    xml_files = [f for f in os.listdir(\"data\") if f.endswith('.xml')]\n", "\n", "    print(f\"🗂️ Found {len(xml_files)} XML files in data directory\")\n", "\n", "    # Uncomment to actually remove the directory\n", "    # shutil.rmtree(\"data\")\n", "    # print(\"🗑️ Removed temporary XML data directory\")\n", "\n", "    print(\"ℹ️ XML files preserved. Uncomment the cleanup code to remove them.\")\n", "else:\n", "    print(\"ℹ️ No XML data directory found to clean up\")" ] }, { "cell_type": "markdown", "id": "c08beea6-6ff7-4900-a8b8-8a719db36189", "metadata": {}, "source": [ "## Processing Complete ✅\n", "\n", "The ClinVar coding variants have been successfully processed with the following enhancements:\n", "\n", "### Generated Files:\n", "- `clinvar_coding_raw.csv` - Raw ClinVar entries extracted from VEP data\n", "- `Clinvar_ID.txt` - List of ClinVar IDs for processing\n", "- `gene_id.txt` - Unique gene IDs for name mapping\n", "- `gene_id_to_name.json` - Gene ID to name mapping dictionary\n", "- `clinvar_with_disease.csv` - **Final comprehensive dataset**\n", "\n", "### Dataset Features:\n", "- **Variant Information**: Genomic coordinates, alleles, and annotations\n", "- **Gene Data**: Symbols, IDs, and human-readable names\n", "- **Disease/Phenotype**: Associated conditions and clinical significance\n", "- 
**Expanded Format**: One row per variant-disease combination\n", "\n", "### Next Steps:\n", "1. **Quality Control**: Review the data for completeness and accuracy\n", "2. **Analysis**: Use the dataset for downstream genetic analysis\n", "3. **Integration**: Combine with other datasets as needed\n", "4. **Documentation**: Update metadata and create data dictionary\n", "\n", "### File Cleanup:\n", "- XML files in `data/` directory can be removed to save space\n", "- Intermediate files can be archived or removed as needed" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }