{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "83c9cd1f",
   "metadata": {},
   "source": [
    "## Setup and Data Preparation\n",
    "\n",
    "Initial setup steps to prepare the working environment and extract ClinVar data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81a36253-9050-4d58-96cd-8238aae51e0e",
   "metadata": {},
   "source": [
    "# ClinVar Coding Variants Data Processing\n",
    "\n",
    "This notebook enriches ClinVar coding variants with gene symbols, gene IDs, gene names, and associated diseases extracted from ClinVar XML records.\n",
    "\n",
    "## Overview\n",
    "\n",
    "The workflow includes:\n",
    "1. **Data Extraction**: Filter ClinVar entries from VEP-annotated pathogenic coding variants\n",
    "2. **XML Processing**: Parse ClinVar XML records to extract gene and disease information\n",
    "3. **Gene Annotation**: Map gene IDs to gene names using NCBI Entrez utilities\n",
    "4. **Data Integration**: Combine all information into a comprehensive dataset\n",
    "\n",
    "## Requirements\n",
    "\n",
    "- Python 3.7+\n",
    "- pandas library\n",
    "- xml.etree.ElementTree (built-in)\n",
    "- NCBI Entrez Direct tools (for gene name mapping)\n",
    "- Input data: VEP-annotated pathogenic coding variants CSV file\n",
    "\n",
    "## Data Structure\n",
    "\n",
    "The processing creates a dataset with the following key columns:\n",
    "- Variant information (chromosome, position, alleles)\n",
    "- ClinVar ID and significance\n",
    "- Gene symbols and IDs\n",
    "- Associated disease/phenotype information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb351234-50a3-4061-81ce-bdce5343e790",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create working directory for ClinVar data processing\n",
    "import os\n",
    "os.makedirs('clinvar', exist_ok=True)\n",
    "print(\"Created 'clinvar' directory\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "443ccab8-50a1-45ae-950c-8425eb318e93",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Navigate to clinvar directory\n",
    "os.chdir('clinvar')\n",
    "print(f\"Current working directory: {os.getcwd()}\")\n",
    "\n",
    "# Pure-Python filter: keep the header row plus every line that mentions ClinVar\n",
    "with open('vep_pathogenic_coding.csv') as infile, open('clinvar_coding_raw.csv', 'w') as outfile:\n",
    "    outfile.write(infile.readline())  # header row, needed later by pd.read_csv\n",
    "    for line in infile:\n",
    "        if 'ClinVar' in line:\n",
    "            outfile.write(line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1f92675-b85c-4baa-8680-9c3776e04ac9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract ClinVar entries from VEP-annotated pathogenic coding variants\n",
    "# Note: Update the input file path to match your data location\n",
    "input_file = \"../data/vep_pathogenic_coding.csv\"  # Adjust path as needed\n",
    "output_file = \"clinvar_coding_raw.csv\"\n",
    "\n",
    "# Use shell command to filter ClinVar entries\n",
    "import subprocess\n",
    "try:\n",
    "    # grep drops the CSV header row, so read it separately\n",
    "    with open(input_file) as f:\n",
    "        header = f.readline()\n",
    "\n",
    "    result = subprocess.run(\n",
    "        [\"grep\", \"ClinVar\", input_file],\n",
    "        capture_output=True,\n",
    "        text=True,\n",
    "        check=True\n",
    "    )\n",
    "\n",
    "    with open(output_file, 'w') as f:\n",
    "        f.write(header)\n",
    "        f.write(result.stdout)\n",
    "\n",
    "    # Count lines outside the f-string: a backslash inside an f-string expression fails before Python 3.12\n",
    "    n_entries = len(result.stdout.splitlines())\n",
    "    print(f\"Extracted ClinVar entries to {output_file}\")\n",
    "    print(f\"Found {n_entries} ClinVar entries\")\n",
    "\n",
    "except subprocess.CalledProcessError:\n",
    "    print(f\"Error: Could not find ClinVar entries in {input_file}\")\n",
    "    print(\"Please ensure the input file exists and contains ClinVar annotations\")\n",
    "except FileNotFoundError:\n",
    "    print(f\"Error: Input file {input_file} not found\")\n",
    "    print(\"Please update the input_file path to point to your VEP-annotated data\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e560308-135b-4189-9146-ff50845839a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract ClinVar IDs from the filtered data (assuming ID is in column 8)\n",
    "# Note: Adjust column number if your data structure is different\n",
    "import pandas as pd\n",
    "\n",
    "try:\n",
    "    # Read the raw ClinVar data to determine structure\n",
    "    df_temp = pd.read_csv(\"clinvar_coding_raw.csv\")\n",
    "    print(f\"Data
shape: {df_temp.shape}\")\n",
    "    print(f\"Columns: {list(df_temp.columns)}\")\n",
    "\n",
    "    # Extract ClinVar IDs (adjust column index as needed)\n",
    "    # Column 8 corresponds to index 7 in Python (0-based)\n",
    "    if df_temp.shape[1] >= 8:\n",
    "        clinvar_ids = df_temp.iloc[:, 7]  # 8th column (0-based index 7)\n",
    "\n",
    "        # Save IDs to file\n",
    "        with open(\"Clinvar_ID.txt\", 'w') as f:\n",
    "            for id_val in clinvar_ids:\n",
    "                if pd.notna(id_val):\n",
    "                    f.write(f\"{id_val}\\n\")\n",
    "\n",
    "        print(f\"Extracted {len(clinvar_ids.dropna())} ClinVar IDs to Clinvar_ID.txt\")\n",
    "    else:\n",
    "        print(f\"Error: Expected at least 8 columns, found {df_temp.shape[1]}\")\n",
    "\n",
    "except FileNotFoundError:\n",
    "    print(\"Error: clinvar_coding_raw.csv not found\")\n",
    "    print(\"Please run the previous cell first to extract ClinVar data\")\n",
    "except Exception as e:\n",
    "    print(f\"Error processing ClinVar data: {e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53b0dfd8-8d49-4c3f-adb4-4c6bfbffcfa9",
   "metadata": {},
   "source": [
    "## XML Data Retrieval\n",
    "\n",
    "**Note**: This step requires creating a shell script (`Clinvar_esearch.sh`) to fetch XML data from NCBI and making it executable with `chmod +x Clinvar_esearch.sh`.\n",
    "\n",
    "The script should:\n",
    "1. Read ClinVar IDs from `Clinvar_ID.txt`\n",
    "2. Use NCBI Entrez Direct tools to fetch XML records\n",
    "3.
Save XML files in a `data/` subdirectory\n",
    "\n",
    "Example script content:\n",
    "```bash\n",
    "#!/bin/bash\n",
    "mkdir -p data\n",
    "while read -r id; do\n",
    "    esearch -db clinvar -query \"$id\" | efetch -format xml > \"data/${id}.xml\"\n",
    "    echo \"Downloaded XML for ClinVar ID: $id\"\n",
    "done < Clinvar_ID.txt\n",
    "```\n",
    "\n",
    "**Prerequisites**: Install NCBI Entrez Direct tools:\n",
    "- macOS: `brew install brewsci/bio/edirect`\n",
    "- Linux: Follow NCBI EDirect installation guide"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0755ad6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parsing XML for Gene and Disease\n",
    "\n",
    "# Make the ClinVar search script executable and run it\n",
    "# Note: This assumes you have created the Clinvar_esearch.sh script\n",
    "\n",
    "import os\n",
    "import subprocess\n",
    "\n",
    "script_path = \"Clinvar_esearch.sh\"\n",
    "\n",
    "if os.path.exists(script_path):\n",
    "    # Make script executable\n",
    "    os.chmod(script_path, 0o755)\n",
    "    print(f\"Made {script_path} executable\")\n",
    "\n",
    "    # Optionally run the script (uncomment if you want to execute automatically)\n",
    "    # print(\"Running ClinVar XML download script...\")\n",
    "    # result = subprocess.run([f\"./{script_path}\"], capture_output=True, text=True)\n",
    "    # if result.returncode == 0:\n",
    "    #     print(\"XML download completed successfully\")\n",
    "    # else:\n",
    "    #     print(f\"Script execution failed: {result.stderr}\")\n",
    "else:\n",
    "    print(f\"Warning: {script_path} not found\")\n",
    "    print(\"Please create this script manually to download ClinVar XML data\")\n",
    "    print(\"See the documentation in the previous cell for script template\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d21a188b-a0dc-4af2-9b71-5a44d8cd4673",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import required libraries\n",
    "import pandas as pd\n",
    "import xml.etree.ElementTree as ET\n",
    "import json\n",
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "print(\"Libraries
imported successfully\")\n",
    "print(f\"Current directory: {os.getcwd()}\")\n",
    "print(f\"Pandas version: {pd.__version__}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1365615b-ee81-4df0-9fca-df001e9f01d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the raw ClinVar data\n",
    "try:\n",
    "    clinvar_raw = pd.read_csv(\"clinvar_coding_raw.csv\")\n",
    "    print(f\"Loaded ClinVar data: {clinvar_raw.shape[0]} rows, {clinvar_raw.shape[1]} columns\")\n",
    "    print(f\"Columns: {list(clinvar_raw.columns)[:10]}\")  # Show first 10 columns\n",
    "\n",
    "except FileNotFoundError:\n",
    "    print(\"Error: clinvar_coding_raw.csv not found\")\n",
    "    print(\"Please run the data extraction steps first\")\n",
    "    clinvar_raw = None\n",
    "except Exception as e:\n",
    "    print(f\"Error loading data: {e}\")\n",
    "    clinvar_raw = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7144ddf2-abf7-4680-b578-d4bd4b7195ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove unnecessary columns to streamline the dataset\n",
    "# Note: Adjust column names based on your actual data structure\n",
    "\n",
    "if clinvar_raw is not None:\n",
    "    columns_to_remove = [\n",
    "        \"GENOMIC_MUTATION_ID\", \"N_SAMPLES\", \"TOTAL_SAMPLES\", \"FREQ\",\n",
    "        \"OMIM\", \"PMID\", \"AC\", \"AN\", \"AF\", \"MAF\", \"MAC\"\n",
    "    ]\n",
    "\n",
    "    # Only remove columns that actually exist in the dataset\n",
    "    existing_columns = [col for col in columns_to_remove if col in clinvar_raw.columns]\n",
    "    missing_columns = [col for col in columns_to_remove if col not in clinvar_raw.columns]\n",
    "\n",
    "    if existing_columns:\n",
    "        clinvar_raw = clinvar_raw.drop(columns=existing_columns)\n",
    "        print(f\"Removed {len(existing_columns)} columns: {existing_columns}\")\n",
    "\n",
    "    if missing_columns:\n",
    "        print(f\"Columns not found (skipped): {missing_columns}\")\n",
    "\n",
    "    print(f\"Remaining columns: {clinvar_raw.shape[1]}\")\n",
    "else:\n",
    "    print(\"Skipping column removal - data not
loaded\")" ] }, { "cell_type": "code", "execution_count": null, "id": "fbffd3cd-7df3-43e2-8d73-01f54e8d1da6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | CHROM | POS | REF | ALT | LABEL | SOURCE | CONSEQUENCE | ID | REVIEW_STATUS | GENE | split | INT_LABEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 976215 | A | G | Pathogenic | ClinVar | missense_variant | 1320032 | no_assertion_criteria_provided | NaN | train | 1 |
| 1 | chr1 | 1050449 | G | A | Pathogenic | ClinVar | missense_variant | 1284257 | no_assertion_criteria_provided | NaN | train | 1 |
| 2 | chr1 | 1050575 | G | C | Pathogenic | ClinVar | missense_variant | 18241 | no_assertion_criteria_provided | NaN | train | 1 |
| 3 | chr1 | 1213738 | G | A | Pathogenic | ClinVar | missense_variant | 96692 | no_assertion_criteria_provided | NaN | train | 1 |
| 4 | chr1 | 1232279 | A | G | Pathogenic | ClinVar | initiatior_codon_variant,missense_variant | 60484 | criteria_provided,_multiple_submitters,_no_con... | NaN | train | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22249 | chrY | 2787412 | C | T | Pathogenic | ClinVar | missense_variant | 9747 | no_assertion_criteria_provided | NaN | train | 1 |
| 22250 | chrY | 2787426 | C | G | Pathogenic | ClinVar | missense_variant | 9739 | criteria_provided,_single_submitter | NaN | train | 1 |
| 22251 | chrY | 2787515 | C | A | Pathogenic | ClinVar | missense_variant | 492908 | no_assertion_criteria_provided | NaN | train | 1 |
| 22252 | chrY | 2787551 | C | T | Pathogenic | ClinVar | missense_variant | 9754 | no_assertion_criteria_provided | NaN | train | 1 |
| 22253 | chrY | 7063898 | A | T | Pathogenic | ClinVar | missense_variant | 625467 | no_assertion_criteria_provided | NaN | train | 1 |

22254 rows × 12 columns

|  | CHROM | POS | REF | ALT | LABEL | SOURCE | CONSEQUENCE | ID | REVIEW_STATUS | GENE | split | INT_LABEL | GENE_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 976215 | A | G | Pathogenic | ClinVar | missense_variant | 1320032 | no_assertion_criteria_provided |  | train | 1 |  |
| 1 | chr1 | 1050449 | G | A | Pathogenic | ClinVar | missense_variant | 1284257 | no_assertion_criteria_provided |  | train | 1 |  |
| 2 | chr1 | 1050575 | G | C | Pathogenic | ClinVar | missense_variant | 18241 | no_assertion_criteria_provided |  | train | 1 |  |
| 3 | chr1 | 1213738 | G | A | Pathogenic | ClinVar | missense_variant | 96692 | no_assertion_criteria_provided |  | train | 1 |  |
| 4 | chr1 | 1232279 | A | G | Pathogenic | ClinVar | initiatior_codon_variant,missense_variant | 60484 | criteria_provided,_multiple_submitters,_no_con... |  | train | 1 |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22249 | chrY | 2787412 | C | T | Pathogenic | ClinVar | missense_variant | 9747 | no_assertion_criteria_provided |  | train | 1 |  |
| 22250 | chrY | 2787426 | C | G | Pathogenic | ClinVar | missense_variant | 9739 | criteria_provided,_single_submitter |  | train | 1 |  |
| 22251 | chrY | 2787515 | C | A | Pathogenic | ClinVar | missense_variant | 492908 | no_assertion_criteria_provided |  | train | 1 |  |
| 22252 | chrY | 2787551 | C | T | Pathogenic | ClinVar | missense_variant | 9754 | no_assertion_criteria_provided |  | train | 1 |  |
| 22253 | chrY | 7063898 | A | T | Pathogenic | ClinVar | missense_variant | 625467 | no_assertion_criteria_provided |  | train | 1 |  |

22254 rows × 13 columns

|  | CHROM | POS | REF | ALT | LABEL | SOURCE | CONSEQUENCE | ID | REVIEW_STATUS | GENE | split | INT_LABEL | GENE_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 976215 | A | G | Pathogenic | ClinVar | missense_variant | 1320032 | no_assertion_criteria_provided | PERM1 | train | 1 | 84808 |
| 1 | chr1 | 1050449 | G | A | Pathogenic | ClinVar | missense_variant | 1284257 | no_assertion_criteria_provided | AGRN | train | 1 | 375790 |
| 2 | chr1 | 1050575 | G | C | Pathogenic | ClinVar | missense_variant | 18241 | no_assertion_criteria_provided | AGRN | train | 1 | 375790 |
| 3 | chr1 | 1213738 | G | A | Pathogenic | ClinVar | missense_variant | 96692 | no_assertion_criteria_provided | TNFRSF4 | train | 1 | 7293 |
| 4 | chr1 | 1232279 | A | G | Pathogenic | ClinVar | initiatior_codon_variant,missense_variant | 60484 | criteria_provided,_multiple_submitters,_no_con... | B3GALT6 | train | 1 | 126792 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22249 | chrY | 2787412 | C | T | Pathogenic | ClinVar | missense_variant | 9747 | no_assertion_criteria_provided | SRY | train | 1 | 6736 |
| 22250 | chrY | 2787426 | C | G | Pathogenic | ClinVar | missense_variant | 9739 | criteria_provided,_single_submitter | SRY | train | 1 | 6736 |
| 22251 | chrY | 2787515 | C | A | Pathogenic | ClinVar | missense_variant | 492908 | no_assertion_criteria_provided | SRY | train | 1 | 6736 |
| 22252 | chrY | 2787551 | C | T | Pathogenic | ClinVar | missense_variant | 9754 | no_assertion_criteria_provided | SRY | train | 1 | 6736 |
| 22253 | chrY | 7063898 | A | T | Pathogenic | ClinVar | missense_variant | 625467 | no_assertion_criteria_provided | LOC126057105, TBL1Y | train | 1 | 126057105, 90665 |

22150 rows × 13 columns
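
The `GENE` and `GENE_ID` values in the table above come from the downloaded ClinVar XML records. The cell that fills them is not shown here, so the following is only a minimal sketch. It assumes each record contains `Gene` elements carrying `Symbol` and `GeneID` attributes; the helper name `extract_genes` and the inline sample record are hypothetical, and the real ClinVar schema may nest or name these elements differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified record; real ClinVar XML has many more elements.
SAMPLE_XML = """
<ClinVarResult>
  <GeneList>
    <Gene Symbol="LOC126057105" GeneID="126057105"/>
    <Gene Symbol="TBL1Y" GeneID="90665"/>
  </GeneList>
</ClinVarResult>
"""


def extract_genes(xml_text):
    """Return (symbols, ids) as comma-separated strings for one XML record."""
    root = ET.fromstring(xml_text)
    symbols, gene_ids = [], []
    # iter() walks the whole tree, so the nesting depth of <Gene> does not matter.
    for gene in root.iter("Gene"):
        symbol = gene.get("Symbol")
        gene_id = gene.get("GeneID")
        # Skip incomplete entries and avoid listing the same gene twice.
        if symbol and gene_id and symbol not in symbols:
            symbols.append(symbol)
            gene_ids.append(gene_id)
    return ", ".join(symbols), ", ".join(gene_ids)


genes, ids = extract_genes(SAMPLE_XML)
print(genes)
print(ids)
```

Joining multiple genes with `", "` mirrors how multi-gene rows such as `LOC126057105, TBL1Y` appear in the table; a record with no `Gene` elements yields two empty strings.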

|  | CHROM | POS | REF | ALT | LABEL | SOURCE | CONSEQUENCE | ID | REVIEW_STATUS | GENE | split | INT_LABEL | GENE_ID | Disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 976215 | A | G | Pathogenic | ClinVar | missense_variant | 1320032 | no_assertion_criteria_provided | PERM1 | train | 1 | 84808 | Renal tubular epithelial cell apoptosis |
| 1 | chr1 | 976215 | A | G | Pathogenic | ClinVar | missense_variant | 1320032 | no_assertion_criteria_provided | PERM1 | train | 1 | 84808 | Neutrophil inclusion bodies |
| 2 | chr1 | 1050449 | G | A | Pathogenic | ClinVar | missense_variant | 1284257 | no_assertion_criteria_provided | AGRN | train | 1 | 375790 | Congenital myasthenic syndrome 8 |
| 3 | chr1 | 1050575 | G | C | Pathogenic | ClinVar | missense_variant | 18241 | no_assertion_criteria_provided | AGRN | train | 1 | 375790 | Congenital myasthenic syndrome 8 |
| 4 | chr1 | 1213738 | G | A | Pathogenic | ClinVar | missense_variant | 96692 | no_assertion_criteria_provided | TNFRSF4 | train | 1 | 7293 | Combined immunodeficiency due to OX40 deficiency |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32680 | chrY | 2787412 | C | T | Pathogenic | ClinVar | missense_variant | 9747 | no_assertion_criteria_provided | SRY | train | 1 | 6736 | 46,XY sex reversal 1 |
| 32681 | chrY | 2787426 | C | G | Pathogenic | ClinVar | missense_variant | 9739 | criteria_provided,_single_submitter | SRY | train | 1 | 6736 | not provided |
| 32682 | chrY | 2787515 | C | A | Pathogenic | ClinVar | missense_variant | 492908 | no_assertion_criteria_provided | SRY | train | 1 | 6736 | 46,XY sex reversal 1 |
| 32683 | chrY | 2787551 | C | T | Pathogenic | ClinVar | missense_variant | 9754 | no_assertion_criteria_provided | SRY | train | 1 | 6736 | 46,XY sex reversal 1 |
| 32684 | chrY | 7063898 | A | T | Pathogenic | ClinVar | missense_variant | 625467 | no_assertion_criteria_provided | LOC126057105, TBL1Y | train | 1 | 126057105, 90665 | Deafness, Y-linked 2 |

32685 rows × 14 columns
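
The row count grows from 22150 to 32685 because each variant is repeated once per associated disease (variant 1320032 appears twice above). One way to reach that shape is `DataFrame.explode`, sketched here under the assumption that the diseases were first collected as a list per variant. The small frame below is illustrative, with values copied from the table above; it is not the notebook's actual intermediate data.

```python
import pandas as pd

# Illustrative input: one row per variant, diseases held in a list column.
variants = pd.DataFrame({
    "ID": [1320032, 1284257],
    "GENE": ["PERM1", "AGRN"],
    "Disease": [
        ["Renal tubular epithelial cell apoptosis", "Neutrophil inclusion bodies"],
        ["Congenital myasthenic syndrome 8"],
    ],
})

# explode() emits one row per list element and repeats the other columns;
# reset_index gives the continuous 0..n-1 index seen in the output table.
per_disease = variants.explode("Disease").reset_index(drop=True)
print(per_disease)
```

A long layout like this makes per-disease filtering and grouping simple, at the cost of duplicating the variant columns.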

|  | CHROM | POS | REF | ALT | LABEL | SOURCE | CONSEQUENCE | ID | REVIEW_STATUS | GENE | split | INT_LABEL | GENE_ID | Disease | GENE_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 976215 | A | G | Pathogenic | ClinVar | missense_variant | 1320032 | no_assertion_criteria_provided | PERM1 | train | 1 | 84808 | Renal tubular epithelial cell apoptosis | PPARGC1 and ESRR induced regulator, muscle 1 |
| 1 | chr1 | 976215 | A | G | Pathogenic | ClinVar | missense_variant | 1320032 | no_assertion_criteria_provided | PERM1 | train | 1 | 84808 | Neutrophil inclusion bodies | PPARGC1 and ESRR induced regulator, muscle 1 |
| 2 | chr1 | 1050449 | G | A | Pathogenic | ClinVar | missense_variant | 1284257 | no_assertion_criteria_provided | AGRN | train | 1 | 375790 | Congenital myasthenic syndrome 8 | agrin |
| 3 | chr1 | 1050575 | G | C | Pathogenic | ClinVar | missense_variant | 18241 | no_assertion_criteria_provided | AGRN | train | 1 | 375790 | Congenital myasthenic syndrome 8 | agrin |
| 4 | chr1 | 1213738 | G | A | Pathogenic | ClinVar | missense_variant | 96692 | no_assertion_criteria_provided | TNFRSF4 | train | 1 | 7293 | Combined immunodeficiency due to OX40 deficiency | TNF receptor superfamily member 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32680 | chrY | 2787412 | C | T | Pathogenic | ClinVar | missense_variant | 9747 | no_assertion_criteria_provided | SRY | train | 1 | 6736 | 46,XY sex reversal 1 | sex determining region Y |
| 32681 | chrY | 2787426 | C | G | Pathogenic | ClinVar | missense_variant | 9739 | criteria_provided,_single_submitter | SRY | train | 1 | 6736 | not provided | sex determining region Y |
| 32682 | chrY | 2787515 | C | A | Pathogenic | ClinVar | missense_variant | 492908 | no_assertion_criteria_provided | SRY | train | 1 | 6736 | 46,XY sex reversal 1 | sex determining region Y |
| 32683 | chrY | 2787551 | C | T | Pathogenic | ClinVar | missense_variant | 9754 | no_assertion_criteria_provided | SRY | train | 1 | 6736 | 46,XY sex reversal 1 | sex determining region Y |
| 32684 | chrY | 7063898 | A | T | Pathogenic | ClinVar | missense_variant | 625467 | no_assertion_criteria_provided | LOC126057105, TBL1Y | train | 1 | 126057105, 90665 | Deafness, Y-linked 2 | P300/CBP strongly-dependent group 1 enhancer G... |

32685 rows × 15 columns
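
`GENE_Name` holds the full name for each entry in `GENE_ID`, and rows with several IDs carry several names. A minimal sketch of that lookup follows, assuming the ID-to-name pairs have already been retrieved (for example with Entrez Direct) into a dictionary. `GENE_NAMES` and `ids_to_names` are hypothetical names, and the `"; "` separator is an assumption, since the output above truncates the multi-gene value.

```python
# Stand-in for names retrieved from NCBI; values copied from the table above.
GENE_NAMES = {
    "375790": "agrin",
    "7293": "TNF receptor superfamily member 4",
    "6736": "sex determining region Y",
}


def ids_to_names(gene_ids, lookup=GENE_NAMES):
    """Map a comma-separated GENE_ID string to a '; '-joined string of names."""
    ids = [g.strip() for g in str(gene_ids).split(",") if g.strip()]
    # Unknown IDs stay visible in the result instead of being silently dropped.
    return "; ".join(lookup.get(g, f"unknown:{g}") for g in ids)


gene_id_column = ["375790", "7293", "6736, 999"]
gene_name_column = [ids_to_names(value) for value in gene_id_column]
print(gene_name_column)
```

With a pandas column the same function can be applied via `df["GENE_ID"].map(ids_to_names)`. Keeping unknown IDs marked makes gaps in the lookup table easy to spot before the final dataset is written out.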