{ "cells": [ { "cell_type": "markdown", "id": "0103d07d", "metadata": {}, "source": [ "# KEGG Data Processing Pipeline - Part 2: Variant Information Parsing and Sequence Generation\n", "\n", "## Overview\n", "\n", "This notebook is the second part of the KEGG data processing pipeline. It focuses on parsing variant information from KEGG data, generating nucleotide sequences with mutations, and creating disease mapping databases.\n", "\n", "## What This Notebook Does\n", "\n", "1. **Variant Information Parsing**: Extracts detailed information from KEGG variant files\n", "2. **Sequence Generation**: Creates reference and variant nucleotide sequences with genomic context\n", "3. **Disease Mapping**: Downloads and processes KEGG disease information\n", "4. **Data Integration**: Merges variant data with genomic sequences and disease annotations\n", "5. **Quality Control**: Validates reference sequences against the genome\n", "\n", "## Prerequisites\n", "\n", "**Required from Part 1 (KEGG_Data_1.ipynb):**\n", "- `gene_variants.txt` - List of variant identifiers\n", "- `variant_info/` directory - Individual variant information files\n", "- `final_network_with_variant.tsv` - Network and variant mapping data\n", "\n", "**Additional Requirements:**\n", "- Reference genome FASTA file (GRCh38)\n", "- BioPython for sequence processing\n", "- KEGG_pull for disease information retrieval\n", "\n", "## Required Packages\n", "\n", "```bash\n", "pip install biopython pandas kegg-pull\n", "```\n", "\n", "## Input Files Expected\n", "\n", "- `gene_variants.txt` - Variant identifiers from Part 1\n", "- `variant_info/*.txt` - Individual variant information files\n", "- `chromosomes.fasta` - Reference genome sequences\n", "- `final_network_with_variant.tsv` - Network-variant mapping\n", "\n", "## Output Files Generated\n", "\n", "- `nt_seq/` - Directory containing reference and variant sequences\n", "- `verification.txt` - Quality control results\n", "- `diseases.txt` - List of disease 
identifiers\n", "- `disease_info/` - Disease information files\n", "- Updated `final_network_with_variant.tsv` with disease names\n", "\n", "## Important Notes\n", "\n", "- **Memory Usage**: Processing large genomic sequences requires significant RAM\n", "- **Storage**: Generated sequence files can be several GB in size\n", "- **Processing Time**: Full pipeline may take several hours depending on dataset size\n", "- **Dependencies**: Requires successful completion of KEGG_Data_1.ipynb\n", "\n", "## Next Steps\n", "\n", "After completing this notebook, run `KEGG_Data_3.ipynb` for final dataset creation and sequence integration." ] }, { "cell_type": "markdown", "id": "ccc3ca96", "metadata": {}, "source": [ "## Configuration\n", "\n", "Set up paths and parameters for variant processing:" ] }, { "cell_type": "code", "execution_count": null, "id": "28d2629e", "metadata": {}, "outputs": [], "source": [ "# Configuration - Update these paths for your environment\n", "import os\n", "from pathlib import Path\n", "\n", "# Navigate to kegg_data directory\n", "data_dir = Path('kegg_data')\n", "if not data_dir.exists():\n", " print(\"āŒ kegg_data directory not found. 
Please run KEGG_Data_1.ipynb first.\")\n", " raise FileNotFoundError(\"kegg_data directory missing\")\n", "\n", "os.chdir(data_dir)\n", "\n", "# Configuration parameters\n", "CONFIG = {\n", " # Input files (should exist from Part 1)\n", " 'gene_variants_file': 'gene_variants.txt',\n", " 'variant_info_dir': 'variant_info',\n", " 'network_data_file': 'final_network_with_variant.tsv',\n", " \n", " # Reference genome (update path as needed)\n", " 'reference_fasta': 'chromosomes.fasta', # Update to your reference genome path\n", " \n", " # Output directories\n", " 'nt_seq_dir': 'nt_seq',\n", " 'disease_info_dir': 'disease_info',\n", " \n", " # Processing parameters\n", " 'sequence_window': 2000, # Nucleotides of context kept on EACH side of the variant\n", " 'verification_file': 'verification.txt',\n", " 'diseases_file': 'diseases.txt'\n", "}\n", "\n", "# Verify required input files\n", "required_files = [CONFIG['gene_variants_file'], CONFIG['network_data_file']]\n", "missing_files = []\n", "for file in required_files:\n", " if not os.path.exists(file):\n", " missing_files.append(file)\n", "\n", "if missing_files:\n", " print(f\"āŒ Missing required files: {missing_files}\")\n", " print(\"Please run KEGG_Data_1.ipynb first to generate these files.\")\n", "else:\n", " print(\"āœ… All required input files found\")\n", "\n", "# Create output directories\n", "for dir_name in [CONFIG['nt_seq_dir'], CONFIG['disease_info_dir']]:\n", " Path(dir_name).mkdir(exist_ok=True)\n", "\n", "print(f\"Working directory: {os.getcwd()}\")\n", "print(\"\\nšŸ“ Update CONFIG['reference_fasta'] with path to your reference genome file\")" ] }, { "cell_type": "code", "execution_count": null, "id": "d05a4d10-03de-42ae-89c1-5ddbe77043a7", "metadata": {}, "outputs": [], "source": [ "# Working directory already set in configuration section above\n", "print(f\"Current working directory: {os.getcwd()}\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "96662dbb-ee2c-4a74-8e45-ab58a3496976", "metadata": {}, "outputs": [], 
"source": [ "# Replace ':' with '_' in the variant IDs so they can be used as file names\n", "# (BSD/macOS sed syntax; on GNU/Linux use: sed -i 's/:/_/g' gene_variants.txt)\n", "!sed -i '' 's/:/_/g' gene_variants.txt" ] }, { "cell_type": "code", "execution_count": 10, "id": "db4f4cf2-cd95-4df8-99b6-cc112857502f", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# Report variants whose info file is missing a NAME record\n", "while read p; do\n", "    if ! grep -q NAME \"variant_info/$p.txt\"; then\n", "        echo \"$p\"\n", "    fi\n", "done < gene_variants.txt" ] }, { "cell_type": "code", "execution_count": 12, "id": "11959296-d5cb-4fb4-9914-83596dd41c86", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# Report variants whose info file is missing a GENE record\n", "while read p; do\n", "    if ! grep -q GENE \"variant_info/$p.txt\"; then\n", "        echo \"$p\"\n", "    fi\n", "done < gene_variants.txt" ] }, { "cell_type": "markdown", "id": "784d0394-1a14-471a-9def-f4877b4bbd4e", "metadata": {}, "source": [ "# Variant Information Parsing\n", "\n", "This section processes individual variant files to extract structured information including variant names, genes, and types." ] }, { "cell_type": "code", "execution_count": null, "id": "62b4167a-6d5a-4120-99fe-5678227db6cc", "metadata": {}, "outputs": [], "source": [ "# Working directory already set - proceeding with variant information parsing\n", "print(f\"Processing variant files from: {CONFIG['variant_info_dir']}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "3ed32b62-e3a6-4cff-b4ab-a80f04725a1c", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import re\n", "from pathlib import Path\n", "import os\n", "\n", "# Read all file names from gene_variants.txt\n", "gene_variants_file = CONFIG['gene_variants_file']\n", "if not os.path.exists(gene_variants_file):\n", " print(f\"āŒ Gene variants file not found: {gene_variants_file}\")\n", " print(\"Please run KEGG_Data_1.ipynb first to generate this file\")\n", " raise FileNotFoundError(f\"Gene variants file not found: {gene_variants_file}\")\n", "\n", "with open(gene_variants_file, 'r') as f:\n", " variant_files = [line.strip() for line in f if line.strip()]\n", "\n", "print(f\"Processing {len(variant_files)} variant 
files\")\n",
"\n",
"# Function to extract the value after a keyword (single line, rest of the line)\n",
"def extract_value(line, key):\n",
"    return line.split(key, 1)[-1].strip()\n",
"\n",
"# Process each variant file, collecting one row dict per file\n",
"variant_info_dir = Path(CONFIG['variant_info_dir'])\n",
"rows = []\n",
"processed_count = 0\n",
"not_found_count = 0\n",
"\n",
"for file_name in variant_files:\n",
"    file_path = variant_info_dir / f\"{file_name}.txt\"\n",
"\n",
"    try:\n",
"        with open(file_path, 'r') as f:\n",
"            lines = f.readlines()\n",
"\n",
"        name = \"\"\n",
"        gene = \"\"\n",
"        gene_info = \"\"\n",
"        type_info = \"\"\n",
"\n",
"        for line in lines:\n",
"            line = line.strip()\n",
"            if line.startswith(\"NAME\"):\n",
"                name = extract_value(line, \"NAME\")\n",
"            elif line.startswith(\"GENE\"):\n",
"                gene_data = extract_value(line, \"GENE\")\n",
"                if gene_data:\n",
"                    parts = gene_data.split(maxsplit=1)\n",
"                    gene = parts[0]\n",
"                    gene_info = parts[1] if len(parts) > 1 else \"\"\n",
"            elif line.startswith(\"TYPE\"):\n",
"                type_info = extract_value(line, \"TYPE\")\n",
"\n",
"        rows.append({\n",
"            \"Entry\": file_name,\n",
"            \"Variant_Name\": name,\n",
"            \"Variant_Gene\": gene,\n",
"            \"Variant_Gene Info\": gene_info,\n",
"            \"Variant_Type\": type_info\n",
"        })\n",
"        processed_count += 1\n",
"\n",
"        if processed_count % 100 == 0:\n",
"            print(f\"Processed {processed_count}/{len(variant_files)} files...\")\n",
"\n",
"    except FileNotFoundError:\n",
"        print(f\"[Warning] File not found: {file_path}\")\n",
"        not_found_count += 1\n",
"\n",
"# Build the DataFrame once at the end instead of concatenating row by row\n",
"variant_info = pd.DataFrame(rows, columns=[\"Entry\", \"Variant_Name\", \"Variant_Gene\", \"Variant_Gene Info\", \"Variant_Type\"])\n",
"\n",
"print(f\"āœ… Processing complete: {processed_count} files processed, {not_found_count} files not found\")\n",
"print(f\"Extracted information for {len(variant_info)} variants\")\n",
"\n",
"# 
Optional: Save the final table\n", "# variant_info.to_csv(\"parsed_variant_info.csv\", index=False)" ] }, { "cell_type": "code", "execution_count": 32, "id": "85e94a07-740d-44cd-a1c3-2330a30b99b1", "metadata": {}, "outputs": [], "source": [ "variant_info[\"Entry\"] = variant_info[\"Entry\"].str.replace(\"hsa_var_\", \"\", regex=False)" ] }, { "cell_type": "code", "execution_count": 33, "id": "4fc8fa00-2a28-4bd9-9aed-5c4602969cca", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Entry</th>\n",
"      <th>Variant_Name</th>\n",
"      <th>Variant_Gene</th>\n",
"      <th>Variant_Gene Info</th>\n",
"      <th>Variant_Type</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1019v2</td><td>CDK4 mutation</td><td>CDK4</td><td>cyclin dependent kinase 4 [KO:K02089]</td><td></td></tr>\n",
"    <tr><th>1</th><td>1027v3</td><td>CDKN1B mutation</td><td>CDKN1B</td><td>cyclin dependent kinase inhibitor 1B [KO:K06624]</td><td></td></tr>\n",
"    <tr><th>2</th><td>10280v1</td><td>SIGMAR1 mutation</td><td>SIGMAR1</td><td>sigma non-opioid intracellular receptor 1 [KO:...</td><td></td></tr>\n",
"    <tr><th>3</th><td>1029v2</td><td>CDKN2A mutation</td><td>CDKN2A</td><td>cyclin dependent kinase inhibitor 2A [KO:K06621]</td><td></td></tr>\n",
"    <tr><th>4</th><td>11315v1</td><td>PARK7 mutation</td><td>PARK7</td><td>Parkinsonism associated deglycase [KO:K05687]</td><td></td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>90</th><td>9049v1</td><td>AIP mutation</td><td>AIP</td><td>AHR interacting HSP90 co-chaperone [KO:K17767]</td><td></td></tr>\n",
"    <tr><th>91</th><td>9101v1</td><td>USP8 mutation</td><td>USP8</td><td>ubiquitin specific peptidase 8 [KO:K11839]</td><td></td></tr>\n",
"    <tr><th>92</th><td>9217v1</td><td>VAPB mutation</td><td>VAPB</td><td>VAMP associated protein B and C [KO:K10707]</td><td></td></tr>\n",
"    <tr><th>93</th><td>9817v1</td><td>KEAP1 mutation</td><td>KEAP1</td><td>kelch like ECH associated protein 1 [KO:K10456]</td><td></td></tr>\n",
"    <tr><th>94</th><td>999v2</td><td>CDH1 mutation</td><td>CDH1</td><td>cadherin 1 [KO:K05689]</td><td></td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>95 rows Ɨ 5 columns</p>\n",
"</div>
" ], "text/plain": [ " Entry Variant_Name Variant_Gene \\\n", "0 1019v2 CDK4 mutation CDK4 \n", "1 1027v3 CDKN1B mutation CDKN1B \n", "2 10280v1 SIGMAR1 mutation SIGMAR1 \n", "3 1029v2 CDKN2A mutation CDKN2A \n", "4 11315v1 PARK7 mutation PARK7 \n", ".. ... ... ... \n", "90 9049v1 AIP mutation AIP \n", "91 9101v1 USP8 mutation USP8 \n", "92 9217v1 VAPB mutation VAPB \n", "93 9817v1 KEAP1 mutation KEAP1 \n", "94 999v2 CDH1 mutation CDH1 \n", "\n", " Variant_Gene Info Variant_Type \n", "0 cyclin dependent kinase 4 [KO:K02089] \n", "1 cyclin dependent kinase inhibitor 1B [KO:K06624] \n", "2 sigma non-opioid intracellular receptor 1 [KO:... \n", "3 cyclin dependent kinase inhibitor 2A [KO:K06621] \n", "4 Parkinsonism associated deglycase [KO:K05687] \n", ".. ... ... \n", "90 AHR interacting HSP90 co-chaperone [KO:K17767] \n", "91 ubiquitin specific peptidase 8 [KO:K11839] \n", "92 VAMP associated protein B and C [KO:K10707] \n", "93 kelch like ECH associated protein 1 [KO:K10456] \n", "94 cadherin 1 [KO:K05689] \n", "\n", "[95 rows x 5 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "variant_info" ] }, { "cell_type": "markdown", "id": "485ddbd6", "metadata": {}, "source": [ "# Creating the Nt Variant Database\n", "\n", "# Nucleotide Sequence Database Creation\n", "\n", "This section creates nucleotide sequences with genomic context around each variant, generating both reference and mutated sequences for downstream analysis." 
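, "\n", "\n", "As a minimal sketch of the windowed mutation performed below (plain Python strings and hypothetical toy values; the notebook itself works on Biopython `Seq` objects with `CONFIG['sequence_window']`):\n", "\n", "```python\n", "def apply_variant(chrom_seq, start, ref, alt, window=2000):\n", "    \"\"\"Return (reference context, variant context) around a 0-based variant.\"\"\"\n", "    region_start = max(0, start - window)  # clip at the chromosome start\n", "    region_end = start + len(ref) + window\n", "    ref_seq = chrom_seq[region_start:region_end]\n", "    offset = start - region_start          # variant position inside ref_seq\n", "    if alt == \"deletion\":                  # KEGG marks deletions with this literal\n", "        mut_seq = ref_seq[:offset] + ref_seq[offset + len(ref):]\n", "    else:\n", "        mut_seq = ref_seq[:offset] + alt + ref_seq[offset + len(ref):]\n", "    return ref_seq, mut_seq\n", "\n", "# Toy example: substitute A→G at 0-based position 4 of a 10 nt 'chromosome'\n", "apply_variant(\"ACGTACGTAC\", start=4, ref=\"A\", alt=\"G\", window=3)  # ('CGTACGT', 'CGTGCGT')\n", "```"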
] }, { "cell_type": "code", "execution_count": null, "id": "c8dba21b", "metadata": {}, "outputs": [], "source": [ "# Working directory already set - proceeding with nucleotide sequence processing\n", "print(\"Starting nucleotide variant database creation...\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "8cf9f795", "metadata": {}, "outputs": [], "source": [ "from Bio import SeqIO\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "id": "0bc18349", "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "\n", "# Load network and variant data\n", "network_file = CONFIG['network_data_file']\n", "if not os.path.exists(network_file):\n", " print(f\"āŒ Network data file not found: {network_file}\")\n", " print(\"Please run KEGG_Data_1.ipynb first to generate this file\")\n", " raise FileNotFoundError(f\"Network data not found: {network_file}\")\n", "\n", "variant_data = pd.read_csv(network_file, sep='\\t')\n", "print(f\"āœ… Loaded variant data: {len(variant_data)} entries\")" ] }, { "cell_type": "code", "execution_count": 4, "id": "65dde804", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1449" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(variant_data)" ] }, { "cell_type": "code", "execution_count": 5, "id": "c042831c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'N00073'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "variant_data.iloc[1][\"Network\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "92e4699c", "metadata": {}, "outputs": [], "source": [ "from Bio import SeqIO\n", "import os\n", "\n", "# Load reference genome sequences\n", "fasta_file = CONFIG['reference_fasta']\n", "if not os.path.exists(fasta_file):\n", " print(f\"āŒ 
Reference genome file not found: {fasta_file}\")\n", " print(\"Please update CONFIG['reference_fasta'] with the correct path to your reference genome\")\n", " print(\"Download from: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/\")\n", " raise FileNotFoundError(f\"Reference genome not found: {fasta_file}\")\n", "\n", "print(f\"Loading reference genome from: {fasta_file}\")\n", "record_dict = SeqIO.to_dict(SeqIO.parse(fasta_file, \"fasta\"))\n", "print(f\"āœ… Loaded {len(record_dict)} chromosome sequences\")" ] }, { "cell_type": "code", "execution_count": 7, "id": "c2efa951", "metadata": {}, "outputs": [], "source": [ "# GRCh38 RefSeq accession per chromosome (\"23\" = X); \"8\" and \"22\" added for completeness\n", "chromosome_dictionary = {\n", "    \"1\": \"NC_000001.11\",\n", "    \"2\": \"NC_000002.12\",\n", "    \"3\": \"NC_000003.12\",\n", "    \"4\": \"NC_000004.12\",\n", "    \"5\": \"NC_000005.10\",\n", "    \"6\": \"NC_000006.12\",\n", "    \"7\": \"NC_000007.14\",\n", "    \"8\": \"NC_000008.11\",\n", "    \"9\": \"NC_000009.12\",\n", "    \"10\": \"NC_000010.11\",\n", "    \"11\": \"NC_000011.10\",\n", "    \"12\": \"NC_000012.12\",\n", "    \"13\": \"NC_000013.11\",\n", "    \"14\": \"NC_000014.9\",\n", "    \"15\": \"NC_000015.10\",\n", "    \"16\": \"NC_000016.10\",\n", "    \"17\": \"NC_000017.11\",\n", "    \"18\": \"NC_000018.10\",\n", "    \"19\": \"NC_000019.10\",\n", "    \"20\": \"NC_000020.11\",\n", "    \"21\": \"NC_000021.9\",\n", "    \"22\": \"NC_000022.11\",\n", "    \"23\": \"NC_000023.11\"\n", "}" ] }, { "cell_type": "markdown", "id": "a1323f95", "metadata": {}, "source": [ "### Verifying that the reference allele is present at the expected genomic position" ] }, { "cell_type": "code", "execution_count": null, "id": "c0ec0979", "metadata": {}, "outputs": [], "source": [ "# Verify reference sequences against genome\n", "verification_file = CONFIG['verification_file']\n", "print(f\"Starting sequence verification - results will be saved to: {verification_file}\")\n", "\n", "with open(verification_file, \"w\") as f:\n", " for i in range(len(variant_data)):\n", " # ---- Input ----\n", " chromosome_id = 
chromosome_dictionary[str(variant_data.iloc[i]['Chr'])]\n", " if (variant_data.iloc[i]['TranscriptID'][:4] == \"ENST\"):\n", " start = variant_data.iloc[i]['Start'] - 1\n", " else:\n", " start = variant_data.iloc[i]['Start']\n", " reference_allele = variant_data.iloc[i]['RefAllele']\n", " end = len(reference_allele) + start\n", "\n", " chrom_seq = record_dict[chromosome_id].seq\n", "\n", " # Adjust for 0-based indexing in Python\n", " genomic_ref = chrom_seq[start: start + len(reference_allele)]\n", "\n", " if genomic_ref.upper() != reference_allele.upper():\n", " f.write(f\"āš ļø Warning: Entry number {i} with variant {variant_data.iloc[i]['ID']} expected '{reference_allele}', but found '{genomic_ref}'\\n\")\n", " else:\n", " f.write(f\"āœ… Verified: {chromosome_id}:{start}-{end} → '{reference_allele}' matches genome\\n\")\n", " \n", " if (i + 1) % 100 == 0:\n", " print(f\"Verified {i + 1}/{len(variant_data)} variants...\")\n", "\n", "print(f\"āœ… Verification complete. Results saved to: {verification_file}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "39174efe", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "# Create nucleotide sequence directory\n", "nt_seq_dir = Path(CONFIG['nt_seq_dir'])\n", "nt_seq_dir.mkdir(exist_ok=True)\n", "print(f\"Created nucleotide sequence directory: {nt_seq_dir}\")" ] }, { "cell_type": "markdown", "id": "3065cf9d", "metadata": {}, "source": [ "### Performing the mutation and saving the reference and variant alleles with a CONFIG['sequence_window'] nt window (2000 nt by default) on each side" ] }, { "cell_type": "code", "execution_count": null, "id": "6121945f", "metadata": {}, "outputs": [], "source": [ "# Generate nucleotide sequences with mutations\n", "nt_seq_dir = CONFIG['nt_seq_dir']\n", "window = CONFIG['sequence_window']\n", "\n", "print(f\"Generating nucleotide sequences with {window}bp windows...\")\n", 
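"# NOTE: 'window' is the number of nucleotides of context kept on EACH side of\n", "# the variant, so each written sequence spans up to 2*window + allele length.\n", 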
"print(f\"Output directory: {nt_seq_dir}\")\n",
"\n",
"for i in range(len(variant_data)):\n",
"    output_file = f\"{nt_seq_dir}/{variant_data.iloc[i]['Var_ID']}.txt\"\n",
"\n",
"    with open(output_file, \"w\") as f:\n",
"        # ---- Input ----\n",
"        chromosome_id = chromosome_dictionary[str(variant_data.iloc[i]['Chr'])]\n",
"        # Ensembl coordinates are 1-based; convert to 0-based for Python slicing\n",
"        if variant_data.iloc[i]['TranscriptID'][:4] == \"ENST\":\n",
"            start = variant_data.iloc[i]['Start'] - 1\n",
"        else:\n",
"            start = variant_data.iloc[i]['Start']\n",
"        reference_allele = variant_data.iloc[i]['RefAllele']\n",
"        variant_allele = variant_data.iloc[i]['AltAllele']\n",
"\n",
"        end = len(reference_allele) + start\n",
"\n",
"        chrom_seq = record_dict[chromosome_id].seq\n",
"\n",
"        # Extract the surrounding region (clipped at the chromosome start)\n",
"        region_start = max(0, start - window)\n",
"        region_end = end + window\n",
"        offset = start - region_start  # variant position within ref_seq (equals window unless clipped)\n",
"\n",
"        ref_seq = chrom_seq[region_start:region_end]\n",
"\n",
"        # Apply the mutation at the variant position\n",
"        if variant_allele == \"deletion\":\n",
"            mutated_seq = ref_seq[:offset] + ref_seq[offset + len(reference_allele):]\n",
"        else:\n",
"            mutated_seq = ref_seq[:offset] + variant_allele + ref_seq[offset + len(reference_allele):]\n",
"\n",
"        f.write(f\">{variant_data.iloc[i]['ID']}_reference_{reference_allele}\\n\")\n",
"        f.write(f\"{ref_seq}\\n\")\n",
"        f.write(f\">{variant_data.iloc[i]['ID']}_variant_{variant_allele}\\n\")\n",
"        f.write(f\"{mutated_seq}\\n\")\n",
"\n",
"    if (i + 1) % 100 == 0:\n",
"        print(f\"Generated sequences for {i + 1}/{len(variant_data)} variants...\")\n",
"\n",
"print(f\"āœ… Sequence generation complete. 
{len(variant_data)} sequence files created in {nt_seq_dir}\")" ] }, { "cell_type": "markdown", "id": "a83e9272-b34f-40f3-aedf-3aca0795944f", "metadata": {}, "source": [ "# Data Integration\n", "\n", "This section merges variant information with the main dataset to create a comprehensive database with all relevant annotations." ] }, { "cell_type": "code", "execution_count": 34, "id": "9222e45a-7f9a-4762-8dd8-2cccc654ad3e", "metadata": {}, "outputs": [], "source": [ "# Inner join on 'Entry'; variants without parsed info are dropped\n", "final_data = variant_data.merge(variant_info, on='Entry')" ] }, { "cell_type": "code", "execution_count": null, "id": "ae6d44d0-d1f2-4d41-b59d-f8c5888b4914", "metadata": {}, "outputs": [], "source": [ "final_data" ] }, { "cell_type": "code", "execution_count": null, "id": "8ab406cd-e9be-4885-811a-f3e2526efe8a", "metadata": {}, "outputs": [], "source": [ "# Save merged variant data\n", "output_file = CONFIG['network_data_file']\n", "final_data.to_csv(output_file, sep='\\t', header=True, index=False)\n", "print(f\"āœ… Final variant data with merged information saved to: {output_file}\")\n", "print(f\"Dataset contains {len(final_data)} variants with complete information\")" ] }, { "cell_type": "markdown", "id": "2ecb5318-ab15-4625-b556-50f8ff39cff3", "metadata": {}, "source": [ "# Disease Information Processing\n", "\n", "This section extracts disease identifiers from the variant data and downloads corresponding disease information from KEGG to create human-readable disease names."
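, "\n", "\n", "Each row's `Disease` field holds a stringified dictionary keyed by KEGG disease IDs, so the IDs are recovered with `ast.literal_eval`. A minimal sketch with a hypothetical cell value (the dictionary values are unused below; only the keys matter):\n", "\n", "```python\n", "import ast\n", "\n", "cell = \"{'H00020': 'info', 'H00031': 'info'}\"  # hypothetical stringified dict from the TSV\n", "ids = list(ast.literal_eval(cell).keys())\n", "# ids == ['H00020', 'H00031']\n", "```"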
] }, { "cell_type": "code", "execution_count": 44, "id": "b266aa61-7a7f-49c7-a737-578b51b95f32", "metadata": {}, "outputs": [], "source": [ "import ast" ] }, { "cell_type": "code", "execution_count": 65, "id": "a7aa0417-b1c2-40c9-ad67-f2077d1f1d3e", "metadata": {}, "outputs": [], "source": [ "diseases = []" ] }, { "cell_type": "code", "execution_count": 66, "id": "a0865917-9074-43f4-98a1-74bdb456b2e5", "metadata": {}, "outputs": [], "source": [ "# Collect every disease ID referenced by any variant\n", "for i in range(len(final_data)):\n", "    diseases.extend(list(ast.literal_eval(final_data['Disease'][i]).keys()))" ] }, { "cell_type": "code", "execution_count": 74, "id": "8b469aee-d8fb-439d-a8bc-e8cb113ddc8f", "metadata": { "scrolled": true }, "outputs": [], "source": [ "unique_diseases = set(diseases)" ] }, { "cell_type": "code", "execution_count": null, "id": "e461b5d7-2200-4dbb-b640-ffd6bf2e3ac2", "metadata": {}, "outputs": [], "source": [ "# Save disease identifiers to file\n", "diseases_file = CONFIG['diseases_file']\n", "with open(diseases_file, 'w') as f:\n", "    for disease_id in unique_diseases:\n", "        f.write(f\"{disease_id}\\n\")\n", "\n", "print(f\"āœ… Saved {len(unique_diseases)} unique disease identifiers to: {diseases_file}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "d079c88f-9e8b-4f80-bf6c-5d9a49155b86", "metadata": {}, "outputs": [], "source": [ "# Working directory already set - proceeding with disease information retrieval\n", "print(\"Starting disease information processing...\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "10d814f3-66ec-4580-866e-2cc2fda34109", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "disease KEGG Disease Database\n", "ds Release 114.0+/04-28, Apr 25\n", " Kanehisa Laboratories\n", " 2,912 entries\n", "\n", "linked db pathway\n", " brite\n", " ko\n", " hsa\n", " genome\n", " network\n", " variant\n", " drug\n", " pubmed\n", "\n" ] } ], "source": [ "!kegg_pull rest info disease" ] }, { "cell_type": "code", "execution_count": null, "id": 
"5f095524-d58f-4869-9d1b-5459de85329d", "metadata": {}, "outputs": [], "source": [ "!kegg_pull --full-help" ] }, { "cell_type": "code", "execution_count": null, "id": "5ed80556-4df8-4f0b-8c3e-2a6458c6dd6d", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "# Create disease information directory\n", "disease_dir = Path(CONFIG['disease_info_dir'])\n", "disease_dir.mkdir(exist_ok=True)\n", "print(f\"Created disease information directory: {disease_dir}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "96851b67-0689-4aa0-9208-a0cdabf95425", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 44/44 [00:06<00:00, 6.56it/s]\n" ] } ], "source": [ "# Download disease information using kegg_pull\n", "diseases_file = CONFIG['diseases_file']\n", "disease_output_dir = CONFIG['disease_info_dir']\n", "\n", "if not os.path.exists(diseases_file):\n", "    print(f\"āŒ Diseases file not found: {diseases_file}\")\n", "    print(\"Please run the previous cells to generate the diseases list\")\n", "else:\n", "    print(f\"Downloading disease information for entries in: {diseases_file}\")\n", "    print(f\"Output directory: {disease_output_dir}\")\n", "    # Run the command to download disease information\n", "    !cat {diseases_file} | kegg_pull pull entry-ids - --output={disease_output_dir}\n", "    print(\"āœ… Disease information download complete\")" ] }, { "cell_type": "code", "execution_count": null, "id": "4c01f97c-6376-4266-97e9-1d29ef207a51", "metadata": {}, "outputs": [], "source": [ "# Processing disease information files\n", "print(\"Parsing disease information from KEGG files...\")" ] }, { "cell_type": "code", "execution_count": null, "id": 
"4ea01eac-ee3c-4a5e-9863-2fb061291b45", "metadata": {}, "outputs": [], "source": [ "# Parse disease information from downloaded files\n", "diseases_file = CONFIG['diseases_file']\n", "disease_info_dir = Path(CONFIG['disease_info_dir'])\n", "\n", "# Read all disease identifiers from diseases.txt\n", "with open(diseases_file, 'r') as f:\n", " disease_files = [line.strip() for line in f if line.strip()]\n", "\n", "print(f\"Processing {len(disease_files)} disease information files...\")\n", "\n", "# Initialize an empty dictionary\n", "disease_info = {}\n", "\n", "# Function to extract the value after a keyword\n", "def extract_value(line, key):\n", " return line.split(key, 1)[-1].strip()\n", "\n", "# Process each disease file\n", "processed_count = 0\n", "not_found_count = 0\n", "\n", "for disease_id in disease_files:\n", " file_path = disease_info_dir / f'{disease_id}.txt'\n", "\n", " try:\n", " with open(file_path, 'r') as f:\n", " lines = f.readlines()\n", "\n", " name = \"\"\n", "\n", " for line in lines:\n", " line = line.strip()\n", " if line.startswith(\"NAME\"):\n", " name = extract_value(line, \"NAME\")\n", " break # No need to check other lines once NAME is found\n", "\n", " # Save into dictionary: key = disease_id, value = name\n", " disease_info[disease_id] = name\n", " processed_count += 1\n", " \n", " if processed_count % 50 == 0:\n", " print(f\"Processed {processed_count}/{len(disease_files)} disease files...\")\n", "\n", " except FileNotFoundError:\n", " print(f\"[Warning] File not found: {file_path}\")\n", " not_found_count += 1\n", "\n", "print(f\"āœ… Disease processing complete: {processed_count} processed, {not_found_count} not found\")\n", "print(f\"Extracted disease information for {len(disease_info)} diseases\")\n", "\n", "# Optional: Save the dictionary to a file (like JSON)\n", "# import json\n", "# with open('disease_info.json', 'w') as f:\n", "# json.dump(disease_info, f, indent=2)" ] }, { "cell_type": "code", "execution_count": 8, "id": 
"4dfb4f25-776e-45c6-9eda-457b13cd77bf", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'H00135': 'Krabbe disease;',\n", " 'H01398': 'Primary hyperammonemia (Urea cycle disorders)',\n", " 'H00032': 'Thyroid cancer',\n", " 'H00559': 'von Hippel-Lindau syndrome',\n", " 'H00260': 'Pigmented micronodular adrenocortical disease',\n", " 'H00038': 'Melanoma',\n", " 'H00485': 'Robinow syndrome',\n", " 'H00251': 'Thyroid dyshormonogenesis;',\n", " 'H00194': 'Lesch-Nyhan syndrome;',\n", " 'H00026': 'Endometrial cancer',\n", " 'H00020': 'Colorectal cancer',\n", " 'H00031': 'Breast cancer',\n", " 'H02049': 'Bilateral macronodular adrenal hyperplasia',\n", " 'H00042': 'Glioma',\n", " 'H00063': 'Spinocerebellar ataxia (SCA)',\n", " 'H00195': 'Adenine phosphoribosyltransferase deficiency;',\n", " 'H00033': 'Adrenal carcinoma',\n", " 'H00048': 'Hepatocellular carcinoma;',\n", " 'H01522': 'Zollinger-Ellison syndrome',\n", " 'H00019': 'Pancreatic cancer',\n", " 'H00004': 'Chronic myeloid leukemia',\n", " 'H00058': 'Amyotrophic lateral sclerosis (ALS);',\n", " 'H00022': 'Bladder cancer',\n", " 'H00056': 'Alzheimer disease;',\n", " 'H01032': 'N-acetylglutamate synthase deficiency',\n", " 'H00247': 'Multiple endocrine neoplasia syndrome;',\n", " 'H00246': 'Primary hyperparathyroidism;',\n", " 'H00039': 'Basal cell carcinoma',\n", " 'H00021': 'Renal cell carcinoma',\n", " 'H00013': 'Small cell lung cancer',\n", " 'H00003': 'Acute myeloid leukemia',\n", " 'H00018': 'Gastric cancer',\n", " 'H01603': 'Primary aldosteronism',\n", " 'H00061': 'Prion disease',\n", " 'H00014': 'Non-small cell lung cancer',\n", " 'H00423': 'Sphingolipidosis',\n", " 'H00024': 'Prostate cancer',\n", " 'H01102': 'Pituitary adenomas',\n", " 'H00034': 'Carcinoid',\n", " 'H00059': 'Huntington disease',\n", " 'H01431': 'Cushing syndrome',\n", " 'H00057': 'Parkinson disease',\n", " 'H00126': 'Gaucher disease',\n", " 'H02221': 'Methylmalonic aciduria and homocystinuria'}" ] }, 
"execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "disease_info" ] }, { "cell_type": "code", "execution_count": null, "id": "458ca725-03e8-4b2a-98e7-f418f40190fb", "metadata": {}, "outputs": [], "source": [ "# Reload variant data for disease processing\n", "variant_data = pd.read_csv(CONFIG['network_data_file'], sep='\\t')\n", "print(f\"Processing disease information for {len(variant_data)} variants\")" ] }, { "cell_type": "code", "execution_count": 10, "id": "e86ddd65-cbde-42d3-be6f-cbc54e2dda06", "metadata": {}, "outputs": [], "source": [ "import ast\n", "\n", "# disease_info maps KEGG disease IDs to names, e.g. {'H00020': 'Colorectal cancer', ...}\n", "\n", "# Create a new column to store disease dictionaries\n", "variant_data[\"Disease_Names\"] = \"\"\n", "\n", "# Process each row\n", "for idx, row in variant_data.iterrows():\n", " try:\n", " # Convert the string dictionary into a real dictionary\n", " disease_dict = ast.literal_eval(row[\"Disease\"])\n", "\n", " # Get the disease IDs (keys)\n", " disease_ids = disease_dict.keys()\n", "\n", " # Build a new dictionary: {disease_id: disease_name}\n", " disease_names_dict = {did: disease_info.get(did, \"\") for did in disease_ids}\n", "\n", " # Save it into the Disease_Names column\n", " variant_data.at[idx, \"Disease_Names\"] = disease_names_dict\n", "\n", " except (ValueError, SyntaxError):\n", " print(f\"[Warning] Couldn't parse disease info at row {idx}\")\n", " variant_data.at[idx, \"Disease_Names\"] = {}" ] }, { "cell_type": "code", "execution_count": null, "id": "06a29f96-56b2-46b2-897e-d7006dd0ae52", "metadata": {}, "outputs": [], "source": [ "# Save updated variant data with disease names\n", "output_file = CONFIG['network_data_file']\n", "variant_data.to_csv(output_file, sep='\\t', header=True, index=False)\n", "print(f\"āœ… Updated variant data saved to: {output_file}\")\n", "print(f\"Dataset now includes disease names for {len(variant_data)} variants\")" ] 
} ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }