{ "cells": [ { "cell_type": "markdown", "id": "0103d07d", "metadata": {}, "source": [ "# KEGG Data Processing Pipeline - Part 2: Variant Information Parsing and Sequence Generation\n", "\n", "## Overview\n", "\n", "This notebook is the second part of the KEGG data processing pipeline. It focuses on parsing variant information from KEGG data, generating nucleotide sequences with mutations, and creating disease mapping databases.\n", "\n", "## What This Notebook Does\n", "\n", "1. **Variant Information Parsing**: Extracts detailed information from KEGG variant files\n", "2. **Sequence Generation**: Creates reference and variant nucleotide sequences with genomic context\n", "3. **Disease Mapping**: Downloads and processes KEGG disease information\n", "4. **Data Integration**: Merges variant data with genomic sequences and disease annotations\n", "5. **Quality Control**: Validates reference sequences against the genome\n", "\n", "## Prerequisites\n", "\n", "**Required from Part 1 (KEGG_Data_1.ipynb):**\n", "- `gene_variants.txt` - List of variant identifiers\n", "- `variant_info/` directory - Individual variant information files\n", "- `final_network_with_variant.tsv` - Network and variant mapping data\n", "\n", "**Additional Requirements:**\n", "- Reference genome FASTA file (GRCh38)\n", "- BioPython for sequence processing\n", "- KEGG_pull for disease information retrieval\n", "\n", "## Required Packages\n", "\n", "```bash\n", "pip install biopython pandas kegg-pull\n", "```\n", "\n", "## Input Files Expected\n", "\n", "- `gene_variants.txt` - Variant identifiers from Part 1\n", "- `variant_info/*.txt` - Individual variant information files\n", "- `chromosomes.fasta` - Reference genome sequences\n", "- `final_network_with_variant.tsv` - Network-variant mapping\n", "\n", "## Output Files Generated\n", "\n", "- `nt_seq/` - Directory containing reference and variant sequences\n", "- `verification.txt` - Quality control results\n", "- `diseases.txt` - List of disease 
identifiers\n", "- `disease_info/` - Disease information files\n", "- Updated `final_network_with_variant.tsv` with disease names\n", "\n", "## Important Notes\n", "\n", "- **Memory Usage**: Processing large genomic sequences requires significant RAM\n", "- **Storage**: Generated sequence files can be several GB in size\n", "- **Processing Time**: The full pipeline may take several hours, depending on dataset size\n", "- **Dependencies**: Requires successful completion of `KEGG_Data_1.ipynb`\n", "\n", "## Next Steps\n", "\n", "After completing this notebook, run `KEGG_Data_3.ipynb` for final dataset creation and sequence integration." ] },
{ "cell_type": "markdown", "id": "ccc3ca96", "metadata": {}, "source": [ "## Configuration\n", "\n", "Set up paths and parameters for variant processing:" ] },
{ "cell_type": "code", "execution_count": null, "id": "28d2629e", "metadata": {}, "outputs": [], "source": [ "# Configuration - Update these paths for your environment\n", "import os\n", "from pathlib import Path\n", "\n", "# Navigate to kegg_data directory\n", "data_dir = Path('kegg_data')\n", "if not data_dir.exists():\n", "    print(\"[Error] kegg_data directory not found. Please run KEGG_Data_1.ipynb first.\")\n", "    raise FileNotFoundError(\"kegg_data directory missing\")\n", "\n", "os.chdir(data_dir)\n", "\n", "# Configuration parameters\n", "CONFIG = {\n", "    # Input files (should exist from Part 1)\n", "    'gene_variants_file': 'gene_variants.txt',\n", "    'variant_info_dir': 'variant_info',\n", "    'network_data_file': 'final_network_with_variant.tsv',\n", "\n", "    # Reference genome (update path as needed)\n", "    'reference_fasta': 'chromosomes.fasta',  # Update to your reference genome path\n", "\n", "    # Output directories\n", "    'nt_seq_dir': 'nt_seq',\n", "    'disease_info_dir': 'disease_info',\n", "\n", "    # Processing parameters\n", "    'sequence_window': 2000,  # Nucleotides around variant\n", "    'verification_file': 'verification.txt',\n", "    'diseases_file': 'diseases.txt'\n", "}\n", "\n", "# Verify required input files (names taken from CONFIG so they stay in sync)\n", "required_files = [CONFIG['gene_variants_file'], CONFIG['network_data_file']]\n", "missing_files = []\n", "for file in required_files:\n", "    if not os.path.exists(file):\n", "        missing_files.append(file)\n", "\n", "if missing_files:\n", "    print(f\"[Error] Missing required files: {missing_files}\")\n", "    print(\"Please run KEGG_Data_1.ipynb first to generate these files.\")\n", "else:\n", "    print(\"[OK] All required input files found\")\n", "\n", "# Create output directories\n", "for dir_name in [CONFIG['nt_seq_dir'], CONFIG['disease_info_dir']]:\n", "    Path(dir_name).mkdir(exist_ok=True)\n", "\n", "print(f\"Working directory: {os.getcwd()}\")\n", "print(\"\\n[Note] Update CONFIG['reference_fasta'] with the path to your reference genome file\")" ] },
{ "cell_type": "code", "execution_count": null, "id": "d05a4d10-03de-42ae-89c1-5ddbe77043a7", "metadata": {}, "outputs": [], "source": [ "# Working directory already set in configuration section above\n", "print(f\"Current working directory: {os.getcwd()}\")" ] },
{ "cell_type": "code", "execution_count": 3, "id": "96662dbb-ee2c-4a74-8e45-ab58a3496976", "metadata": {}, "outputs": [], "source": 
[ "%%bash\n", "sed -i '' 's/:/_/g' gene_variants.txt  # BSD/macOS form; with GNU sed use: sed -i 's/:/_/g' gene_variants.txt" ] },
{ "cell_type": "code", "execution_count": 10, "id": "db4f4cf2-cd95-4df8-99b6-cc112857502f", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# List variants whose info file has no NAME field\n", "while IFS= read -r p; do\n", "    if ! grep -q NAME \"variant_info/$p.txt\"; then\n", "        echo \"$p\"\n", "    fi\n", "done < gene_variants.txt" ] },
{ "cell_type": "code", "execution_count": 12, "id": "11959296-d5cb-4fb4-9914-83596dd41c86", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# List variants whose info file has no GENE field\n", "while IFS= read -r p; do\n", "    if ! grep -q GENE \"variant_info/$p.txt\"; then\n", "        echo \"$p\"\n", "    fi\n", "done < gene_variants.txt" ] },
{ "cell_type": "markdown", "id": "784d0394-1a14-471a-9def-f4877b4bbd4e", "metadata": {}, "source": [ "# Variant Information Parsing\n", "\n", "This section parses each file in `variant_info/` and extracts the `NAME`, `GENE`, and `TYPE` fields into a table of variant names, genes, and types." ] },
{ "cell_type": "code", "execution_count": null, "id": "62b4167a-6d5a-4120-99fe-5678227db6cc", "metadata": {}, "outputs": [], "source": [ "# Working directory already set - proceeding with variant information parsing\n", "print(f\"Processing variant files from: {CONFIG['variant_info_dir']}\")" ] },
{ "cell_type": "code", "execution_count": null, "id": "3ed32b62-e3a6-4cff-b4ab-a80f04725a1c", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import re\n", "from pathlib import Path\n", "import os\n", "\n", "# Read all file names from gene_variants.txt\n", "gene_variants_file = CONFIG['gene_variants_file']\n", "if not os.path.exists(gene_variants_file):\n", "    print(f\"[Error] Gene variants file not found: {gene_variants_file}\")\n", "    print(\"Please run KEGG_Data_1.ipynb first to generate this file\")\n", "    raise FileNotFoundError(f\"Gene variants file not found: {gene_variants_file}\")\n", "\n", "with open(gene_variants_file, 'r') as f:\n", "    variant_files = [line.strip() for line in f if line.strip()]\n", "\n", "print(f\"Processing {len(variant_files)} variant 
files\")\n", "\n", "# Collect one row per variant; the DataFrame is built once after the loop\n", "columns = [\"Entry\", \"Variant_Name\", \"Variant_Gene\", \"Variant_Gene Info\", \"Variant_Type\"]\n", "rows = []\n", "\n", "# Function to extract the value after a keyword (single line, rest of the line)\n", "def extract_value(line, key):\n", "    return line.split(key, 1)[-1].strip()\n", "\n", "# Process each variant file\n", "variant_info_dir = Path(CONFIG['variant_info_dir'])\n", "processed_count = 0\n", "not_found_count = 0\n", "\n", "for file_name in variant_files:\n", "    file_path = variant_info_dir / f\"{file_name}.txt\"\n", "\n", "    try:\n", "        with open(file_path, 'r') as f:\n", "            lines = f.readlines()\n", "\n", "        name = \"\"\n", "        gene = \"\"\n", "        gene_info = \"\"\n", "        type_info = \"\"\n", "\n", "        for line in lines:\n", "            line = line.strip()\n", "            if line.startswith(\"NAME\"):\n", "                name = extract_value(line, \"NAME\")\n", "            elif line.startswith(\"GENE\"):\n", "                # First token is the gene symbol; the rest is its description\n", "                gene_data = extract_value(line, \"GENE\")\n", "                if gene_data:\n", "                    parts = gene_data.split(maxsplit=1)\n", "                    gene = parts[0]\n", "                    gene_info = parts[1] if len(parts) > 1 else \"\"\n", "            elif line.startswith(\"TYPE\"):\n", "                type_info = extract_value(line, \"TYPE\")\n", "\n", "        row = {\n", "            \"Entry\": file_name,\n", "            \"Variant_Name\": name,\n", "            \"Variant_Gene\": gene,\n", "            \"Variant_Gene Info\": gene_info,\n", "            \"Variant_Type\": type_info\n", "        }\n", "\n", "        rows.append(row)\n", "        processed_count += 1\n", "\n", "        if processed_count % 100 == 0:\n", "            print(f\"Processed {processed_count}/{len(variant_files)} files...\")\n", "\n", "    except FileNotFoundError:\n", "        print(f\"[Warning] File not found: {file_path}\")\n", "        not_found_count += 1\n", "\n", "# Build the table in one step instead of concatenating inside the loop\n", "variant_info = pd.DataFrame(rows, columns=columns)\n", "\n", "print(f\"[OK] Processing complete: {processed_count} files processed, {not_found_count} files not found\")\n", "print(f\"Extracted information for {len(variant_info)} variants\")\n", "\n", "# Optional: Save the final table\n", "# variant_info.to_csv(\"parsed_variant_info.csv\", index=False)" ] },
{ "cell_type": "code", "execution_count": 32, "id": "85e94a07-740d-44cd-a1c3-2330a30b99b1", "metadata": {}, "outputs": [], "source": [ "variant_info[\"Entry\"] = variant_info[\"Entry\"].str.replace(\"hsa_var_\", \"\", regex=False)" ] },
{ "cell_type": "code", "execution_count": 33, "id": "4fc8fa00-2a28-4bd9-9aed-5c4602969cca", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | Entry | Variant_Name | Variant_Gene | Variant_Gene Info | Variant_Type |
|---|---|---|---|---|---|
| 0 | 1019v2 | CDK4 mutation | CDK4 | cyclin dependent kinase 4 [KO:K02089] |  |
| 1 | 1027v3 | CDKN1B mutation | CDKN1B | cyclin dependent kinase inhibitor 1B [KO:K06624] |  |
| 2 | 10280v1 | SIGMAR1 mutation | SIGMAR1 | sigma non-opioid intracellular receptor 1 [KO:... |  |
| 3 | 1029v2 | CDKN2A mutation | CDKN2A | cyclin dependent kinase inhibitor 2A [KO:K06621] |  |
| 4 | 11315v1 | PARK7 mutation | PARK7 | Parkinsonism associated deglycase [KO:K05687] |  |
| ... | ... | ... | ... | ... | ... |
| 90 | 9049v1 | AIP mutation | AIP | AHR interacting HSP90 co-chaperone [KO:K17767] |  |
| 91 | 9101v1 | USP8 mutation | USP8 | ubiquitin specific peptidase 8 [KO:K11839] |  |
| 92 | 9217v1 | VAPB mutation | VAPB | VAMP associated protein B and C [KO:K10707] |  |
| 93 | 9817v1 | KEAP1 mutation | KEAP1 | kelch like ECH associated protein 1 [KO:K10456] |  |
| 94 | 999v2 | CDH1 mutation | CDH1 | cadherin 1 [KO:K05689] |  |

95 rows × 5 columns
\n", "