{ "cells": [ { "cell_type": "markdown", "id": "0510f375", "metadata": {}, "source": [ "## Configuration\n", "\n", "Set up parameters and data sources for variant effect prediction tasks:" ] }, { "cell_type": "code", "execution_count": null, "id": "7d59a5d5", "metadata": {}, "outputs": [], "source": [ "# Configuration - Update these parameters for your environment\n", "import os\n", "from pathlib import Path\n", "import random\n", "\n", "# Set random seed for reproducible question assignment\n", "RANDOM_SEED = 42\n", "random.seed(RANDOM_SEED)\n", "\n", "# Configuration parameters\n", "CONFIG = {\n", " # Data source configurations\n", " 'huggingface_repo': 'wanglab/bioR_tasks', # Update with your repository\n", " \n", " # Local data paths (update these if using local files)\n", " 'local_data_dir': 'data',\n", " \n", " # Output configurations\n", " 'output_dir': 'output_datasets',\n", " 'save_local': True, # Save datasets locally\n", " 'upload_to_hub': False, # Set to True to upload to HuggingFace Hub\n", " \n", " # Processing parameters\n", " 'question_variants': 50, # Number of question templates per task\n", " 'batch_size': 1000, # For memory-efficient processing\n", " \n", " # Task configurations\n", " 'tasks': {\n", " 'task1': {'name': 'variant_effect_coding', 'description': 'Pathogenic vs Benign classification'},\n", " 'task2': {'name': 'variant_effect_causal_eqtl', 'description': 'Gene expression change prediction'},\n", " 'task3': {'name': 'variant_effect_pathogenic_omim', 'description': 'OMIM pathogenic classification'},\n", " 'task4_snv': {'name': 'task4_variant_effect_snv', 'description': 'SNV effect prediction'},\n", " 'task4_non_snv': {'name': 'task4_variant_effect_non_snv', 'description': 'Non-SNV effect prediction'}\n", " }\n", "}\n", "\n", "# Create output directory\n", "Path(CONFIG['output_dir']).mkdir(exist_ok=True)\n", "\n", "print(\"Configuration loaded:\")\n", "print(f\" Random seed: {RANDOM_SEED}\")\n", "print(f\" Output directory: 
{CONFIG['output_dir']}\")\n", "print(f\" Upload to hub: {CONFIG['upload_to_hub']}\")\n", "print(f\" Repository: {CONFIG['huggingface_repo']}\")\n", "print(\"\\nUpdate the CONFIG dictionary above with your specific settings\")" ] }, { "cell_type": "markdown", "id": "e4a1e6bc-e3e6-4084-a42e-ba988c3afa4a", "metadata": {}, "source": [ "# Variant Effect Prediction Tasks - Dataset Creation Pipeline\n", "\n", "## Overview\n", "\n", "This notebook creates standardized datasets for variant effect prediction tasks using various genomic databases. It processes raw variant data into machine learning-ready formats with contextualized questions and standardized answers.\n", "\n", "## What This Notebook Does\n", "\n", "1. **Task 1**: Variant Effect Prediction (Pathogenic vs Benign) using ClinVar data\n", "2. **Task 2**: Causal eQTL Analysis (Gene Expression Changes)\n", "3. **Task 3**: Pathogenic Variant Classification using OMIM data\n", "4. **Task 4**: SNV and Non-SNV Variant Effect Prediction\n", "\n", "## Key Features\n", "\n", "- **Question Diversification**: 50 unique question templates per task type\n", "- **Standardized Format**: Consistent ID, question, answer, sequence structure\n", "- **Multiple Data Sources**: ClinVar, OMIM, eQTL databases\n", "- **Publication-Ready**: Clean, documented datasets ready for research use\n", "\n", "## Dataset Structure\n", "\n", "Each task generates datasets with the following fields:\n", "- `ID`: Unique identifier for each variant\n", "- `question`: Contextualized biological question\n", "- `answer`: Standardized response (pathogenic/benign, disease name, etc.)\n", "- `reference_sequence`: Original genomic sequence\n", "- `variant_sequence`: Mutated genomic sequence\n", "\n", "## Prerequisites\n", "\n", "```bash\n", "pip install datasets pandas numpy\n", "```\n", "\n", "## Usage\n", "\n", "1. **Configure Data Sources**: Update file paths and dataset configurations\n", "2. 
**Run Tasks Sequentially**: Execute each task section in order\n", "3. **Review Outputs**: Validate generated datasets before publication\n", "4. **Export**: Datasets are saved locally and optionally uploaded to repositories\n", "\n", "## Important Notes\n", "\n", "- **Data Privacy**: All personal references have been removed\n", "- **Reproducibility**: A fixed random seed (`RANDOM_SEED = 42`) keeps question assignment consistent across runs\n", "- **Memory Usage**: Large datasets may require substantial RAM\n", "- **File Paths**: Update all hardcoded paths to use relative or configurable paths\n", "\n", "## Output\n", "\n", "Generated datasets are suitable for:\n", "- Variant effect prediction model training\n", "- Biological reasoning benchmarks\n", "- Genomic language model evaluation\n", "- Clinical variant interpretation research" ] }, { "cell_type": "markdown", "id": "67ff57a4-00e0-41a4-aac1-18f4e1c68be8", "metadata": {}, "source": [ "## Task 1: Variant Effect Prediction (Pathogenic vs Benign)\n", "\n", "**Objective**: Classify genetic variants as pathogenic or benign based on chromosomal location and gene context.\n", "\n", "**Data Source**: ClinVar database with pathogenic variant annotations\n", "\n", "**Question Types**: 50 different question templates incorporating:\n", "- Chromosome location\n", "- Gene information (when available)\n", "- Clinical significance assessment\n", "\n", "**Output Format**: Binary classification with disease association when applicable" ] }, { "cell_type": "code", "execution_count": 1, "id": "73559f7e-ade8-4d84-84d4-859fbbd0c575", "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset\n", "import pandas as pd\n", "import json\n", "import random" ] }, { "cell_type": "code", "execution_count": null, "id": "3d2ffcc6-5e33-4cc4-a46d-2c8f67bd20a6", "metadata": {}, "outputs": [], "source": [ "dataset = load_dataset(\"wanglab/bioR_tasks\", 'variant_effect_pathogenic_clinvar')" ] }, { 
"cell_type": "markdown", "id": "e31d1d0b-5f13-468c-a2cf-704499214a04", "metadata": {}, "source": [ "**With GPT-4o, I created 50 different versions of this question prompt.**" ] }, { "cell_type": "code", "execution_count": null, "id": "d3301da8-bed2-4c0f-835c-6613df88dd11", "metadata": {}, "outputs": [], "source": [ "# Core imports for dataset processing\n", "from datasets import load_dataset, Dataset, DatasetDict\n", "import pandas as pd\n", "import numpy as np\n", "import json\n", "import random\n", "from pathlib import Path\n", "\n", "# Set random seed for reproducibility (falls back to 42 when CONFIG has no 'random_seed' key)\n", "random.seed(CONFIG.get('random_seed', 42))\n", "\n", "print(\"Core libraries imported\")\n", "print(f\"Random seed set to: {CONFIG.get('random_seed', 42)}\")\n", "\n", "# Must call format in order of chromosome, gene, gene_name\n", "question_variants_50 = [\"This variant lies on Chromosome {0} and affects the gene {1} ({2}). Based on this context, is the mutation pathogenic or benign? If pathogenic, what disease does it cause?\",\n", "\"Located on Chromosome {0}, this mutation impacts {1} ({2}). What is its biological consequence - is it benign or pathogenic, and which disease is associated if any?\",\n", "\"A genetic alteration is present in {1} ({2}) on Chromosome {0}. Is this variant benign or disease-causing, and if the latter, which condition is involved?\",\n", "\"This variant affects the gene {1} ({2}) found on Chromosome {0}. What is the clinical effect of this variant - benign or pathogenic? State the disease if applicable.\",\n", "\"With a mutation on Chromosome {0} in gene {1} ({2}), classify this variant as benign or pathogenic. Include the disease if it's pathogenic.\",\n", "\"This sequence change occurs on Chromosome {0}, altering {1} ({2}). What is the medical significance of this variant - is it benign or linked to a disease?\",\n", "\"Here is a variant affecting {1} ({2}) on Chromosome {0}. 
Please identify whether it is a benign mutation or associated with a disorder.\",\n", "\"A variant on Chromosome {0} in gene {1} ({2}) has been observed. Is this a neutral mutation, or does it result in a disease? If so, which one?\",\n", "\"The gene {1} ({2}) on Chromosome {0} contains a mutation. Based on this information, is the variant pathogenic or benign? Provide the disease if relevant.\",\n", "\"This genomic variant is located on Chromosome {0}, within the {1} ({2}) gene. Can you determine its pathogenicity and name any linked disease?\",\n", "\"A mutation found in {1} ({2}) on Chromosome {0} may be clinically relevant. Is it pathogenic or benign, and if the former, which disease is implicated?\",\n", "\"Given a variant located on Chromosome {0} and affecting {1} ({2}), assess whether it is benign or pathogenic. Indicate the associated disease if pathogenic.\",\n", "\"This mutation is located in gene {1} ({2}) on Chromosome {0}. Is it associated with a disease or is it a benign polymorphism?\",\n", "\"A variant has been detected on Chromosome {0} in {1} ({2}). What is its effect - pathogenic or benign? If pathogenic, name the disease.\",\n", "\"The variant affects gene {1} ({2}), which is on Chromosome {0}. Please evaluate whether this mutation is benign or pathogenic and specify the disease if necessary.\",\n", "\"This alteration in {1} ({2}) on Chromosome {0} may affect gene function. Does it lead to a disease or is it benign?\",\n", "\"Given this variant in gene {1} ({2}) on Chromosome {0}, classify it as benign or pathogenic. Include the disorder it may cause if applicable.\",\n", "\"A variant was discovered on Chromosome {0}, affecting {1} ({2}). What is its functional impact - neutral or pathogenic? State the disease if pathogenic.\",\n", "\"This gene mutation involves {1} ({2}) on Chromosome {0}. Is it associated with any clinical condition, or is it benign?\",\n", "\"The gene {1} ({2}) on Chromosome {0} carries this variant. 
Does this mutation lead to a specific disease, or is it non-pathogenic?\",\n", "\"Here is a mutation in {1} ({2}) on Chromosome {0}. Determine whether it's benign or pathogenic. If the latter, what disease does it cause?\",\n", "\"A variant found in Chromosome {0} affects {1} ({2}). Please analyze its biological impact: is it benign or pathogenic, and what condition might it cause?\",\n", "\"The following genetic variant occurs in {1} ({2}) on Chromosome {0}. Classify its clinical effect - pathogenic or benign - and list any associated condition.\",\n", "\"This alteration occurs within gene {1} ({2}) located on Chromosome {0}. Is it associated with a disease or is it a benign variant?\",\n", "\"A mutation on Chromosome {0} affecting {1} ({2}) has been found. Is it harmful or harmless? What disease, if any, does it cause?\",\n", "\"Gene {1} ({2}) on Chromosome {0} is impacted by this variant. Evaluate whether it is clinically benign or pathogenic and name the disorder if relevant.\",\n", "\"Consider this mutation in {1} ({2}) on Chromosome {0}. Is this a benign change or a disease-causing variant?\",\n", "\"A variant was discovered in gene {1} ({2}), Chromosome {0}. Please indicate if this mutation results in a known disease or if it's non-harmful.\",\n", "\"Given this context: Chromosome {0}, gene {1} ({2}) - does this variant present pathogenic behavior, and if so, what disease does it relate to?\",\n", "\"This sequence variant lies in {1} ({2}) on Chromosome {0}. Is it clinically significant, and what condition might it cause if any?\",\n", "\"A mutation in {1} ({2}), located on Chromosome {0}, is being studied. Determine whether it's pathogenic or benign, and specify the linked disease.\",\n", "\"Here is a genetic alteration in {1} ({2}) on Chromosome {0}. Based on the data, is it a benign variant or a cause of disease?\",\n", "\"Mutation context: Chromosome {0}, Gene {1} ({2}). Determine if this variant is likely to be benign or pathogenic. 
Mention the disease if applicable.\",\n", "\"A sequence alteration has been identified in {1} ({2}) on Chromosome {0}. Is it disease-inducing or harmless?\",\n", "\"Chromosome {0} houses a mutation in gene {1} ({2}). Classify its clinical impact - is it pathogenic or benign, and what disease does it lead to if any?\",\n", "\"This variant affects gene {1} ({2}) located on Chromosome {0}. Evaluate its biological effect and specify any disease association.\",\n", "\"Gene {1} ({2}) on Chromosome {0} is altered by this variant. Does this mutation result in a disease or is it benign?\",\n", "\"Assess the clinical impact of this variant on gene {1} ({2}), found on Chromosome {0}. State whether it's pathogenic or benign, and the disease if applicable.\",\n", "\"This is a variant in {1} ({2}), located on Chromosome {0}. Is this mutation a likely cause of disease or not?\",\n", "\"A change on Chromosome {0} affects gene {1} ({2}). Identify whether the variant is neutral or disease-linked. Mention the disease if applicable.\",\n", "\"This variant impacts the gene {1} ({2}) on Chromosome {0}. Is the change likely to result in a pathogenic outcome?\",\n", "\"The gene {1} ({2}) is located on Chromosome {0}, where a mutation has occurred. What is the medical relevance of this mutation?\",\n", "\"A variant affecting Chromosome {0}, within the gene {1} ({2}), has been observed. Determine if it's benign or associated with disease.\",\n", "\"This mutation occurs in {1} ({2}) on Chromosome {0}. Does this change lead to a known medical condition, or is it benign?\",\n", "\"Gene {1} ({2}), found on Chromosome {0}, is impacted by this variant. What is the biological outcome - benign or pathogenic?\",\n", "\"Consider a variant on Chromosome {0} in gene {1} ({2}). Determine its clinical classification and disease relevance.\",\n", "\"An alteration has been detected in {1} ({2}) on Chromosome {0}. 
Is it pathogenic, and if so, what disease is involved?\",\n", "\"The gene {1} ({2}), on Chromosome {0}, contains a mutation. Does this mutation cause a disorder, or is it a benign change?\",\n", "\"Here's a variant in {1} ({2}) located on Chromosome {0}. What is the predicted biological effect - harmless or disease-causing?\",\n", "\"A genomic change on Chromosome {0} affects {1} ({2}). Classify this variant as benign or pathogenic, and name the disease if relevant.\"]\n", "\n", "# Load Task 1 dataset from configured source\n", "try:\n", "    dataset = load_dataset(CONFIG['huggingface_repo'], 'variant_effect_pathogenic_clinvar')\n", "    print(f\"Loaded dataset from: {CONFIG['huggingface_repo']}\")\n", "    print(f\"Train samples: {len(dataset['train'])}\")\n", "    print(f\"Test samples: {len(dataset['test'])}\")\n", "except Exception as e:\n", "    print(f\"Error loading dataset: {e}\")\n", "    print(\"Please check the repository name and dataset configuration\")\n", "    raise" ] }, { "cell_type": "code", "execution_count": null, "id": "5a418765-c127-44a8-8003-822b413cc907", "metadata": {}, "outputs": [], "source": [ "question_variants_50_no_gene = [\n", " \"This variant lies on Chromosome {0}. Based on this context, is the mutation pathogenic or benign? If pathogenic, what disease does it cause?\",\n", " \"Located on Chromosome {0}, this mutation has been observed. What is its biological consequence - is it benign or pathogenic, and which disease is associated if any?\",\n", " \"A genetic alteration is present on Chromosome {0}. Is this variant benign or disease-causing, and if the latter, which condition is involved?\",\n", " \"This variant is found on Chromosome {0}. What is the clinical effect of this variant - benign or pathogenic? State the disease if applicable.\",\n", " \"With a mutation on Chromosome {0}, classify this variant as benign or pathogenic. Include the disease if it's pathogenic.\",\n", " \"This sequence change occurs on Chromosome {0}. 
What is the medical significance of this variant - is it benign or linked to a disease?\",\n", " \"Here is a variant on Chromosome {0}. Please identify whether it is a benign mutation or associated with a disorder.\",\n", " \"A variant on Chromosome {0} has been observed. Is this a neutral mutation, or does it result in a disease? If so, which one?\",\n", " \"A mutation is present on Chromosome {0}. Based on this information, is the variant pathogenic or benign? Provide the disease if relevant.\",\n", " \"This genomic variant is located on Chromosome {0}. Can you determine its pathogenicity and name any linked disease?\",\n", " \"A mutation found on Chromosome {0} may be clinically relevant. Is it pathogenic or benign, and if the former, which disease is implicated?\",\n", " \"Given a variant located on Chromosome {0}, assess whether it is benign or pathogenic. Indicate the associated disease if pathogenic.\",\n", " \"This mutation is located on Chromosome {0}. Is it associated with a disease or is it a benign polymorphism?\",\n", " \"A variant has been detected on Chromosome {0}. What is its effect - pathogenic or benign? If pathogenic, name the disease.\",\n", " \"A mutation on Chromosome {0} is under review. Please evaluate whether this mutation is benign or pathogenic and specify the disease if necessary.\",\n", " \"This alteration on Chromosome {0} may affect genome function. Does it lead to a disease or is it benign?\",\n", " \"Given this variant on Chromosome {0}, classify it as benign or pathogenic. Include the disorder it may cause if applicable.\",\n", " \"A variant was discovered on Chromosome {0}. What is its functional impact - neutral or pathogenic? State the disease if pathogenic.\",\n", " \"This mutation on Chromosome {0} may be significant. Is it associated with any clinical condition, or is it benign?\",\n", " \"Chromosome {0} carries this variant. 
Does this mutation lead to a specific disease, or is it non-pathogenic?\",\n", " \"Here is a mutation located on Chromosome {0}. Determine whether it's benign or pathogenic. If the latter, what disease does it cause?\",\n", " \"A variant found on Chromosome {0} is being studied. Please analyze its biological impact: is it benign or pathogenic, and what condition might it cause?\",\n", " \"The following genetic variant occurs on Chromosome {0}. Classify its clinical effect - pathogenic or benign - and list any associated condition.\",\n", " \"This alteration occurs on Chromosome {0}. Is it associated with a disease or is it a benign variant?\",\n", " \"A mutation on Chromosome {0} has been found. Is it harmful or harmless? What disease, if any, does it cause?\",\n", " \"A variant on Chromosome {0} is under investigation. Evaluate whether it is clinically benign or pathogenic and name the disorder if relevant.\",\n", " \"Consider this mutation on Chromosome {0}. Is this a benign change or a disease-causing variant?\",\n", " \"A variant was discovered on Chromosome {0}. Please indicate if this mutation results in a known disease or if it's non-harmful.\",\n", " \"Given this context: Chromosome {0} - does this variant present pathogenic behavior, and if so, what disease does it relate to?\",\n", " \"This sequence variant lies on Chromosome {0}. Is it clinically significant, and what condition might it cause if any?\",\n", " \"A mutation located on Chromosome {0} is being studied. Determine whether it's pathogenic or benign, and specify the linked disease.\",\n", " \"Here is a genetic alteration on Chromosome {0}. Based on the data, is it a benign variant or a cause of disease?\",\n", " \"Mutation context: Chromosome {0}. Determine if this variant is likely to be benign or pathogenic. Mention the disease if applicable.\",\n", " \"A sequence alteration has been identified on Chromosome {0}. Is it disease-inducing or harmless?\",\n", " \"Chromosome {0} houses a mutation. 
Classify its clinical impact - is it pathogenic or benign, and what disease does it lead to if any?\",\n", " \"This variant is located on Chromosome {0}. Evaluate its biological effect and specify any disease association.\",\n", " \"Chromosome {0} is altered by this variant. Does this mutation result in a disease or is it benign?\",\n", " \"Assess the clinical impact of this variant found on Chromosome {0}. State whether it's pathogenic or benign, and the disease if applicable.\",\n", " \"This is a variant located on Chromosome {0}. Is this mutation a likely cause of disease or not?\",\n", " \"A change on Chromosome {0} is being evaluated. Identify whether the variant is neutral or disease-linked. Mention the disease if applicable.\",\n", " \"This variant is present on Chromosome {0}. Is the change likely to result in a pathogenic outcome?\",\n", " \"A mutation has occurred on Chromosome {0}. What is the medical relevance of this mutation?\",\n", " \"A variant affecting Chromosome {0} has been observed. Determine if it's benign or associated with disease.\",\n", " \"This mutation occurs on Chromosome {0}. Does this change lead to a known medical condition, or is it benign?\",\n", " \"A genomic variant on Chromosome {0} is under review. What is the biological outcome - benign or pathogenic?\",\n", " \"Consider a variant on Chromosome {0}. Determine its clinical classification and disease relevance.\",\n", " \"An alteration has been detected on Chromosome {0}. Is it pathogenic, and if so, what disease is involved?\",\n", " \"A mutation on Chromosome {0} is under examination. Does this mutation cause a disorder, or is it a benign change?\",\n", " \"Here's a variant located on Chromosome {0}. What is the predicted biological effect - harmless or disease-causing?\",\n", " \"A genomic change on Chromosome {0} is noted. 
Classify this variant as benign or pathogenic, and name the disease if relevant.\",\n", "]" ] }, { "cell_type": "code", "execution_count": 13, "id": "cb12df8d-4303-4cf5-bb4a-73d1351ab059", "metadata": {}, "outputs": [], "source": [ "task_1 = dataset['train'].to_pandas()" ] }, { "cell_type": "code", "execution_count": null, "id": "8eddbccb-16f7-4a0c-b4f0-49f38a9468e0", "metadata": {}, "outputs": [], "source": [ "task_1['label'] = task_1['label'].apply(lambda x: \"Benign\" if x == \"Common\" else x)\n", "task_1['ID'] = ['Task1_train_' + str(i) for i in range(len(task_1))]\n", "task_1 = task_1[['ID', 'label', 'chromosome', 'ref_forward_sequence', 'alt_forward_sequence',\n", "                 'gene', 'gene_name', 'disease']]\n", "\n", "task_1 = task_1.set_index('ID')\n", "\n", "task_1_train = []\n", "\n", "for count, id in enumerate(task_1.index):\n", "    task_1_train.append({})\n", "    task_1_train[count]['ID'] = id\n", "    if not (task_1.loc[id]['gene'] or task_1.loc[id]['gene_name']):\n", "        task_1_train[count]['question'] = question_variants_50_no_gene[random.randrange(50)].format(task_1.loc[id]['chromosome'])\n", "    else:\n", "        task_1_train[count]['question'] = question_variants_50[random.randrange(50)].format(task_1.loc[id]['chromosome'], task_1.loc[id]['gene'], task_1.loc[id]['gene_name'])\n", "\n", "    if not task_1.loc[id]['disease']:\n", "        task_1_train[count]['answer'] = f\"{task_1.loc[id]['label']}\"\n", "    else:\n", "        task_1_train[count]['answer'] = f\"{task_1.loc[id]['label']}; {task_1.loc[id]['disease']}\"\n", "    task_1_train[count]['reference_sequence'] = task_1.loc[id]['ref_forward_sequence']\n", "    task_1_train[count]['variant_sequence'] = task_1.loc[id]['alt_forward_sequence']\n" ] }, { "cell_type": "code", "execution_count": 15, "id": "1bc9a04d-ec4d-47e7-9ffd-c4e319b62172", "metadata": {}, "outputs": [], "source": [ "task_1 = dataset['test'].to_pandas()" ] }, { "cell_type": "code", "execution_count": 16, "id": "2fc93817-f219-449e-8c57-8e597a2ca494", "metadata": {}, "outputs": 
[], "source": [ "task_1['label'] = task_1['label'].apply(lambda x: \"Benign\" if x == \"Common\" else x)\n", "task_1['ID'] = ['Task1_test_' + str(i) for i in range(len(task_1))]\n", "task_1 = task_1[['ID', 'label', 'chromosome', 'ref_forward_sequence', 'alt_forward_sequence',\n", "                 'gene', 'gene_name', 'disease']]\n", "\n", "task_1 = task_1.set_index('ID')\n", "\n", "task_1_test = []\n", "\n", "for count, id in enumerate(task_1.index):\n", "    task_1_test.append({})\n", "    task_1_test[count]['ID'] = id\n", "    # Parenthesized to match the train split: use the no-gene templates only when\n", "    # both 'gene' and 'gene_name' are missing\n", "    if not (task_1.loc[id]['gene'] or task_1.loc[id]['gene_name']):\n", "        task_1_test[count]['question'] = question_variants_50_no_gene[random.randrange(50)].format(task_1.loc[id]['chromosome'])\n", "    else:\n", "        task_1_test[count]['question'] = question_variants_50[random.randrange(50)].format(task_1.loc[id]['chromosome'], task_1.loc[id]['gene'], task_1.loc[id]['gene_name'])\n", "\n", "    if not task_1.loc[id]['disease']:\n", "        task_1_test[count]['answer'] = f\"{task_1.loc[id]['label']}\"\n", "    else:\n", "        task_1_test[count]['answer'] = f\"{task_1.loc[id]['label']}; {task_1.loc[id]['disease']}\"\n", "    task_1_test[count]['reference_sequence'] = task_1.loc[id]['ref_forward_sequence']\n", "    task_1_test[count]['variant_sequence'] = task_1.loc[id]['alt_forward_sequence']\n" ] }, { "cell_type": "code", "execution_count": 18, "id": "5e71a52a-d030-4ad8-9317-a48a81d788a8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "48850" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(task_1_train)" ] }, { "cell_type": "code", "execution_count": 19, "id": "5c79704d-8ecc-449c-91e5-c9dd03219028", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1233" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(task_1_test)" ] }, { "cell_type": "code", "execution_count": 20, "id": "6bf828d7-0b1c-4b24-afb9-c0127b6f608c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "Here is some context for the variant: It is on Chromosome 8, and affects Gene/s CLN8 (CLN8 transmembrane ER and ERGIC protein). Given this context, what is the biological effect of this variant allele, specifically is the mutation pathogenic or benign? If pathogenic, what disease it will cause?\n" ] } ], "source": [ "print(f\"Here is some context for the variant: It is on Chromosome {task_1.iloc[0]['chromosome']}, and affects Gene/s {task_1.iloc[0]['gene']} ({task_1.iloc[0]['gene_name']}). Given this context, what is the biological effect of this variant allele, specifically is the mutation pathogenic or benign? If pathogenic, what disease it will cause?\")" ] }, { "cell_type": "code", "execution_count": null, "id": "746f4274-768a-4c7a-a878-813e2072ba2b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 21, "id": "2f3b8017-9abf-488e-986f-1d588e24eacf", "metadata": {}, "outputs": [], "source": [ "from datasets import Dataset, DatasetDict\n", "\n", "# Step 1: Create Hugging Face Datasets\n", "train_dataset = Dataset.from_list(task_1_train)\n", "test_dataset = Dataset.from_list(task_1_test)\n", "\n", "# Step 2: Combine into a DatasetDict (to mimic load_dataset)\n", "dataset = DatasetDict({\n", " \"train\": train_dataset,\n", " \"test\": test_dataset\n", "})" ] }, { "cell_type": "code", "execution_count": null, "id": "73077c9f-65d4-451b-879e-f7071029d9f5", "metadata": {}, "outputs": [], "source": [ "dataset.push_to_hub(\n", " \"wanglab/bioR_tasks\",\n", " config_name=\"variant_effect_coding\",\n", " commit_message=\"Upload the finalized Task 1 Variant Effect Coding Data\"\n", ")" ] }, { "cell_type": "markdown", "id": "54a98d27-39d1-47aa-86a3-2a50d85d6df5", "metadata": {}, "source": [ "## Task 2 Variant Effect Causal eQTL" ] }, { "cell_type": "code", "execution_count": null, "id": "68c0a985-09fb-4035-a999-158e743b98d6", "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset\n", "import 
pandas as pd\n", "import json\n", "import random\n", "from pathlib import Path\n", "\n", "# Reuses the CONFIG dictionary defined in the Configuration cell. Redefining it here\n", "# would overwrite 'huggingface_repo' and break the Task 2 loading step further down.\n", "\n", "# Save and optionally upload the Task 1 dataset (the DatasetDict built above)\n", "if CONFIG['save_local']:\n", "    # Save locally first\n", "    output_path = Path(CONFIG['output_dir']) / 'task1_variant_effect_coding'\n", "    output_path.mkdir(parents=True, exist_ok=True)\n", "\n", "    # Save as JSON Lines files\n", "    dataset['train'].to_json(str(output_path / 'train.jsonl'))\n", "    dataset['test'].to_json(str(output_path / 'test.jsonl'))\n", "    print(f\"Task 1 dataset saved locally to: {output_path}\")\n", "\n", "if CONFIG['upload_to_hub']:\n", "    try:\n", "        dataset.push_to_hub(\n", "            CONFIG['huggingface_repo'],\n", "            config_name=\"variant_effect_coding\",\n", "            commit_message=\"Upload Task 1 Variant Effect Coding Data\"\n", "        )\n", "        print(f\"Task 1 dataset uploaded to: {CONFIG['huggingface_repo']}\")\n", "    except Exception as e:\n", "        print(f\"Upload failed: {e}\")\n", "        print(\"Please check your HuggingFace credentials and repository permissions\")\n", "else:\n", "    print(\"Upload to hub disabled. 
Set CONFIG['upload_to_hub'] = True to enable\")" ] }, { "cell_type": "markdown", "id": "b6ff52bb-052a-4a1b-9e28-bbc13ea58594", "metadata": {}, "source": [ "## Task 2: Variant Effect Causal eQTL\n", "\n", "**Objective**: Determine whether genetic variants cause changes in gene expression levels.\n", "\n", "**Data Source**: Expression quantitative trait loci (eQTL) databases\n", "\n", "**Question Types**: 50 different question templates incorporating:\n", "- Chromosome location\n", "- Tissue type context\n", "- Expression change assessment\n", "\n", "**Output Format**: Binary classification (expression change: Yes/No)" ] }, { "cell_type": "code", "execution_count": null, "id": "56f83104-fd46-462a-b6cc-b4a954fcc5bc", "metadata": {}, "outputs": [], "source": [ "dataset = load_dataset(\"wanglab/bioR_tasks\", 'variant_effect_causal_eqtl')\n", "\n", "print(\"Proceeding with Task 2: Causal eQTL Analysis\")" ] }, { "cell_type": "code", "execution_count": null, "id": "5cbe79bc-b10c-4b27-b700-a80af923ce35", "metadata": {}, "outputs": [], "source": [ "question_variants_50_expr = [\n", " \"This variant is isolated from Chromosome {0} from {1} tissue. Does this variant change gene expression?\",\n", " \"This variant originates from Chromosome {0} in {1} tissue. 
Does it alter gene expression?\",\n", " \"Does the variant from Chromosome {0}, isolated in {1} tissue, change gene expression?\",\n", " \"Is there a change in gene expression for the Chromosome {0} variant found in {1} tissue?\",\n", " \"For the variant on Chromosome {0} in {1} tissue, does it affect gene expression levels?\",\n", " \"Does a variant on Chromosome {0} taken from {1} tissue modify gene expression?\",\n", " \"When isolated from Chromosome {0} in {1} tissue, does this variant impact gene expression?\",\n", " \"Can the Chromosome {0} variant from {1} tissue change the expression of genes?\",\n", " \"Is gene expression altered by the variant on Chromosome {0} in {1} tissue?\",\n", " \"Does the mutation on Chromosome {0}, found in {1} tissue, result in different gene expression?\",\n", " \"In {1} tissue, does the Chromosome {0} variant change how genes are expressed?\",\n", " \"For a variant from Chromosome {0} in {1} tissue, is gene expression affected?\",\n", " \"Does the Chromosome {0} alteration from {1} tissue lead to a detectable change in gene expression?\",\n", " \"Will the variant on Chromosome {0} in {1} tissue cause gene expression changes?\",\n", " \"Is there an effect on gene expression from the Chromosome {0} variant in {1} tissue?\",\n", " \"Does the Chromosome {0} variant isolated in {1} tissue influence gene expression?\",\n", " \"In {1} tissue, does the mutation on Chromosome {0} disrupt gene expression?\",\n", " \"Does this Chromosome {0} variant, taken from {1} tissue, shift gene expression patterns?\",\n", " \"Does gene expression differ for the variant on Chromosome {0} found in {1} tissue?\",\n", " \"Is the expression of genes altered by the Chromosome {0} variant in {1} tissue?\",\n", " \"For the variant isolated from Chromosome {0} in {1} tissue, does it change gene expression?\",\n", " \"Does the Chromosome {0}-based variant in {1} tissue have an impact on gene expression?\",\n", " \"Is gene expression modulated by the variant 
on Chromosome {0} in {1} tissue?\",\n", " \"Does the mutation on Chromosome {0} from {1} tissue result in altered gene expression?\",\n", " \"In {1} tissue samples, does the Chromosome {0} variant change gene expression?\",\n", " \"Does the Chromosome {0} alteration observed in {1} tissue affect gene expression?\",\n", " \"Will gene expression be different when the variant is from Chromosome {0} in {1} tissue?\",\n", " \"Does isolating this variant from Chromosome {0} in {1} tissue alter gene expression?\",\n", " \"Does the variant on Chromosome {0} in {1} tissue cause a measurable change in gene expression?\",\n", " \"For Chromosome {0} variants in {1} tissue, does gene expression change?\",\n", " \"Does gene transcription change for the variant on Chromosome {0} isolated from {1} tissue?\",\n", " \"Is transcriptional output altered by the Chromosome {0} variant in {1} tissue?\",\n", " \"Does the Chromosome {0}-derived variant, in {1} tissue, impact gene expression?\",\n", " \"In {1} tissue, does the Chromosome {0} mutation affect expression of genes?\",\n", " \"Does the Chromosome {0} variant from {1} tissue lead to differential gene expression?\",\n", " \"Does changing that locus on Chromosome {0} in {1} tissue alter gene expression?\",\n", " \"Is there a change in transcript levels for the Chromosome {0} variant in {1} tissue?\",\n", " \"Does the variant mapped to Chromosome {0}, in {1} tissue, influence expression levels?\",\n", " \"For the mutation on Chromosome {0} within {1} tissue, does gene expression shift?\",\n", " \"Does gene expression vary when the variant is on Chromosome {0} in {1} tissue?\",\n", " \"Is the expression profile altered by the Chromosome {0} variant in {1} tissue?\",\n", " \"Does the Somatic variant on Chromosome {0} in {1} tissue behave as a gene expression modulator?\",\n", " \"Does the Chromosome {0} variant identified in {1} tissue change gene expression?\",\n", " \"Is there an observable effect on gene expression from the 
Chromosome {0} variant in {1} tissue?\",\n", " \"Does the genetic alteration on Chromosome {0} in {1} tissue modify gene expression?\",\n", " \"Does the Chromosome {0} variant present in {1} tissue alter the level of gene transcripts?\",\n", " \"For the Chromosome {0} mutation in {1} tissue, is there a change in gene expression?\",\n", " \"Does this variant in {1} tissue, located on Chromosome {0}, affect gene expression?\",\n", " \"Is gene expression impacted by this Chromosome {0} variant from {1} tissue?\",\n", " \"Does transcription change for the Chromosome {0} variant in {1} tissue?\",\n", "]\n", "\n", "# Load Task 2 dataset from configured source\n", "try:\n", "    dataset = load_dataset(CONFIG['huggingface_repo'], 'variant_effect_causal_eqtl')\n", "    print(f\"Loaded Task 2 dataset from: {CONFIG['huggingface_repo']}\")\n", "    print(f\"Train samples: {len(dataset['train'])}\")\n", "    print(f\"Test samples: {len(dataset['test'])}\")\n", "except Exception as e:\n", "    print(f\"Error loading Task 2 dataset: {e}\")\n", "    print(\"Please check the repository name and dataset configuration\")\n", "    raise" ] }, { "cell_type": "code", "execution_count": 7, "id": "79f9c1d7-4e90-4075-9e45-3b1b8aa22d4c", "metadata": {}, "outputs": [], "source": [ "task_2 = dataset['train'].to_pandas()" ] }, { "cell_type": "code", "execution_count": 8, "id": "d73eaaf0-0b81-4d55-862e-13464c8b78e9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['ref_forward_sequence', 'alt_forward_sequence', 'tissue', 'chromosome',\n", " 'label'],\n", " dtype='object')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "task_2.columns" ] }, { "cell_type": "code", "execution_count": 9, "id": "6092a8d7-a509-45f5-8a60-eae17ab91235", "metadata": {}, "outputs": [], "source": [ "task_2['ID'] = ['Task2_train_' + str(i) for i in range(len(task_2))]\n", "task_2 = task_2[['ID', 'ref_forward_sequence', 'alt_forward_sequence', 'tissue', 'chromosome', 
'label']]\n", "\n", "task_2 = task_2.set_index('ID')\n", "\n", "task_2_train = []\n", "\n", "# Build one record per variant, pairing it with a randomly chosen question template\n", "for variant_id, row in task_2.iterrows():\n", "    template = question_variants_50_expr[random.randrange(len(question_variants_50_expr))]\n", "    task_2_train.append({\n", "        'ID': variant_id,\n", "        'question': template.format(row['chromosome'], row['tissue']),\n", "        'answer': f\"{row['label']}\",\n", "        'reference_sequence': row['ref_forward_sequence'],\n", "        'variant_sequence': row['alt_forward_sequence'],\n", "    })\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "2ba4c99b-2b0c-4427-9c0d-1df8297f55da", "metadata": {}, "outputs": [], "source": [ "task_2 = dataset['test'].to_pandas()" ] }, { "cell_type": "code", "execution_count": 11, "id": "4b3337ff-ecb1-4f4c-827c-f4d6ac06495f", "metadata": {}, "outputs": [], "source": [ "task_2['ID'] = ['Task2_test_' + str(i) for i in range(len(task_2))]\n", "task_2 = task_2[['ID', 'ref_forward_sequence', 'alt_forward_sequence', 'tissue', 'chromosome', 'label']]\n", "\n", "task_2 = task_2.set_index('ID')\n", "\n", "task_2_test = []\n", "\n", "# Build one record per variant, pairing it with a randomly chosen question template\n", "for variant_id, row in task_2.iterrows():\n", "    template = question_variants_50_expr[random.randrange(len(question_variants_50_expr))]\n", "    task_2_test.append({\n", "        'ID': variant_id,\n", "        'question': template.format(row['chromosome'], row['tissue']),\n", "        'answer': f\"{row['label']}\",\n", "        'reference_sequence': row['ref_forward_sequence'],\n", "        'variant_sequence': row['alt_forward_sequence'],\n", "    })" ] }, { "cell_type": "code", "execution_count": 12, "id": "61508b36-7c43-4c7d-9a03-0e6a34571d9c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "89060" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(task_2_train)" ] }, { "cell_type": "code", "execution_count": 13, "id": "3c845102-664b-4c22-a251-1c19362dae6c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8862" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(task_2_test)" ] }, { "cell_type": "code", "execution_count": 14, "id": "17954549-4b01-4767-b6b0-45ecfce93029", "metadata": {}, "outputs": [], "source": [ "from datasets import Dataset, DatasetDict\n", "\n", "# Step 1: Create Hugging Face Datasets\n", "train_dataset = Dataset.from_list(task_2_train)\n", "test_dataset = Dataset.from_list(task_2_test)\n", "\n", "# Step 2: Combine into a DatasetDict (to mimic load_dataset)\n", "dataset = DatasetDict({\n", "    \"train\": train_dataset,\n", "    \"test\": test_dataset\n", "})" ] }, { "cell_type": "code", "execution_count": null, "id": "a1bbd880-1ece-4d86-a98a-d10c8b042ff7", "metadata": {}, "outputs": [], "source": [ "dataset.push_to_hub(\n", "    \"wanglab/bioR_tasks\",\n", "    config_name=\"variant_effect_causal_eqtl\",\n", "    commit_message=\"Upload the finalized Task 2 Variant Effect Causal eQTL\"\n", ")" ] }, { "cell_type": "markdown", "id": "b8631227-26ab-4dd4-906a-a499816a67ff", "metadata": {}, "source": [ "## Task 3 Variant Effect Pathogenic OMIM" ] }, { "cell_type": "code", "execution_count": null, "id": "38b3b233-94b5-4b3e-a30c-69760534ba41", "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset\n", "import pandas as pd\n", "import json\n", "import random\n", "from pathlib import Path\n", "\n", "# CONFIG dictionary to simulate the configuration settings\n", "CONFIG = {\n", "    'save_local': True,\n", "    'output_dir': './data',\n", "    'upload_to_hub': False,\n", "    'huggingface_repo': 'username/repo_name'\n", "}\n", "\n", "# Load your dataset here\n", "# dataset = load_dataset('your_dataset_name')\n", "\n", "# Save and optionally upload Task 2 dataset\n", "if CONFIG['save_local']:\n", "    # Save locally first; create missing parent directories as well\n", "    output_path = Path(CONFIG['output_dir']) / 'task2_variant_effect_causal_eqtl'\n", "    output_path.mkdir(parents=True, exist_ok=True)\n", "\n", "    # Save as JSON Lines files\n", "    dataset['train'].to_json(output_path / 'train.jsonl')\n", "    dataset['test'].to_json(output_path / 'test.jsonl')\n", "    print(f\"Task 2 dataset saved locally to: {output_path}\")\n", "\n", "if CONFIG['upload_to_hub']:\n", "    try:\n", "        dataset.push_to_hub(\n", "            CONFIG['huggingface_repo'],\n", "            config_name=\"variant_effect_causal_eqtl\",\n", "            commit_message=\"Upload Task 2 Variant Effect Causal eQTL Data\"\n", "        )\n", "        print(f\"Task 2 dataset uploaded to: {CONFIG['huggingface_repo']}\")\n", "    except Exception as e:\n", "        print(f\"Upload failed: {e}\")\n", "else:\n", "    print(\"Upload to hub disabled. Set CONFIG['upload_to_hub'] = True to enable\")" ] }, { "cell_type": "code", "execution_count": null, "id": "159949b3-3115-4937-8561-de44b4b18dbe", "metadata": {}, "outputs": [], "source": [ "dataset = load_dataset(\"wanglab/bioR_tasks\", 'variant_effect_pathogenic_omim')\n", "\n", "# Task 3: Variant Effect Pathogenic OMIM\n", "#\n", "# Objective: Classify variants as pathogenic or benign using the OMIM (Online Mendelian Inheritance in Man) database.\n", "#\n", "# Data Source: OMIM database with genetic disorder associations\n", "#\n", "# Question Types: 50 different question templates focusing on:\n", "# - Chromosome location\n", "# - Pathogenicity assessment\n", "# - Clinical significance\n", "#\n", "# Output Format: Binary classification (Pathogenic/Benign)\n", "#\n", "# Note: This task uses test-only data for evaluation purposes." ] }, { "cell_type": "code", "execution_count": null, "id": "2f89792c-a4ed-4e1b-bdec-aeb8ff938812", "metadata": {}, "outputs": [], "source": [ "print(\"Proceeding with Task 3: OMIM Pathogenic Classification\")" ] }, { "cell_type": "code", "execution_count": null, "id": "835a70b9-ea76-4d9e-a1ae-1a6c3c97aefc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | ref_forward_sequence | alt_forward_sequence | chromosome | label |
|---|---|---|---|---|
| 0 | CTCAGAGATTCTGTACATGTTCTTCCTCCTGCCTAGAAAGGATCGT... | CTCAGAGATTCTGTACATGTTCTTCCTCCTGCCTAGAAAGGATCGT... | 1 | Common |
| 1 | CCTATGGATTGCATCATTATTACCTAAAAAGTCTATTCTCAAATGC... | CCTATGGATTGCATCATTATTACCTAAAAAGTCTATTCTCAAATGC... | 1 | Common |
| 2 | CTCGGCCCCCAGGCCTGCGTTCAGTGAGGCCTCCCGTGGCGTCAGC... | CTCGGCCCCCAGGCCTGCGTTCAGTGAGGCCTCCCGTGGCGTCAGC... | 1 | Common |
| 3 | TGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCT... | TGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCT... | 1 | Common |
| 4 | GAACCCCACGGACATGGACCCCACACTGGAGGACCCCACCGCGCCC... | GAACCCCACGGACATGGACCCCACACTGGAGGACCCCACCGCGCCC... | 1 | Common |
| ... | ... | ... | ... | ... |
| 2321468 | CAACAAGCATTTAAAAAGATGCTCAACTTATTAGAAATAAAAATAA... | CAACAAGCATTTAAAAAGATGCTCAACTTATTAGAAATAAAAATAA... | X | Pathogenic |
| 2321469 | ACAAGCATTTAAAAAGATGCTCAACTTATTAGAAATAAAAATAACA... | ACAAGCATTTAAAAAGATGCTCAACTTATTAGAAATAAAAATAACA... | X | Pathogenic |
| 2321470 | ATAAAAATAACACAATGAAAGTAATGATGCAATATCACTCTACACC... | ATAAAAATAACACAATGAAAGTAATGATGCAATATCACTCTACACC... | X | Pathogenic |
| 2321471 | AAAAATAACACAATGAAAGTAATGATGCAATATCACTCTACACCTG... | AAAAATAACACAATGAAAGTAATGATGCAATATCACTCTACACCTG... | X | Pathogenic |
| 2321472 | GGTTCAGAAACCTGACTAAAGTTTGGTCAAACAGAGAATCTGTGTC... | GGTTCAGAAACCTGACTAAAGTTTGGTCAAACAGAGAATCTGTGTC... | Y | Pathogenic |

2321473 rows × 4 columns

| | mutation_instruction | original_window | mutated_window | pathogenicity | disease_name | variant_type |
|---|---|---|---|---|---|---|
| 0 | AG>A | AAGGTGCTTAGGACAAAGAAGGCGATTGACATCTTTCAGGTAAAAC... | AAGGTGCTTAGGACAAAGAAGGCGATTGACATCTTTCAGGTAAAAC... | not_pathogenic | Retinitis_pigmentosa | non_SNV |
| 1 | A>G | CATATTTAAGGTCTATTCTAAATTGCACACTTTGATTCAAAAGAAA... | CATATTTAAGGTCTATTCTAAATTGCACACTTTGATTCAAAAGAAA... | not_pathogenic | NA | SNV |
| 2 | T>G | TCCACTATTAGACTTCTCTTTATTCTTAAAAATATTTAAGATCACT... | TCCACTATTAGACTTCTCTTTATTCTTAAAAATATTTAAGATCACT... | not_pathogenic | NA | SNV |
| 3 | G>A | GATTCAGAGTAGTAAAGAGAAAAGTGGAATTTCCAAGCACTATGAA... | GATTCAGAGTAGTAAAGAGAAAAGTGGAATTTCCAAGCACTATGAA... | not_pathogenic | NA | SNV |
| 4 | C>G | CACTTCTCTCTTTTACATCTTACTTGCCCATTAACTCTTATACCTA... | CACTTCTCTCTTTTACATCTTACTTGCCCATTAACTCTTATACCTA... | not_pathogenic | NA | SNV |
| ... | ... | ... | ... | ... | ... | ... |
| 3493395 | CAA>C | CTACTCCTAATCACATAACCTATTCCCCCGAGCAATCTCAATTACA... | CTACTCCTAATCACATAACCTATTCCCCCGAGCAATCTCAATTACA... | not_pathogenic | Mitochondrial_inheritance | non_SNV |
| 3493396 | C>T | CAATATATACACCAACAAACAATGTTCAACCAGTAACTACTACTAA... | CAATATATACACCAACAAACAATGTTCAACCAGTAACTACTACTAA... | not_pathogenic | Venous_thromboembolism | SNV |
| 3493397 | A>G | TACACCAACAAACAATGTTCAACCAGTAACTACTACTAATCAACGC... | TACACCAACAAACAATGTTCAACCAGTAACTACTACTAATCAACGC... | not_pathogenic | MERRF_syndrome\|Mitochondrial_inheritance | SNV |
| 3493398 | G>A | GCCCATAATCATACAAAGCCCCCGCACCAATAGGATCCTCCCGAAT... | GCCCATAATCATACAAAGCCCCCGCACCAATAGGATCCTCCCGAAT... | not_pathogenic | MERRF_syndrome\|Mitochondrial_inheritance | SNV |
| 3493399 | G>A | TCAACCCTGACCCCTCTCCTTCATAAATTATTCAGCTTCCTACACT... | TCAACCCTGACCCCTCTCCTTCATAAATTATTCAGCTTCCTACACT... | not_pathogenic | MERRF_syndrome\|Mitochondrial_inheritance | SNV |

3493400 rows × 6 columns