File size: 1,149 Bytes
349aa7a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# BioReasoning Data Curation

Jupyter notebooks for processing genetic variant data and creating ML datasets for biological reasoning tasks.

## Notebooks

**Core Analysis**
- `BioReasoning_DataCuration_KEGG.ipynb` - KEGG pathway analysis with Claude API
- `Clinvar_Coding.ipynb` - ClinVar variant processing and gene mapping
- `Clinvar_SNV_Non_SNV.ipynb` - SNV/structural variant datasets with VEP annotations

**KEGG Pipeline**  
- `KEGG_Data_1.ipynb` - KEGG network data processing and variant identification
- `KEGG_Data_2.ipynb` - Variant parsing and sequence generation
- `KEGG_Data_3.ipynb` - Final ML dataset creation with Q&A pairs

**Variant Prediction**
- `VEP.ipynb` - Variant effect prediction datasets (ClinVar, OMIM, eQTL)

## Setup

```bash
brew install brewsci/bio/edirect  # For ClinVar (macOS)
export ANTHROPIC_API_KEY="your-key"  # For KEGG analysis
```

## Usage

Each notebook has a configuration section - update paths/keys as needed, then run sequentially.

**Key Outputs:**
- KEGG biological reasoning datasets
- ClinVar variant-disease associations  
- VEP prediction task datasets
- Genomic sequences with variant context