# BioReasoning Data Curation

Jupyter notebooks for processing genetic variant data and creating ML datasets for biological reasoning tasks.

## Notebooks

**Core Analysis**

- `BioReasoning_DataCuration_KEGG.ipynb` - KEGG pathway analysis with Claude API
- `Clinvar_Coding.ipynb` - ClinVar variant processing and gene mapping
- `Clinvar_SNV_Non_SNV.ipynb` - SNV/structural variant datasets with VEP annotations

**KEGG Pipeline**

- `KEGG_Data_1.ipynb` - KEGG network data processing and variant identification
- `KEGG_Data_2.ipynb` - Variant parsing and sequence generation
- `KEGG_Data_3.ipynb` - Final ML dataset creation with Q&A pairs

**Variant Prediction**

- `VEP.ipynb` - Variant effect prediction datasets (ClinVar, OMIM, eQTL)

## Setup

```bash
brew install brewsci/bio/edirect     # For ClinVar (macOS)
export ANTHROPIC_API_KEY="your-key"  # For KEGG analysis
```
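
Before opening the notebooks, it can help to confirm that both prerequisites from the setup step are in place. The following is a minimal sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, and it assumes the EDirect install puts the `esearch` and `efetch` commands on your `PATH`.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of problems found; an empty list means setup looks complete.

    `env` and `which` are injectable so the check can be exercised without
    touching the real environment (hypothetical helper, not from the repo).
    """
    env = os.environ if env is None else env
    problems = []

    # EDirect ships command-line tools such as esearch and efetch.
    for tool in ("esearch", "efetch"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH (install EDirect)")

    # The KEGG analysis notebook expects an Anthropic API key in the environment.
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```

Running the script prints one line per missing prerequisite and nothing when everything is found.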

## Usage

Each notebook has a configuration section - update paths/keys as needed, then run sequentially.
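
If you prefer to run the numbered KEGG pipeline notebooks non-interactively, one option is to execute them in order with `jupyter nbconvert`. This is a sketch under assumptions: it takes the numbering (`KEGG_Data_1` to `KEGG_Data_3`) to be the intended execution order, assumes the configuration sections have already been edited, and uses hypothetical helper names (`nbconvert_command`, `run_pipeline`) that are not part of the repository.

```python
import subprocess

# Assumed execution order, inferred from the notebook numbering.
KEGG_PIPELINE = ["KEGG_Data_1.ipynb", "KEGG_Data_2.ipynb", "KEGG_Data_3.ipynb"]


def nbconvert_command(notebook):
    """Build the nbconvert command that executes a notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        notebook,
    ]


def run_pipeline(notebooks=KEGG_PIPELINE, runner=subprocess.run):
    """Execute each notebook in order, stopping at the first failure."""
    for notebook in notebooks:
        # check=True raises CalledProcessError if a notebook fails to execute.
        runner(nbconvert_command(notebook), check=True)


if __name__ == "__main__":
    run_pipeline()
```

Stopping at the first failure matters here because each later notebook presumably depends on files written by the earlier ones.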

**Key Outputs:**

- KEGG biological reasoning datasets
- ClinVar variant-disease associations
- VEP prediction task datasets
- Genomic sequences with variant context