nas / BioReason /data /README.md
yuccaaa's picture
Add files using upload-large-folder tool
acbfbc3 verified
# BioReasoning Data Curation
Jupyter notebooks for processing genetic variant data and creating ML datasets for biological reasoning tasks.
## Notebooks
**Core Analysis**
- `BioReasoning_DataCuration_KEGG.ipynb` - KEGG pathway analysis with Claude API
- `Clinvar_Coding.ipynb` - ClinVar variant processing and gene mapping
- `Clinvar_SNV_Non_SNV.ipynb` - SNV/structural variant datasets with VEP annotations
**KEGG Pipeline**
- `KEGG_Data_1.ipynb` - KEGG network data processing and variant identification
- `KEGG_Data_2.ipynb` - Variant parsing and sequence generation
- `KEGG_Data_3.ipynb` - Final ML dataset creation with Q&A pairs
**Variant Prediction**
- `VEP.ipynb` - Variant effect prediction datasets (ClinVar, OMIM, eQTL)
## Setup
```bash
brew install brewsci/bio/edirect # For ClinVar (macOS)
export ANTHROPIC_API_KEY="your-key" # For KEGG analysis
```
## Usage
Each notebook has a configuration section - update paths/keys as needed, then run sequentially.
**Key Outputs:**
- KEGG biological reasoning datasets
- ClinVar variant-disease associations
- VEP prediction task datasets
- Genomic sequences with variant context