yuccaaa's picture
Add files using upload-large-folder tool
349aa7a verified

BioReasoning Data Curation

Jupyter notebooks for processing genetic variant data and creating ML datasets for biological reasoning tasks.

Notebooks

Core Analysis

  • BioReasoning_DataCuration_KEGG.ipynb - KEGG pathway analysis with Claude API
  • Clinvar_Coding.ipynb - ClinVar variant processing and gene mapping
  • Clinvar_SNV_Non_SNV.ipynb - SNV/structural variant datasets with VEP annotations

KEGG Pipeline

  • KEGG_Data_1.ipynb - KEGG network data processing and variant identification
  • KEGG_Data_2.ipynb - Variant parsing and sequence generation
  • KEGG_Data_3.ipynb - Final ML dataset creation with Q&A pairs

Variant Prediction

  • VEP.ipynb - Variant effect prediction datasets (ClinVar, OMIM, eQTL)

Setup

brew install brewsci/bio/edirect  # For ClinVar (macOS)
export ANTHROPIC_API_KEY="your-key"  # For KEGG analysis

Usage

Each notebook has a configuration section - update paths/keys as needed, then run sequentially.

Key Outputs:

  • KEGG biological reasoning datasets
  • ClinVar variant-disease associations
  • VEP prediction task datasets
  • Genomic sequences with variant context