
data discovery & verification protocol

purpose

Establish a rigorous, reproducible process for exploring, verifying, and documenting external data sources (Hugging Face Datasets, BIDS repositories, etc.) before they are integrated into the production codebase. This prevents "schema guessing" and ensures that strict type definitions match the data as it actually exists.

principles

  1. No Assumptions: Never assume column names, file formats, or data types. Verify them programmatically.
  2. Isolation: Discovery scripts must be kept separate from production code, and their outputs must be kept out of source control.
  3. Reproducibility: The discovery process must be scriptable and reproducible, not a series of manual CLI commands.

standard locations

scripts

All discovery logic resides in:

scripts/discovery/
├── __init__.py
├── inspect_hf_dataset.py   # e.g., Generic HF inspector
├── verify_bids_layout.py   # e.g., BIDS validator
└── ...

data & artifacts

All downloaded samples, temporary outputs, and schema reports reside in:

data/
├── isles24/             # Extracted ISLES24 data (IGNORED)
└── discovery/           # Schema reports, samples (IGNORED)

discovery workflow

1. implementation

Write a focused script in scripts/discovery/ that:

  • Connects to the data source (e.g., HF Hub).
  • Fetches metadata or a minimal sample (streaming mode preferred).
  • Prints/Logs:
    • Feature keys (column names).
    • Data types (Arrow types, Python types).
    • Non-null counts (if feasible).
    • A sample row structure.
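The reporting step above can be sketched as follows, assuming the source yields rows as plain dictionaries (as a streamed Hugging Face dataset does when iterated). The function name `summarize_rows`, the report layout, and the demo column names are hypothetical and not taken from the existing codebase. The sketch reports Python types only; Arrow-level types would have to come from the dataset's own feature metadata, which it does not read.

```python
from collections import Counter
from itertools import islice
from typing import Any, Iterable, Mapping


def summarize_rows(rows: Iterable[Mapping[str, Any]], limit: int = 5) -> dict:
    """Inspect at most `limit` rows and report keys, Python types, and non-null counts."""
    types: dict[str, Counter] = {}
    non_null: Counter = Counter()
    sample = None
    seen = 0
    for row in islice(rows, limit):
        seen += 1
        if sample is None:
            sample = dict(row)  # keep the first row as the structural sample
        for key, value in row.items():
            types.setdefault(key, Counter())[type(value).__name__] += 1
            if value is not None:
                non_null[key] += 1
    return {
        "rows_inspected": seen,
        "columns": sorted(types),
        "types": {k: dict(v) for k, v in types.items()},
        "non_null": {k: non_null[k] for k in types},
        "sample": sample,
    }


if __name__ == "__main__":
    # Stand-in rows with made-up column names. With the `datasets` library
    # installed, one could instead pass something like
    #   load_dataset("<org>/<dataset>", split="train", streaming=True)
    # where the dataset ID is a placeholder, not the project's actual source.
    demo = [
        {"subject_id": "sub-001", "dwi": "sub-001_dwi.nii.gz", "adc": None},
        {"subject_id": "sub-002", "dwi": "sub-002_dwi.nii.gz", "adc": "sub-002_adc.nii.gz"},
    ]
    report = summarize_rows(demo)
    for name in report["columns"]:
        print(name, report["types"][name], "non-null:", report["non_null"][name])
    print("sample:", report["sample"])
```

Capping the number of rows with `islice` keeps the script cheap on a streamed source, since only the first few rows are ever pulled.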

2. execution

Run the script from the project root:

mkdir -p data/discovery  # the ignored output directory may not exist on a fresh clone
uv run scripts/discovery/inspect_hf_dataset.py > data/discovery/schema_report.txt

3. verification

Manually review data/discovery/schema_report.txt.

  • Check: Do column names match CaseAdapter expectations?
  • Check: Are file paths strings or objects?
  • Check: Are required fields (DWI, ADC) actually present?
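Some of these checks can be backed by a small helper run against a sample row, as a complement to the manual review rather than a replacement for it. This is a minimal sketch: `check_row` is a hypothetical name, and the field names passed to it are whatever the report and the adapter actually use.

```python
from typing import Any, Mapping, Sequence


def check_row(row: Mapping[str, Any], required: Sequence[str]) -> list[str]:
    """Return human-readable problems found in one sample row."""
    problems = []
    for field in required:
        if field not in row:
            problems.append(f"missing column: {field}")
        elif row[field] is None:
            problems.append(f"null value: {field}")
        elif not isinstance(row[field], str):
            # A non-string value means the path is wrapped in some object.
            problems.append(f"not a plain string: {field} ({type(row[field]).__name__})")
    return problems
```

For example, `check_row(sample_row, ["dwi", "adc"])` (with illustrative field names) returns an empty list when both fields are present as non-null strings, and one message per problem otherwise.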

4. remediation

If the report contradicts the code/specs:

  1. Update the spec (docs/specs/) to reflect reality.
  2. Update the code (src/.../adapter.py) to handle the actual schema.
  3. Add a regression test if the edge case is complex.
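As an illustration of step 3, suppose the report showed that a file path can arrive either as a plain string or wrapped in a dictionary. A regression test for that case might look like the sketch below. `resolve_path` is a hypothetical helper written for this example, not a function from the project's adapter, and the dictionary shape with a "path" key is an assumption.

```python
from typing import Any


def resolve_path(value: Any) -> str:
    """Hypothetical helper: accept a path given as a string or as a dict with a "path" key."""
    if isinstance(value, str):
        return value
    if isinstance(value, dict) and isinstance(value.get("path"), str):
        return value["path"]
    raise TypeError(f"unsupported path value: {type(value).__name__}")


def test_resolve_path_accepts_both_shapes() -> None:
    assert resolve_path("sub-001_dwi.nii.gz") == "sub-001_dwi.nii.gz"
    assert resolve_path({"path": "sub-001_dwi.nii.gz", "bytes": None}) == "sub-001_dwi.nii.gz"


def test_resolve_path_rejects_unknown_shape() -> None:
    try:
        resolve_path(42)
    except TypeError:
        return
    raise AssertionError("expected TypeError")
```

The test functions use bare assertions, so they run under pytest or by calling them directly.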

git configuration

Ensure .gitignore includes:

data/isles24/
data/discovery/
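
A quick way to confirm the entries are in place is a small check like the one below. This is a sketch with a hypothetical function name; it only looks for exact line matches, so a broader pattern such as `data/` would be reported as missing even though it covers both directories. `git check-ignore` remains the authoritative test.

```python
from pathlib import Path

REQUIRED_IGNORES = ("data/isles24/", "data/discovery/")


def missing_ignores(gitignore_text: str, required=REQUIRED_IGNORES) -> list[str]:
    """Return required entries that do not appear as exact lines in the .gitignore text."""
    present = {line.strip() for line in gitignore_text.splitlines()}
    return [entry for entry in required if entry not in present]


if __name__ == "__main__":
    path = Path(".gitignore")
    text = path.read_text() if path.exists() else ""
    missing = missing_ignores(text)
    print("missing entries:", missing if missing else "none")
```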