data discovery & verification protocol
purpose
To establish a rigorous, reproducible process for exploring, verifying, and documenting external data sources (Hugging Face Datasets, BIDS repos, etc.) before integrating them into the production codebase. This prevents "schema guessing" and ensures strict typing aligns with reality.
principles
- No Assumptions: Never assume column names, file formats, or data types. Verify them programmatically.
- Isolation: Discovery scripts and their outputs must be isolated from production code and source control.
- Reproducibility: The discovery process must be scriptable and reproducible, not a series of manual CLI commands.
standard locations
scripts
All discovery logic resides in:
scripts/discovery/
βββ __init__.py
βββ inspect_hf_dataset.py # e.g., Generic HF inspector
βββ verify_bids_layout.py # e.g., BIDS validator
βββ ...
data & artifacts
All downloaded samples, temporary outputs, and schema reports reside in:
data/
βββ isles24/ # Extracted ISLES24 data (IGNORED)
βββ discovery/ # Schema reports, samples (IGNORED)
discovery workflow
1. implementation
Write a focused script in scripts/discovery/ that:
- Connects to the data source (e.g., HF Hub).
- Fetches metadata or a minimal sample (streaming mode preferred).
- Prints/Logs:
- Feature keys (column names).
- Data types (Arrow types, Python types).
- Non-null counts (if feasible).
- A sample row structure.
2. execution
Run the script from the project root:
uv run scripts/discovery/inspect_hf_dataset.py > data/discovery/schema_report.txt
3. verification
Manually review data/discovery/schema_report.txt.
- Check: Do column names match
CaseAdapterexpectations? - Check: Are file paths strings or objects?
- Check: Are required fields (DWI, ADC) actually present?
4. remediation
If the report contradicts the code/specs:
- Update the spec (
docs/specs/) to reflect reality. - Update the code (
src/.../adapter.py) to handle the actual schema. - Add a regression test if the edge case is complex.
git configuration
Ensure .gitignore includes:
data/isles24/
data/discovery/