# data discovery & verification protocol
## purpose
To establish a rigorous, reproducible process for exploring, verifying, and documenting external data sources (Hugging Face Datasets, BIDS repos, etc.) before integrating them into the production codebase. This prevents "schema guessing" and ensures that strict type definitions match the data as it actually exists.
## principles
1. **No Assumptions**: Never assume column names, file formats, or data types. Verify them programmatically.
2. **Isolation**: Discovery scripts must live apart from production code, and their outputs (downloaded samples, reports) must stay out of source control.
3. **Reproducibility**: The discovery process must be scriptable and reproducible, not a series of manual CLI commands.
## standard locations
### scripts
All discovery logic resides in:
```
scripts/discovery/
├── __init__.py
├── inspect_hf_dataset.py   # e.g., Generic HF inspector
├── verify_bids_layout.py   # e.g., BIDS validator
└── ...
```
### data & artifacts
All downloaded samples, temporary outputs, and schema reports reside in:
```
data/
├── isles24/     # Extracted ISLES24 data (IGNORED)
└── discovery/   # Schema reports, samples (IGNORED)
```
## discovery workflow
### 1. implementation
Write a focused script in `scripts/discovery/` that:
- Connects to the data source (e.g., HF Hub).
- Fetches *metadata* or a *minimal sample* (streaming mode preferred).
- Prints/Logs:
  - Feature keys (column names).
  - Data types (Arrow types, Python types).
  - Non-null counts (if feasible).
  - A sample row structure.
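The script outlined above could be sketched as follows. This is a minimal illustration, not the project's actual `inspect_hf_dataset.py`: the helper names `summarize_rows` and `inspect_hf_dataset` are hypothetical, and the second function assumes the third-party `datasets` library is installed. The summary logic works on any iterable of dict-like rows, so it can be exercised without network access.

```python
from collections import Counter
from itertools import islice
from typing import Any, Iterable, Mapping


def summarize_rows(rows: Iterable[Mapping[str, Any]], limit: int = 5) -> dict:
    """Summarize the first `limit` rows: keys, Python types, non-null counts, one sample."""
    types = {}            # column name -> set of observed Python type names
    non_null = Counter()  # column name -> number of non-None values
    sample = None
    seen = 0
    for row in islice(rows, limit):
        seen += 1
        if sample is None:
            sample = dict(row)  # keep the first row as the structural sample
        for key, value in row.items():
            types.setdefault(key, set()).add(type(value).__name__)
            if value is not None:
                non_null[key] += 1
    return {
        "rows_inspected": seen,
        "columns": sorted(types),
        "python_types": {k: sorted(v) for k, v in types.items()},
        "non_null": {k: non_null[k] for k in types},
        "sample_row": sample,
    }


def inspect_hf_dataset(repo_id: str, split: str = "train", limit: int = 5) -> dict:
    """Stream a few rows from the Hub (hypothetical wrapper; needs network access)."""
    from datasets import load_dataset  # third-party, imported lazily

    dataset = load_dataset(repo_id, split=split, streaming=True)
    summary = summarize_rows(dataset, limit=limit)
    # Declared features, if the dataset exposes them; may be None when streaming.
    summary["declared_features"] = str(dataset.features)
    return summary
```

Serializing the returned dictionary (for example with `json.dumps(..., indent=2, default=str)`) and printing it gives output that can be redirected into a report file.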
### 2. execution
Run the script from the project root, creating the output directory first if it does not exist:
```bash
mkdir -p data/discovery
uv run scripts/discovery/inspect_hf_dataset.py > data/discovery/schema_report.txt
```
### 3. verification
Manually review `data/discovery/schema_report.txt`.
- **Check**: Do column names match `CaseAdapter` expectations?
- **Check**: Are file paths strings or objects?
- **Check**: Are required fields (DWI, ADC) actually present?
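Parts of this review can be backed by a small check run against a sample row. The sketch below is illustrative only: the function `check_sample_row` and the lowercase column names `dwi` and `adc` used in the usage note are assumptions, not the real `CaseAdapter` contract.

```python
from typing import Any, Mapping


def check_sample_row(row: Mapping[str, Any], required: tuple) -> list:
    """Return a list of human-readable problems found in one sample row."""
    problems = []
    for name in required:
        if name not in row:
            problems.append(f"missing required column: {name}")
        elif row[name] is None:
            problems.append(f"required column is null: {name}")
        elif not isinstance(row[name], str):
            # Flag objects (dicts, bytes, decoded files) where a path string was expected.
            problems.append(f"{name} is {type(row[name]).__name__}, not a path string")
    return problems
```

Calling `check_sample_row(sample_row, ("dwi", "adc"))` returns an empty list when every required column is present and holds a string. Any other result is a prompt to look at the report more closely, not a replacement for the manual review.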
### 4. remediation
If the report contradicts the code/specs:
1. Update the spec (`docs/specs/`) to reflect reality.
2. Update the code (`src/.../adapter.py`) to handle the actual schema.
3. Add a regression test if the edge case is complex.
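A regression test for such an edge case might look like the following. Everything here is hypothetical: `resolve_path` stands in for whatever the adapter actually does, and the assumed edge case (a file field that arrives either as a plain string or as a dict with a `path` key) is only an example of what discovery might reveal.

```python
from typing import Any


def resolve_path(field: Any) -> str:
    """Hypothetical adapter helper: accept a path string or a {'path': ...} dict."""
    if isinstance(field, str):
        return field
    if isinstance(field, dict) and isinstance(field.get("path"), str):
        return field["path"]
    raise TypeError(f"unsupported file field: {type(field).__name__}")


def test_resolve_path_accepts_both_observed_shapes() -> None:
    # Pin both shapes so a future schema change fails loudly.
    assert resolve_path("sub-01_dwi.nii.gz") == "sub-01_dwi.nii.gz"
    assert resolve_path({"path": "sub-01_dwi.nii.gz"}) == "sub-01_dwi.nii.gz"


def test_resolve_path_rejects_unknown_shapes() -> None:
    try:
        resolve_path(42)
    except TypeError:
        return
    raise AssertionError("expected TypeError for an unsupported field type")
```

The test functions follow the `test_*` naming convention, so a runner such as pytest would collect them; they can also be called directly.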
## git configuration
Ensure `.gitignore` includes:
```gitignore
data/isles24/
data/discovery/
```
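Whether these entries are in place can also be checked programmatically. A minimal sketch, assuming a plain-text `.gitignore` with one pattern per line: `missing_ignore_entries` is a hypothetical helper that does exact-line matching only, so it will not recognise broader patterns (such as `data/`) that happen to cover the same paths.

```python
REQUIRED_ENTRIES = ("data/isles24/", "data/discovery/")


def missing_ignore_entries(gitignore_text: str, required=REQUIRED_ENTRIES) -> list:
    """Return required patterns not present as exact lines in the .gitignore text."""
    present = {
        line.strip()
        for line in gitignore_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")  # skip blanks and comments
    }
    return [entry for entry in required if entry not in present]
```

For an authoritative answer that honours all of git's pattern rules, `git check-ignore -v <path>` reports which rule, if any, ignores a given path.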